Problem using bound function in Author Topic model!! #1589

Closed
sa-matiny opened this issue Sep 16, 2017 · 9 comments · Fixed by #2133
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@sa-matiny

sa-matiny commented Sep 16, 2017

Hi,
I am trying to use the author-topic model. When I called the bound() function to evaluate my model, I received this error in Python:

Traceback (most recent call last):
  File "C:\Users\Sara\Desktop\E-COM\Final Project\Working on data\at_clustering.py", line 161, in <module>
    test_corpus = test_corpus, test_d2a = test_doc2author, test_a2d = test_author2doc, limit = 25)
  File "C:\Users\Sara\Desktop\E-COM\Final Project\Working on data\at_clustering.py", line 70, in evaluate_k
    pr = np.exp2(-model.bound(test_corpus, doc2author=test_d2a, author2doc=test_a2d)/number_of_words)
  File "C:\Python27\lib\site-packages\gensim\models\atmodel.py", line 835, in bound
    phinorm = self.compute_phinorm(ids, authors_d, expElogtheta[authors_d, :], expElogbeta[:, ids])
IndexError: arrays used as indices must be of integer (or boolean) type

I think the problem is expElogbeta[:, ids], because this indexing does not seem to be valid in Python!

Please test the bound() function and fix the source code.
Thanks

@menshikh-iv
Contributor

Hi @sa-matiny,
expElogbeta[:, ids] is valid indexing for NumPy arrays.
Please add more information:

  • OS/Python/Gensim/NumPy/SciPy versions
  • A code example (for reproducing the exception)
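
For readers following along, the indexing expression under discussion can be checked in isolation. This is only a minimal sketch: the array shape, its contents, and the `ids` values are made up for illustration and do not come from gensim's internals. It suggests that selecting columns with a list of integer word ids is ordinary NumPy behaviour, so the failure most likely depends on what `ids` contains at runtime.

```python
import numpy as np

# Hypothetical stand-in for the topic-word matrix: 3 topics, 5 vocabulary terms.
num_topics, num_terms = 3, 5
expElogbeta = np.arange(num_topics * num_terms, dtype=float).reshape(num_topics, num_terms)

# Integer word ids, as they would appear in a non-empty bag-of-words document.
ids = [0, 2, 4]

# Select the columns for those word ids; every topic row is kept.
selected = expElogbeta[:, ids]
print(selected.shape)
```

With integer ids the result has one column per word id, which is the shape the rest of the computation expects.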

@menshikh-iv
Contributor

Ping @sa-matiny

@menshikh-iv menshikh-iv added the need info Not enough information for reproduce an issue, need more info from author label Oct 2, 2017
@sa-matiny
Author

Hi,
My system info is Windows 10 / Python 2.7.6 / NumPy 1.13.1 / SciPy 0.19.1.
I just set the eval_every parameter of the author-topic model to True:

train_corpus = corpora.MmCorpus("tweet_preprocessing/train_corpus.mm")
train_dictionary = Dictionary.load("tweet_preprocessing/train_dic.dict")
train_tweets = pd.read_json("tweet_preprocessing/train_tweets.txt")
train_doc2author = dict(zip(train_tweets.index, train_tweets.user_ids))
n_topics = 10
model = AuthorTopicModel(corpus=train_corpus, num_topics=n_topics, id2word=train_dictionary,
                                doc2author=d2a, chunksize=2000, eval_every=True, iterations=10 , serialized=True,
                                serialization_path="clustering_temp2/train_corpus.mm")

When the bound() function is called, I get this error:

Traceback (most recent call last):
  File "C:\Users\Sara\Desktop\E-COM\Final Project\Working on data\clustering.py", line 77, in <module>
    n_topics = num_top, d2a = train_doc2author, s_path = "clustering_temp2/train_corpus.mm")
  File "C:\Python27\lib\site-packages\gensim\models\atmodel.py", line 288, in __init__
    self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy)
  File "C:\Python27\lib\site-packages\gensim\models\atmodel.py", line 711, in update
    self.log_perplexity(chunk, chunk_doc_idx, total_docs=lencorpus)
  File "C:\Python27\lib\site-packages\gensim\models\atmodel.py", line 493, in log_perplexity
    perwordbound = self.bound(chunk, chunk_doc_idx, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
  File "C:\Python27\lib\site-packages\gensim\models\atmodel.py", line 835, in bound
    phinorm = self.compute_phinorm(ids, authors_d, expElogtheta[authors_d, :], expElogbeta[:, ids])
IndexError: arrays used as indices must be of integer (or boolean) type

@menshikh-iv
Contributor

@sa-matiny OK, please add more info:

  • Upload your datasets to any storage and share the link with me
  • Don't forget to define the d2a variable (it is currently undefined)

I need this to reproduce your error.

@sa-matiny
Author

sa-matiny commented Oct 3, 2017

Here is a small data sample, along with the code that produces the error:
https://drive.google.com/file/d/0Bz1OoooH0zo4Y0VqY0pDaFZpRDg/view?usp=sharing

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills and removed need info Not enough information for reproduce an issue, need more info from author labels Oct 31, 2017
@Ronggui

Ronggui commented Jan 10, 2018

I experienced the same error and found that it was caused by empty documents (all of their words had been filtered out).
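
Until a fix is available, one possible workaround is to drop empty documents before training or evaluating. The sketch below is not the fix that was later merged. The helper name `drop_empty_docs` is hypothetical, and it assumes the corpus is an in-memory list of bag-of-words documents and that `doc2author` maps each integer document index to a list of author names.

```python
def drop_empty_docs(corpus, doc2author):
    """Remove empty bag-of-words documents and renumber the remaining ones.

    Hypothetical helper: assumes `corpus` is a list of documents (each a list
    of (word_id, count) pairs) and `doc2author` maps document index -> authors.
    """
    kept_corpus = []
    kept_doc2author = {}
    for doc_id, doc in enumerate(corpus):
        if not doc:
            # All words were filtered out of this document; skip it.
            continue
        # The next free position in kept_corpus becomes the new document id.
        kept_doc2author[len(kept_corpus)] = doc2author[doc_id]
        kept_corpus.append(doc)
    return kept_corpus, kept_doc2author


# Toy data: the second document is empty.
corpus = [[(0, 1), (3, 2)], [], [(1, 1)]]
doc2author = {0: ["alice"], 1: ["bob"], 2: ["alice", "carol"]}

clean_corpus, clean_doc2author = drop_empty_docs(corpus, doc2author)
print(clean_corpus)
print(clean_doc2author)
```

Renumbering matters here because the document indices in the mapping have to stay aligned with positions in the filtered corpus.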

@menshikh-iv
Copy link
Contributor

Thanks for the information @Ronggui

@probinso
Contributor

probinso commented Jul 17, 2018

After running into this problem as well, I have determined a fix for it. I'm currently reading through the contribution guide so that I can submit a conformant pull request.

This error is thrown because numpy.array([]) defaults to a floating-point dtype (float64) rather than an integer dtype. NumPy normally infers the dtype from the contents of the array, but an empty array provides nothing to infer from.
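
The explanation above can be reproduced without gensim. This is a minimal sketch, not the patch itself: the matrix is a made-up stand-in for `expElogbeta`, and passing an explicit `dtype=int` is just one assumed way of avoiding the problem.

```python
import numpy as np

# Hypothetical stand-in for the topic-word matrix: 3 topics, 5 vocabulary terms.
expElogbeta = np.ones((3, 5))

# Word ids of an empty document, built without and with an explicit dtype.
empty_default = np.array([])          # dtype is inferred as float64
empty_int = np.array([], dtype=int)   # dtype is forced to an integer type

try:
    expElogbeta[:, empty_default]
    float_index_failed = False
except IndexError:
    # Float arrays are rejected as indices, even when they are empty.
    float_index_failed = True

# An empty integer index is accepted and selects zero columns.
empty_selection = expElogbeta[:, empty_int]

print(empty_default.dtype, float_index_failed, empty_selection.shape)
```

An empty selection still has one row per topic, so downstream code that only relies on the number of rows can keep working for empty documents.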

probinso added a commit to probinso/gensim that referenced this issue Jul 18, 2018
probinso added a commit to probinso/gensim that referenced this issue Jul 18, 2018
probinso added a commit to probinso/gensim that referenced this issue Jul 20, 2018
probinso added a commit to probinso/gensim that referenced this issue Jul 20, 2018
@probinso
Contributor

All checks are finally passing. I was a little confused about one of the CircleCI checks.

menshikh-iv pushed a commit that referenced this issue Aug 2, 2018
* test for #1589

* bugfix #1589

* ignore unused assigned variable

* PR review

* Update test_atmodel.py
4 participants