AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x. #2105

Closed
DennisCologne opened this issue Jun 26, 2018 · 37 comments
Labels
need info Not enough information for reproduce an issue, need more info from author

Comments

@DennisCologne

DennisCologne commented Jun 26, 2018

Hello there,

Maybe you can help me out with this real quick. I cannot run any of your examples: neither the one from https://radimrehurek.com/gensim/similarities/docsim.html nor the one from this repo. All of them raise the following assertion error:

AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x.

This is not working (other similarity measures of this module work fine):

from gensim.test.utils import common_texts
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SoftCosineSimilarity

model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
dictionary = Dictionary(common_texts)
bow_corpus = [dictionary.doc2bow(document) for document in common_texts]

similarity_matrix = model.wv.similarity_matrix(dictionary)  # construct similarity matrix
index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)

# Make a query.
query = 'graph trees computer'.split()
# calculate similarity between query and each doc from bow_corpus
sims = index[dictionary.doc2bow(query)]

Neither does this one from the repo (I followed all previous steps):

similarity = softcossim(sentence_obama, sentence_orange, similarity_matrix)
print('similarity = %.4f' % similarity)

Thanks in advance. I have been trying to run this for two days now, but nothing works.

Best,
Dennis

@piskvorky
Owner

@Witiko can you have a look?

@Witiko
Contributor

Witiko commented Jun 26, 2018

Hey @DennisCologne,

sorry to say I am the author of the code that gives you trouble. What Gensim and Python versions are you using? I can run the above code without issue with the PyPI version of Gensim (3.4.0) and Python 3.5:

>>> sims
[(6, 0.8305764039419705),
 (7, 0.7257781024707816),
 (5, 0.5584027708699971),
 (0, 0.43455470767273646),
 (8, 0.4082457402348116),
 (1, 0.3028528215099456),
 (3, 0.09251811314306692),
 (4, 0.07636744554253587),
 (2, 0.04509321490371689)]

@DennisCologne
Author

Hi @Witiko,

thank you for your answer.

Actually, it is Python 2.7.14 with Gensim 3.4.0... after further investigation, the matrix-vector multiplication returns a negative value even though all of the values in both are positive.

But you are right, I just tried it on my Python 3.6 environment and there it works fine.
I guess I will use this environment then. But this problem might still be interesting for you.

Thanks again for the quick reply.

Best,
Dennis

@Witiko
Contributor

Witiko commented Jun 28, 2018

Hey @DennisCologne,

this is definitely interesting, but I can't seem to reproduce your problem even with Python 2.7 and Gensim 3.4.0. Can you find a pair of document vectors vec1 and vec2 that trigger the issue, call softcossim(vec1, vec2), and share the contents of vec1, vec2, dense_matrix, vec1len, and vec2len just before the failing assertion?
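
For readers following along, the quantities requested above can be sketched in plain Python. This is a minimal illustration, assuming the usual soft cosine definition x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)); it is not the Gensim implementation, and the function names quad_form and soft_cosine as well as the toy matrix values are made up for this example.

```python
import math


def quad_form(x, S, y):
    """Return x^T * S * y for dense lists x, y and a dense square matrix S."""
    return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def soft_cosine(x, y, S):
    """Soft cosine similarity of x and y under the term similarity matrix S."""
    xlen = quad_form(x, S, x)  # plays the role of vec1len
    ylen = quad_form(y, S, y)  # plays the role of vec2len
    if xlen <= 0.0 or ylen <= 0.0:
        raise ValueError("x^T * S * x must be positive for nonzero x")
    return quad_form(x, S, y) / (math.sqrt(xlen) * math.sqrt(ylen))


# Toy similarity matrix (assumed values): symmetric, ones on the diagonal.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
vec1 = [1.0, 0.0, 0.0]
vec2 = [0.0, 1.0, 0.0]

vec1len = quad_form(vec1, S, vec1)
vec2len = quad_form(vec2, S, vec2)
similarity = soft_cosine(vec1, vec2, S)
```

With non-negative vectors and a non-negative matrix, every term of the sum is non-negative, which is why a negative value for vec1len or vec2len would be surprising.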

@menshikh-iv menshikh-iv added the need info Not enough information for reproduce an issue, need more info from author label Jul 31, 2018
@menshikh-iv
Contributor

ping @DennisCologne, please provide the information needed to reproduce the error (as requested in #2105 (comment))

@menshikh-iv
Contributor

ping @DennisCologne

@tvrbanec

Similar issue with SoftCosineSimilarity.
Please check at https://groups.google.com/forum/#!topic/gensim/WVTRdZONtrc
Python2.7, gensim 3.7

@piskvorky
Owner

ping @Witiko

@Witiko
Contributor

Witiko commented Jan 25, 2019

I fail to see how this is related to the current issue, which should have been long closed due to the original poster's inactivity and the migration of the related code in Gensim 3.7.

@tvrbanec

AssertionError + SoftCosineSimilarity = not related?
I will post the full code if you are willing to look into the issue. Would you prefer that I open a new issue?

@Witiko
Contributor

Witiko commented Jan 25, 2019

The assertion error in this issue is supposed to come from the code in the pre-3.7 softcossim method, which used to reside in gensim.matutils and has since moved to the gensim.similarities.termsim module. Your issue is with the gensim.models.keyedvectors module.

@tvrbanec

tvrbanec commented Jan 25, 2019

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)  # train word-vectors
    termsim_index = WordEmbeddingSimilarityIndex(model)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
    sims = docsim_index[bow_corpus]  # calculate similarity of query to each doc from bow_corpus
    return sims

Traceback (most recent call last):
    termsim_index = WordEmbeddingSimilarityIndex(model)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1389, in __init__
    assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
AssertionError

What is wrong with this code that SoftCosineSimilarity doesn't like it? I tried to follow the tutorial...

@Witiko
Contributor

Witiko commented Jan 25, 2019

For some reason, your word embeddings do not have the WordEmbeddingsKeyedVectors type. What type do they have?

@tvrbanec

I am using gensim Word2Vec to generate w2v_model.

@Witiko
Contributor

Witiko commented Jan 25, 2019

Your issue above can be resolved by calling WordEmbeddingSimilarityIndex(model.wv) instead of WordEmbeddingSimilarityIndex(model). I will update the code, so that it is more aware of the distinction between BaseAny2VecModel (model) and WordEmbeddingsKeyedVectors (model.wv).
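
The distinction can be sketched without Gensim at all. The helper below is hypothetical (as_keyed_vectors is not a Gensim function); it only assumes, as stated above, that a trained model exposes its word vectors through a wv attribute, and the stand-in classes exist purely to make the example runnable.

```python
class FakeKeyedVectors(object):
    """Stand-in for the object stored in `model.wv` (illustration only)."""


class FakeModel(object):
    """Stand-in for a trained model that keeps its vectors in `wv`."""

    def __init__(self):
        self.wv = FakeKeyedVectors()


def as_keyed_vectors(model_or_vectors):
    """Return `obj.wv` when it exists, otherwise return the object itself."""
    return getattr(model_or_vectors, "wv", model_or_vectors)


model = FakeModel()
from_model = as_keyed_vectors(model)        # unwraps model.wv
from_vectors = as_keyed_vectors(model.wv)   # already vectors, passed through
```

Either way the caller ends up with the vectors object, which is what WordEmbeddingSimilarityIndex expects according to the assertion in the traceback.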

@Witiko
Contributor

Witiko commented Jan 25, 2019

I cannot reproduce your other issue, i.e. model.wv.similarity_matrix throwing a TypeError:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> dictionary = Dictionary(common_texts)
>>> model.wv.similarity_matrix(dictionary)
<12x12 sparse matrix of type '<type 'numpy.float32'>'
        with 68 stored elements in Compressed Sparse Column format>

Can you run the above code without issue?

@tvrbanec

> Can you run the above code without issue?

Yes, I can.

@tvrbanec

tvrbanec commented Jan 25, 2019

Now, a few steps further, for:
similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)
I got:
NameError: global name 'TermSimilarityMatrix' is not defined

@Witiko
Contributor

Witiko commented Jan 25, 2019

Please, try the following:

>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> 
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

@tvrbanec

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
    for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable

@tvrbanec

Maybe the problem is caused by terms with underscores, like 'chemical_element' or 'cabinet_minister'?

@Witiko
Contributor

Witiko commented Jan 25, 2019

I cannot reproduce your issue with new embeddings:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> 
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
>>> similarity_matrix
<gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7f822abc3d10>

Judging by the error message, model.wv.most_similar returns a number, not an iterable. Can you print the result of model.wv.most_similar(positive=['chemical_element'], topn=2), please?

@tvrbanec

For common_texts, the output is:

model.wv.most_similar(positive=['chemical_element'], topn=2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-13a509737ea2> in <module>()
----> 1 model.wv.most_similar(positive=['chemical_element'], topn=2)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    541                 mean.append(weight * word)
    542             else:
--> 543                 mean.append(weight * self.word_vec(word, use_norm=True))
    544                 if word in self.vocab:
    545                     all_words.add(self.vocab[word].index)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
    462             return result
    463         else:
--> 464             raise KeyError("word '%s' not in vocabulary" % word)
    465 
    466     def get_vector(self, word):

KeyError: "word 'chemical_element' not in vocabulary"

@Witiko
Contributor

Witiko commented Jan 25, 2019

Can you please try with the embeddings that throw the TypeError: 'numpy.float32' object is not iterable exception? I understand that these should contain an embedding for the word chemical_element.

@tvrbanec

For my text it stops even before:

In [12]: similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-abfb8b1569f4> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.pyc in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, t1, topn)
   1399         else:
   1400             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1401             for t2, similarity in most_similar:
   1402                 if similarity > self.threshold:
   1403                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable

@Witiko
Contributor

Witiko commented Jan 25, 2019

As you can see on line 1400 in the error message above, SparseTermSimilarityMatrix calls model.wv.most_similar internally. According to the error message, the result of calling model.wv.most_similar is a float, not an iterable. This is highly suspect.

Therefore, can you please print the result of model.wv.most_similar(positive=['chemical_element'], topn=2) instead of calling the SparseTermSimilarityMatrix constructor? As you noted, there is no issue when you construct the model using common_texts, so this seems to be an issue with your embeddings.

@tvrbanec

Thank you for your patience :)

In [13]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[13]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

@Witiko
Contributor

Witiko commented Jan 25, 2019

This seems pretty iterable to me.

@tvrbanec

tvrbanec commented Jan 25, 2019

Does my text cause an error on your computer?

@Witiko
Contributor

Witiko commented Jan 25, 2019

Let's try to closely imitate the call on line 1400. Can you please print the result of the following:

>>> termsim_index.kwargs
>>> termsim_index.keyedvectors
>>> most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)
>>> most_similar
>>> type(most_similar)
>>> '__iter__' in most_similar

@Witiko
Contributor

Witiko commented Jan 25, 2019

> My text does not make an error at your computer?

What is your text? Never mind, I see it now.

@tvrbanec

In [15]: termsim_index.kwargs
Out[15]: {}

In [16]: termsim_index.keyedvectors
Out[16]: <gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f18fca001d0>

In [17]: most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)

In [18]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[18]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

In [19]: most_similar
Out[19]: 
[('inhabitant', 0.93882817029953),
 ('the', 0.9326512813568115),
 ('give', 0.93118816614151),
 ('have', 0.9303354620933533),
 ('act', 0.928438663482666),
 ('one', 0.9224538803100586),
 ('to', 0.9192750453948975),
 ('china', 0.9141452312469482),
 ('associate_degree', 0.9119006395339966),
 ('with', 0.9078292846679688),
 ('be', 0.9045330286026001),
 ('statement', 0.898809552192688),
 ('which', 0.8987339735031128),
 ('vote', 0.89339280128479),
 ('time_period', 0.89242023229599),
 ('of', 0.8907227516174316),
 ('playing_card', 0.8895689249038696),
 ('first', 0.8888296484947205),
 ('oregon', 0.8866477608680725),
 ('merkel', 0.8860795497894287),
 ('person', 0.8851599097251892),
 ('from', 0.8846421241760254),
 ('in', 0.8816125988960266),
 ('and', 0.8810371160507202),
 ('this', 0.8801096081733704),
 ('make', 0.8777452707290649),
 ('meet', 0.8769802451133728),
 ('besides', 0.8752848505973816),
 ('angular_distance', 0.873124361038208),
 ('that', 0.8714672327041626),
 ('on', 0.8699379563331604),
 ('other', 0.8691580891609192),
 ('change', 0.8684202432632446),
 ('obama', 0.8667253851890564),
 ('communication', 0.8621015548706055),
 ('engineering', 0.8615524172782898),
 ('some', 0.8598195314407349),
 ('now', 0.8572754859924316),
 ('exchange', 0.8560868501663208),
 ('for', 0.8554658889770508),
 ('title', 0.8532639741897583),
 ('express', 0.8532208204269409),
 ('right', 0.8518909811973572),
 ('head_of_state', 0.847177267074585),
 ('free', 0.846038281917572),
 ('remove', 0.8458209037780762),
 ('germany', 0.8454596996307373),
 ('union', 0.8446109294891357),
 ('would', 0.8416316509246826),
 ('faculty', 0.8411930799484253),
 ('weekday', 0.8399801850318909),
 ('merely', 0.8379250764846802),
 ('we', 0.8371882438659668),
 ('political_unit', 0.8370255827903748),
 ('work', 0.8348655104637146),
 ('take', 0.8348475694656372),
 ('administrative_district', 0.8343826532363892),
 ('tpp', 0.833882749080658),
 ('administrator', 0.8318067789077759),
 ('united_nations_agency', 0.8316440582275391),
 ('washington', 0.8313312530517578),
 ('politician', 0.8289576768875122),
 ('legislature', 0.8287457227706909),
 ('plan_of_action', 0.8201491832733154),
 ('management', 0.8187181949615479),
 ('federal', 0.8167140483856201),
 ('new', 0.8154265880584717),
 ('travel', 0.8148607015609741),
 ('not', 0.8135936856269836),
 ('about', 0.8135201334953308),
 ('republican', 0.8131340742111206),
 ('him', 0.8047671318054199),
 ('by', 0.8038091659545898),
 ('associate', 0.8037841320037842),
 ('activity', 0.8029162287712097),
 ('structure', 0.8025172352790833),
 ('pacific', 0.799057126045227),
 ('point', 0.7987416982650757),
 ('more', 0.7969338893890381),
 ('message', 0.7965559959411621),
 ('organization', 0.7899693250656128),
 ('digit', 0.7889872789382935),
 ('connect', 0.7889586687088013),
 ('when', 0.7868154048919678),
 ('result', 0.7862980961799622),
 ('his', 0.7852383852005005),
 ('they', 0.783265233039856),
 ('schulz', 0.7814303636550903),
 ('group_action', 0.7772569060325623),
 ('european', 0.7769173979759216),
 ('large_integer', 0.775283932685852),
 ('under', 0.7743880748748779),
 ('inform', 0.771774172782898),
 ('mexico', 0.7684292793273926),
 ('against', 0.7668302059173584),
 ('steinmeier', 0.7626404762268066),
 ('supply', 0.7593228816986084),
 ('better', 0.7585717439651489),
 ('support', 0.7579919695854187),
 ('change_state', 0.7550258636474609)]

In [20]: type(most_similar)
Out[20]: list

In [21]: '__iter__' in most_similar
Out[21]: False
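
Note that '__iter__' in most_similar tests whether the string '__iter__' is an element of the list, so False here does not mean that the list is not iterable. A small sketch of the difference, using the first two pairs from the output above:

```python
most_similar = [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

# Membership test: is the string '__iter__' one of the list elements?
is_member = '__iter__' in most_similar

# Attribute test: does the object implement the iteration protocol?
has_iter = hasattr(most_similar, '__iter__')

# A plain float does not, which is the situation the TypeError describes.
float_has_iter = hasattr(0.93, '__iter__')
```

Here is_member is False while has_iter is True, so the list returned with topn=100 is iterable as expected.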

@Witiko
Contributor

Witiko commented Jan 25, 2019

I can reproduce this with your text and I am investigating.

@Witiko
Contributor

Witiko commented Jan 25, 2019

The issue is that the most_similar method does not return a list of (word, similarity) pairs when topn=0; it returns an array of raw similarities instead:

>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 ,  0.32174808, -0.02938104, -0.21145007,  0.37524396,
       -0.23777878,  0.99999994, -0.01436211,  0.36708638, -0.09770551,
        0.05963777,  0.3810038 ], dtype=float32)

This is undocumented behavior, which can be fixed by removing lines 554 and 555 in keyedvectors.py. Sadly, I don't see how a caller can easily work around this without changing the package code. After removing those lines, you will get the expected result and, more importantly, SparseTermSimilarityMatrix should work:

>>> model.wv.most_similar(positive=['computer'], topn=0)
[]
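
To see why this leads to the TypeError reported earlier in the thread, here is a toy model of the behavior. It is not Gensim's code: toy_most_similar and its scores are invented for illustration, and it only mimics the observation that a falsy topn yields raw similarity scores instead of (word, similarity) pairs.

```python
def toy_most_similar(scores, topn):
    """Mimic the reported behavior: a falsy topn returns the raw score list."""
    if not topn:
        return list(scores.values())
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]


scores = {'response': 0.38, 'minors': 0.37, 'graph': -0.12}

pairs = toy_most_similar(scores, topn=2)   # list of (word, similarity) tuples
raw = toy_most_similar(scores, topn=0)     # bare floats, no words attached

try:
    for word, similarity in raw:  # same unpacking pattern as the caller
        pass
    unpack_error = None
except TypeError as error:
    unpack_error = error
```

Unpacking each element into two names works for the tuples in pairs but fails on the bare floats in raw, which is the same kind of failure as in the traceback.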

@Witiko
Contributor

Witiko commented Jan 25, 2019

The patches are now available in #2356. Thank you for your patience in helping discover the bug and sorry for the trouble. 😉

@Vineet-Sharma29

I have the following code:

model = KeyedVectors.load_word2vec_format('/home/vineet/Downloads/lemmatized-legal/no replacement/legal_lemmatized_no_replacement.bin', binary=True)

bow_corpus, doc_dict = corpora.MmCorpus('./bow_corpus.mm'), corpora.Dictionary.load('./doc_dict.dict')

# compute cosine similarity between word embeddings
termsim_index = WordEmbeddingSimilarityIndex(model)

# construct term similarity matrix
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)

And it gives me the following error:

File "word2vec.py", line 25, in <module>
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)
  File "/home/vineet/.local/lib/python3.6/site-packages/gensim/similarities/termsim.py", line 264, in __init__
    100.0 * matrix.getnnz() / matrix_order**2)
ZeroDivisionError: float division by zero

What are the probable reasons for this, and how can I resolve it?

@Witiko
Contributor

Witiko commented Jul 28, 2020

It seems as though your matrix_order is zero, which would indicate that your doc_dict dictionary is empty. Can you verify?
We should check for this and raise a ValueError with a user-friendly message earlier in the constructor.
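
Such an early check might look like the sketch below. check_matrix_order is a hypothetical helper, not part of Gensim, and the message wording is an assumption; it relies only on the dictionary supporting len().

```python
def check_matrix_order(dictionary):
    """Return the matrix order, or raise ValueError for an empty dictionary."""
    matrix_order = len(dictionary)
    if matrix_order == 0:
        raise ValueError(
            "the dictionary is empty, so a term similarity matrix cannot be built")
    return matrix_order


def nonzero_percentage(nnz, matrix_order):
    """Percentage of stored entries in a matrix_order x matrix_order matrix."""
    return 100.0 * nnz / matrix_order ** 2


order = check_matrix_order({'computer': 0, 'graph': 1})
density = nonzero_percentage(2, order)

try:
    check_matrix_order({})
    empty_error = None
except ValueError as error:
    empty_error = error
```

Raising before the division means an empty dictionary produces a readable message instead of the ZeroDivisionError shown in the traceback.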


6 participants