Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

similarity_matrix: suspicous column indexing logic when tfidf model supplied #1960

Closed
nataliedurgin opened this issue Mar 7, 2018 · 5 comments
Labels
difficulty easy Easy issue: required small fix

Comments

@nataliedurgin
Copy link

nataliedurgin commented Mar 7, 2018

Description

Expected gensim.models.keyedvectors WordEmbeddingsKeyedVectors property similarity_matrix to respect the reordering of terms dictated by a supplied tfidf model. This does not seem to be the case.

Steps/Code/Corpus to Reproduce

This will be an unconventional bug report because I did not find this bug "in the wild". In adapting the code for another purpose, I noticed some logic which might be problematic. I will

  • point out where I think the mistake lies in the original context
  • show what corrected the problem in my context
  • provide the test case that demonstrates correctness in my context
Original Context:

https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L516

        for row_number, w1_index in enumerate(word_indices):
            ...
            w1 = dictionary[w1_index]
            ...
            # Traverse upper triangle columns.
            if matrix_order <= nonzero_limit + 1:  # Traverse all columns.
                columns = (
                    (w2_index, self.similarity(w1, dictionary[w2_index]))
                    for w2_index in range(w1_index + 1, matrix_order)
                    if w1_index != w2_index and dictionary[w2_index] in self.vocab)
            ...
            for w2_index, similarity in columns:
                # Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
                if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
                    element = similarity**exponent
                    matrix[w1_index, w2_index] = element
                    matrix_nonzero[w1_index] += 1
                    matrix[w2_index, w1_index] = element
                    matrix_nonzero[w2_index] += 1

A red flag for me was that while we consider the "word_indices" mapping for the row index, we are never considering it for the column index. Even in the case of rows, notice that we are filling "matrix" at the end without regard for the reordering that was supposed to have occurred.

A Correction in Levenshtein Context:
for row_number, w1_index in enumerate(word_indices):
...
    # Traverse upper triangle columns.
    # TODO: determine community preference for pylev.levenshtein or distance.levenshtein
    import pylev
    columns = []
    for col_number in range(row_number + 1, matrix_order):
        if row_number != col_number:
            w2_index = word_indices[col_number]
            w1 = dictionary[w1_index]
            w2 = dictionary[w2_index]
            similarity = pylev.levenshtein(w1, w2)/\
                         float(max(len(w1), len(w2)))
            columns.append((col_number, similarity))

    for col_number, similarity in columns:
        element = alpha * (1 - similarity) ** beta
        matrix[row_number, col_number] = element
        matrix[col_number, row_number] = element
Test Case

Consider the mini corpus:

mini_texts = [['abc'], ['lab', 'abc'], ['bad', 'abc']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
  • The default ordering of terms is in order of appearance: 'abc', 'lab', 'bad'.
  • The Levenshtein similarity matrix returned without supplying tfidf will be:
matrix([[1.        , 0.00740741, 0.        ],
       [0.00740741, 1.        , 0.00740741],
       [0.        , 0.00740741, 1.        ]], dtype=float32)


However, since 'abc' appears in every document, the tfidf coefficient will be zero. The ordering with respect to the tfidf weight will be: 'lab', 'bad', 'abc'. The default Levenshtein similarity scores are pretty symmetric, there are some reorderings for which the matrix will be the same. However, this example is contrived to so that the tfidf reordering will break the symmetry of the original Levenshtein similarity scores:

matrix([[1.        , 0.00740741, 0.00740741],
        [0.00740741, 1.        , 0.        ],
        [0.00740741, 0.        , 1.        ]], dtype=float32)

We see that the revised logic correctly returns the reordered scores. The original logic returned the first matrix when the tfidf model was supplied to the function. A test case for the similarity_matrix function would need to be constructed so that a reordering of terms would lead to a definitive reordering of the relevant scores in the resulting matrix.

Conclusion

While I have not taken the time to prove definitively that there is a bug in the original context, a unit test should be written to cover the case of a supplied tfidf reordering the terms in similarity_matrix. This is a high-risk piece of logic. If there is a bug, and a supplied tfidf isn't being incorporated properly, eventually SoftCosineSimilarity will return nonsense scores when fed documents transformed by the same tfidf model.

Relevant Code

similarity_matrix:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/models/keyedvectors.py#L440

existing unit tests:
https://github.com/RaRe-Technologies/gensim/blob/f9669bb8a0b5b4b45fa8ff58d951a11d3178116d/gensim/test/test_keyedvectors.py#L30

See also related: #1961

Versions

Darwin-15.6.0-x86_64-i386-64bit
('Python', '2.7.13 (default, Apr 4 2017, 08:46:44) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
('NumPy', '1.14.1')
('SciPy', '1.0.0')
('gensim', '3.4.0')
('FAST_VERSION', 0)

@Witiko
Copy link
Contributor

Witiko commented Mar 8, 2018

Thank you for the report. So far, this does not seem to be a bug to me. Several things to note:

Even the documentation specifies that “[…] The rows of the term similarity matrix will be build in an increasing order of importance of terms […]”. The order in which columns are processed is not affected.

@nataliedurgin
Copy link
Author

It is entirely possible I am misunderstanding the purpose of the option of providing a tfidf model into this function. Let me offer a test case (this time in context) that highlights what believe its purpose to be.

import gensim
mini_texts = [['ab'], ['abc', 'ab'], ['bcd', 'ab']]
mini_dict = gensim.corpora.Dictionary(mini_texts)
mini_corpus = [mini_dict.doc2bow(text) for text in mini_texts]
mini_tfidf = gensim.models.TfidfModel(mini_corpus)
w2v_cbow = gensim.models.Word2Vec(mini_texts, sg=1, min_count=1)
w2v_cbow.wv.similarity_matrix(mini_dict, threshold=0.0, exponent=1.0).todense()
Out[365]: 
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)
w2v_cbow.wv.similarity_matrix(mini_dict, tfidf=mini_tfidf, threshold=0.0, exponent=1.0).todense()
Out[366]: 
matrix([[1.       , 0.0664964, 0.       ],
        [0.0664964, 1.       , 0.       ],
        [0.       , 0.       , 1.       ]], dtype=float32)

My understanding is that a document vector, in this case, is a 3-dimensional vector, and the elements in the vector are values that are ordered according to the term order ['ab', 'abc', 'bcd']. Any document vector in the tfidf transformed corpus will be represented by 3-dimensional vector of tfidf coefficients that take their place with respect to the new tfidf weighted term order ['abc', 'bcd', 'ab'].

With respect to the default ordering I would expect:

'ab' 'abc' 'bcd'
'ab' 1 # 0
'abc' # 1 0
'bcd' 0 0 1

With respect to the tfidf ordering I would expect:

'abc' 'bcd' 'ab'
'abc' 1 0 #
'bcd' 0 1 0
'ab' # 0 1

The current function returns the same matrix both times. Won't this cause term mismatches when computing soft cosine similarity with tfidf transformed documents?

@nataliedurgin
Copy link
Author

Thank you for your patience. I see now I had a misconception about the gensim tfidf transformation reordering features. I am not sure where that originated, but I am very glad to have it corrected!

Just to make sure I understand, the tfidf parameter here is not meant to restructure the output in any way, it is supplied as a means to improve performance?

@Witiko
Copy link
Contributor

Witiko commented Mar 8, 2018

That is correct. When the matrix is empty, we can take a row and just fill all columns corresponding to the closest terms in the embedding space. When the matrix is already half-filled, some columns in a row (that do not necessarily correspond to the closest terms) will already be pre-filled (due to the symmetric matrix[w2_index, w1_index] = element assignment) and some columns corresponding to the closest terms will not be fillable (due to the if … and matrix_nonzero[w2_index] <= nonzero_limit: constraint).

The role of the tfidf parameter is to process the rows that correspond to important terms before the matrix has filled up; experiments suggest this improves performance compared to an arbitrary row processing order. Other algorithms for pruning the matrix (such as disregarding symmetry and just taking nonzero_limit closest terms in each row) are possible, but unimplemented. Not pruning the matrix is also an option with small datasets.

@Witiko
Copy link
Contributor

Witiko commented Mar 8, 2018

However, the documentation should be reworded, since the rows are processed in decreasing, not increasing, order of importance. I will make a commit that clarifies the documentation and closes this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix
Projects
None yet
Development

No branches or pull requests

3 participants