Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse support when num_best not None in interfaces. Fixes #1294 #1321

Merged
merged 16 commits into from
Jun 22, 2017

Conversation

manneshiva
Copy link
Contributor

Fixes #1294. Added a new function ( any2sparse_clipped() ), in matutils by extending any2sparse() to return the topn elements of greatest magnitude. Also, replaced full2sparse_clipped in interfaces.py to any2sparse_clipped to handle sparse cases when num_best isn't None.
I am fairly new to contributing to open source and so do not have much idea about testing my commits. I have only tested/verified against the code to reproduce the issue and it seemed to be working. Let me know in case I need to do anything else.

@manneshiva
Copy link
Contributor Author

oops, forgot to add a whitespace after comma.

@menshikh-iv
Copy link
Contributor

Thank you @manneshiva, at first, please write unit-test for check that there is no exception (you can rewrite this code from issue as unittest) and write test for any2sparse_clipped for checking output (with sparse and non-sparse inputs)

@manneshiva
Copy link
Contributor Author

manneshiva commented May 15, 2017

@menshikh-iv Thanks for your suggestions. I will write the unit-test asap. Regarding this comment, it is handled for dense inputs. For other(non-full) cases, we have two options:

  1. Converting the input vector to sparse and then sorting these 2-tuples(implemented in current commit). (or)
  2. Creating a full array from input vector and then proceeding like full2sparse_clipped().

I chose method (1), because I feel that in case of sparse vectors, length of the sparse 2-tuples << num_documents and thus sorting these tuples directly will take lesser time(which isn't the case with full inputs and therefore handled separately). Does this make sense? Would you rather recommend method (2) ?

if isinstance(vec, np.ndarray):
return full2sparse_clipped(vec, topn, eps)
vec_sparse = any2sparse(vec, eps)
return sorted(vec_sparse, key=lambda x: x[1], reverse=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only return topn

@tmylk
Copy link
Contributor

tmylk commented May 15, 2017

Could you please benchmark the performance of 1 and 2 for, say, 1% of non-zero elements in a 10k long array?

@manneshiva
Copy link
Contributor Author

@menshikh-iv @tmylk In the process of re-writing the code from the issue for the unit test, I realized a fault in my/our understanding of the problem.

def testMaintainSparsity(self):
        """Sparsity is correctly maintained when maintain_sparsity=True"""
        num_features = len(dictionary)

        index = self.cls(corpus, num_features=num_features)
        dense_sims = index[corpus]

        index = self.cls(corpus, num_features=num_features, maintain_sparsity=True)
        sparse_sims = index[corpus]

        self.assertFalse(scipy.sparse.issparse(dense_sims))
        self.assertTrue(scipy.sparse.issparse(sparse_sims))
        numpy.testing.assert_array_equal(dense_sims, sparse_sims.todense())

As seen from this unit test, 'maintain_sparsity' indicates the type of the output(whether scipy sparse or not) and has nothing to do with the type of the input. In the case when num_best is not None, the output is expected to be a list of reverse sorted tuples(index,similarity). So basically, the problem is not the sparse input but rather asking for a sparse output when num_best=some_n. It does not make sense to return a scipy sparse matrix in the case when num_best is not None and maintain_sparsity is True, which is exactly what was assumed in the present code. The fix for this problem is to simply raise an error when trying to set maintain_sparsity to True and num_best=n. Should have caught this before. Or am I missing something?

@tmylk
Copy link
Contributor

tmylk commented May 17, 2017

The goal of this issue was to support num_best is not None and maintain_sparsity=True case. You see that there is a problem with this. Could you please explain better what that problem is?

@manneshiva
Copy link
Contributor Author

manneshiva commented May 17, 2017

@tmylk Suppose we were to support num_best is not None and maintain_sparsity=True , what should be the output? To maintain sparity, is it sufficient to return a list of scipy.sparce.csr rows(i.e. list of 1xn sparse rows). For clarity, I am posting the output of the code that I have written:
Output of present gensim code for num_best=3 and maintain_sparsity=False
[[(0, 0.99999994039535522), (2, 0.28867512941360474), (1, 0.2357022613286972)], [(1, 1.0), (4, 0.70710676908493042), (2, 0.40824830532073975)], [(2, 1.0), (3, 0.61237245798110962), (1, 0.40824830532073975)], [(3, 1.0), (2, 0.61237245798110962), (1, 0.3333333432674408)], [(4, 0.99999994039535522), (1, 0.70710676908493042), (2, 0.28867512941360474)], [(5, 1.0), (6, 0.70710676908493042), (7, 0.57735025882720947)], [(6, 0.99999994039535522), (7, 0.81649655103683472), (5, 0.70710676908493042)], [(7, 0.99999994039535522), (6, 0.81649655103683472), (8, 0.66666662693023682)], [(8, 0.99999994039535522), (7, 0.66666662693023682), (6, 0.40824827551841736)]]
----------------------------------------------------------------------------------
Output of my current code for num_best=3 and maintain_sparsity=True

[<1x3 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x5 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x4 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x4 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x5 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x8 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x8 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x9 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x9 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>]

----------------------------------------------------------------------------------
Content of this list of sparse rows:
[[(0, 0.99999994), (2, 0.28867513), (1, 0.23570226)], [(1, 1.0), (4, 0.70710677), (2, 0.40824831)], [(2, 1.0), (3, 0.61237246), (1, 0.40824831)], [(3, 1.0), (2, 0.61237246), (1, 0.33333334)], [(4, 0.99999994), (1, 0.70710677), (2, 0.28867513)], [(5, 1.0), (6, 0.70710677), (7, 0.57735026)], [(6, 0.99999994), (7, 0.81649655), (5, 0.70710677)], [(7, 0.99999994), (6, 0.81649655), (8, 0.66666663)], [(8, 0.99999994), (7, 0.66666663), (6, 0.40824828)]]

The sparse output is the same as the dense output as seen above. But it is not possible to convert this list of sparse rows to a sparse matrix since the shape of each row is different(depends on the highest index in topn indices). Is it ok to return a list of sparse rows? What is the expected output?

@manneshiva
Copy link
Contributor Author

@tmylk Found a way to return a sparse csr_matrix (possible even with variable shape of each row) instead of list of sparse rows as mentioned. But it loses the ordering of the values in each row. zip(row[i].indices, row[i].data) gives a different order rather than a decreasing similarity value order. Once you confirm on the required output, I shall immediately commit my changes. Thanks!

@manneshiva
Copy link
Contributor Author

@tmylk Tried and tested a way to create a sparse scipy matrix which preserves the order. So the current output will be a scipy sparse order preserving matrix. Will code it and push the changes soon. Thanks for your patience.

@manneshiva
Copy link
Contributor Author

@tmylk I have fixed the issue with the following changes:

  1. Replaced any2sparse_clipped with scipy2scipy_clipped(function for clipping topn from a scipy matrix).
  2. Added new code path for the case when maintain_sparse = True and num_best is not None.
  3. Added unit tests for (1) and (2).

The output type (for maintain_sparse = True and num_best is not None) is now consistent with the output when maintain_sparse = True and num_best is None. I have also tested against the code in the issue and compared the output of sparse and non sparse when num_best is not None. The issue looks resolved. Please review and let me know if any changes are needed. Thanks.

@manneshiva
Copy link
Contributor Author

@tmylk Let me know if any changes are required.

"""Tests that sparsity is correctly maintained when maintain_sparsity=True and num_best is not None"""
num_features = len(dictionary)

index = self.cls(corpus, num_features=num_features, num_best=3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicit maintain_sparsity=False will add clarity.

@tmylk
Copy link
Contributor

tmylk commented May 29, 2017

@manneshiva I am concerned about the call to matrix.data as it creates a numpy array in memory and thus the benefits of having sparse format disappear.

Please preserve sparsity when selecting num_best

@@ -222,6 +222,12 @@ def __getitem__(self, query):
if self.num_best is None:
return result

# if maintain_sparity is True, result is scipy sparse. Sort, clip the
# topn and return as a scipy sparse matrix.
if hasattr(self, 'maintain_sparsity'):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getattr accepts a default value.

@@ -166,6 +166,39 @@ def any2sparse(vec, eps=1e-9):
return [(int(fid), float(fw)) for fid, fw in vec if np.abs(fw) > eps]


def scipy2scipy_clipped(matrix, topn, eps=1e-9):
"""
Return a scipy.sparse vector/matrix consisting of 'topn' elements of greatest magnitude(abs).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of the greatest magnitude (absolute value).

# Return clipped sparse vector if input is a sparse vector.
if matrix.shape[0] == 1:
# use np.argpartition/argsort and only form tuples that are actually returned.
biggest = argsort(abs(matrix).data, topn, reverse=True)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abs(matrix.data) should be much faster than abs(matrix).data (numpy vs scipy).

matrix_indptr = [0]
for v in matrix:
# Sort and clip each row vector first.
biggest = argsort(abs(v).data, topn, reverse=True)
Copy link
Owner

@piskvorky piskvorky May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dtto.

Even faster, but slightly hairier, to call abs just once, on the entire matrix data, rather than once per row.

@manneshiva
Copy link
Contributor Author

@tmylk cc @piskvorky Will push the required changes soon.

@manneshiva I am concerned about the call to matrix.data as it creates a numpy array in memory and thus the benefits of having sparse format disappear.
Please preserve sparsity when selecting num_best

I am not sure about what you mean. matrix.data returns all nonzero values and not the full length(#docs) array thus still preserves sparsity. Am I missing something?

@menshikh-iv
Copy link
Contributor

Please try run code from issue and fix this problem @manneshiva

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-57-3e2df5bae673> in <module>()
      6 index = SparseMatrixSimilarity(Sparse2Corpus(X, documents_columns=False), num_features=X.shape[1], num_terms=X.shape[0], maintain_sparsity=True, num_best=100)
      7 
----> 8 for s in index:
      9     print s, s.data

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/interfaces.pyc in __iter__(self)
    269                 chunk_end = min(self.index.shape[0], chunk_start + self.chunksize)
    270                 chunk = self.index[chunk_start : chunk_end]
--> 271                 for sim in self[chunk]:
    272                     yield sim
    273         else:

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/interfaces.pyc in __getitem__(self, query)
    226         # topn and return as a scipy sparse matrix.
    227         if getattr(self, 'maintain_sparsity', False):
--> 228                 return matutils.scipy2scipy_clipped(result, self.num_best)
    229 
    230         # if the input query was a corpus (=more documents), compute the top-n

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/matutils.pyc in scipy2scipy_clipped(matrix, topn, eps)
    193             # Sort and clip each row vector first.
    194             biggest = argsort(matrix_abs[matrix.indptr[i]:matrix.indptr[i + 1]], topn, reverse=True)
--> 195             indices, data = v.indices.take(biggest), v.data.take(biggest)
    196             # Store the topn indices and values of each row vector.
    197             matrix_data.append(data)

IndexError: cannot do a non-empty take from an empty axes.

@manneshiva
Copy link
Contributor Author

@menshikh-iv Made required changes and tested on code from issue (maintain_sparsity = True , num_best=100). Also checked correctness by comparing output with maintain_sparsity = False , num_best=100. The issue looks resolved. Please confirm and let me know.
`

@menshikh-iv
Copy link
Contributor

Thank you @manneshiva

@menshikh-iv menshikh-iv merged commit dfd7da4 into piskvorky:develop Jun 22, 2017
saparina pushed a commit to saparina/gensim that referenced this pull request Jul 9, 2017
…one. Fix piskvorky#1294 (piskvorky#1321)

* added any2sparse_clipped() function

* changed full2sparse_clipped to any2sparse_clipped in __getitem__

* added missing whitespace

* return topn from any2sparse_clipped()

* efficient any2sparse_clipped implementation

* added unit test for any2sparse_clipped

* function call corrected

* removed any2sparse_clipped and added scipy2scipy_clipped

* added new code path for maintain_sparsity

* added unit tests for new function and issue

* fixed flake8 errors

* fixed matrix_indptr

* added requested changes

* replaced hasattr with getattr

* call abs() once for entire matrix in scipy2scipy_clipped

* removed matrix.sort_indices and removed indptr while calling argsort
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants