Sparse support when num_best not None in interfaces. Fixes #1294 #1321

manneshiva · 2017-05-14T21:29:36Z

Fixes #1294. Added a new function ( any2sparse_clipped() ), in matutils by extending any2sparse() to return the topn elements of greatest magnitude. Also, replaced full2sparse_clipped in interfaces.py to any2sparse_clipped to handle sparse cases when num_best isn't None.
I am fairly new to contributing to open source and so do not have much idea about testing my commits. I have only tested/verified against the code to reproduce the issue and it seemed to be working. Let me know in case I need to do anything else.

manneshiva · 2017-05-14T21:53:13Z

oops, forgot to add a whitespace after comma.

menshikh-iv · 2017-05-15T04:53:18Z

Thank you @manneshiva, at first, please write unit-test for check that there is no exception (you can rewrite this code from issue as unittest) and write test for any2sparse_clipped for checking output (with sparse and non-sparse inputs)

manneshiva · 2017-05-15T11:53:42Z

@menshikh-iv Thanks for your suggestions. I will write the unit-test asap. Regarding this comment, it is handled for dense inputs. For other(non-full) cases, we have two options:

Converting the input vector to sparse and then sorting these 2-tuples(implemented in current commit). (or)
Creating a full array from input vector and then proceeding like full2sparse_clipped().

I chose method (1), because I feel that in case of sparse vectors, length of the sparse 2-tuples << num_documents and thus sorting these tuples directly will take lesser time(which isn't the case with full inputs and therefore handled separately). Does this make sense? Would you rather recommend method (2) ?

tmylk · 2017-05-15T21:36:00Z

gensim/matutils.py

+    if isinstance(vec, np.ndarray):
+        return full2sparse_clipped(vec, topn, eps)
+    vec_sparse = any2sparse(vec, eps)
+    return sorted(vec_sparse, key=lambda x: x[1], reverse=True)


Please only return topn

tmylk · 2017-05-15T21:43:02Z

Could you please benchmark the performance of 1 and 2 for, say, 1% of non-zero elements in a 10k long array?

manneshiva · 2017-05-17T01:17:00Z

@menshikh-iv @tmylk In the process of re-writing the code from the issue for the unit test, I realized a fault in my/our understanding of the problem.

def testMaintainSparsity(self):
        """Sparsity is correctly maintained when maintain_sparsity=True"""
        num_features = len(dictionary)

        index = self.cls(corpus, num_features=num_features)
        dense_sims = index[corpus]

        index = self.cls(corpus, num_features=num_features, maintain_sparsity=True)
        sparse_sims = index[corpus]

        self.assertFalse(scipy.sparse.issparse(dense_sims))
        self.assertTrue(scipy.sparse.issparse(sparse_sims))
        numpy.testing.assert_array_equal(dense_sims, sparse_sims.todense())

As seen from this unit test, 'maintain_sparsity' indicates the type of the output(whether scipy sparse or not) and has nothing to do with the type of the input. In the case when num_best is not None, the output is expected to be a list of reverse sorted tuples(index,similarity). So basically, the problem is not the sparse input but rather asking for a sparse output when num_best=some_n. It does not make sense to return a scipy sparse matrix in the case when num_best is not None and maintain_sparsity is True, which is exactly what was assumed in the present code. The fix for this problem is to simply raise an error when trying to set maintain_sparsity to True and num_best=n. Should have caught this before. Or am I missing something?

tmylk · 2017-05-17T11:11:22Z

The goal of this issue was to support num_best is not None and maintain_sparsity=True case. You see that there is a problem with this. Could you please explain better what that problem is?

manneshiva · 2017-05-17T19:04:53Z

@tmylk Suppose we were to support num_best is not None and maintain_sparsity=True , what should be the output? To maintain sparity, is it sufficient to return a list of scipy.sparce.csr rows(i.e. list of 1xn sparse rows). For clarity, I am posting the output of the code that I have written:
Output of present gensim code for num_best=3 and maintain_sparsity=False
[[(0, 0.99999994039535522), (2, 0.28867512941360474), (1, 0.2357022613286972)], [(1, 1.0), (4, 0.70710676908493042), (2, 0.40824830532073975)], [(2, 1.0), (3, 0.61237245798110962), (1, 0.40824830532073975)], [(3, 1.0), (2, 0.61237245798110962), (1, 0.3333333432674408)], [(4, 0.99999994039535522), (1, 0.70710676908493042), (2, 0.28867512941360474)], [(5, 1.0), (6, 0.70710676908493042), (7, 0.57735025882720947)], [(6, 0.99999994039535522), (7, 0.81649655103683472), (5, 0.70710676908493042)], [(7, 0.99999994039535522), (6, 0.81649655103683472), (8, 0.66666662693023682)], [(8, 0.99999994039535522), (7, 0.66666662693023682), (6, 0.40824827551841736)]]
----------------------------------------------------------------------------------
Output of my current code for num_best=3 and maintain_sparsity=True

[<1x3 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x5 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x4 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x4 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x5 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x8 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x8 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x9 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>, <1x9 sparse matrix of type '<type 'numpy.float32'>'
with 3 stored elements in Compressed Sparse Row format>]

----------------------------------------------------------------------------------
Content of this list of sparse rows:
[[(0, 0.99999994), (2, 0.28867513), (1, 0.23570226)], [(1, 1.0), (4, 0.70710677), (2, 0.40824831)], [(2, 1.0), (3, 0.61237246), (1, 0.40824831)], [(3, 1.0), (2, 0.61237246), (1, 0.33333334)], [(4, 0.99999994), (1, 0.70710677), (2, 0.28867513)], [(5, 1.0), (6, 0.70710677), (7, 0.57735026)], [(6, 0.99999994), (7, 0.81649655), (5, 0.70710677)], [(7, 0.99999994), (6, 0.81649655), (8, 0.66666663)], [(8, 0.99999994), (7, 0.66666663), (6, 0.40824828)]]

The sparse output is the same as the dense output as seen above. But it is not possible to convert this list of sparse rows to a sparse matrix since the shape of each row is different(depends on the highest index in topn indices). Is it ok to return a list of sparse rows? What is the expected output?

manneshiva · 2017-05-17T20:13:03Z

@tmylk Found a way to return a sparse csr_matrix (possible even with variable shape of each row) instead of list of sparse rows as mentioned. But it loses the ordering of the values in each row. zip(row[i].indices, row[i].data) gives a different order rather than a decreasing similarity value order. Once you confirm on the required output, I shall immediately commit my changes. Thanks!

manneshiva · 2017-05-18T11:15:06Z

@tmylk Tried and tested a way to create a sparse scipy matrix which preserves the order. So the current output will be a scipy sparse order preserving matrix. Will code it and push the changes soon. Thanks for your patience.

manneshiva · 2017-05-18T22:11:32Z

@tmylk I have fixed the issue with the following changes:

Replaced any2sparse_clipped with scipy2scipy_clipped(function for clipping topn from a scipy matrix).
Added new code path for the case when maintain_sparse = True and num_best is not None.
Added unit tests for (1) and (2).

The output type (for maintain_sparse = True and num_best is not None) is now consistent with the output when maintain_sparse = True and num_best is None. I have also tested against the code in the issue and compared the output of sparse and non sparse when num_best is not None. The issue looks resolved. Please review and let me know if any changes are needed. Thanks.

manneshiva · 2017-05-29T16:49:30Z

@tmylk Let me know if any changes are required.

tmylk · 2017-05-29T21:13:45Z

gensim/test/test_similarities.py

+        """Tests that sparsity is correctly maintained when maintain_sparsity=True and num_best is not None"""
+        num_features = len(dictionary)
+
+        index = self.cls(corpus, num_features=num_features, num_best=3)


Explicit maintain_sparsity=False will add clarity.

tmylk · 2017-05-29T21:31:57Z

@manneshiva I am concerned about the call to matrix.data as it creates a numpy array in memory and thus the benefits of having sparse format disappear.

Please preserve sparsity when selecting num_best

piskvorky · 2017-05-30T02:30:32Z

gensim/interfaces.py

@@ -222,6 +222,12 @@ def __getitem__(self, query):
        if self.num_best is None:
            return result

+        # if maintain_sparity is True, result is scipy sparse. Sort, clip the
+        # topn and return as a scipy sparse matrix.
+        if hasattr(self, 'maintain_sparsity'):


getattr accepts a default value.

piskvorky · 2017-05-30T02:31:48Z

gensim/matutils.py

@@ -166,6 +166,39 @@ def any2sparse(vec, eps=1e-9):
    return [(int(fid), float(fw)) for fid, fw in vec if np.abs(fw) > eps]


+def scipy2scipy_clipped(matrix, topn, eps=1e-9):
+    """
+    Return a scipy.sparse vector/matrix consisting of 'topn' elements of greatest magnitude(abs).


of the greatest magnitude (absolute value).

piskvorky · 2017-05-30T02:33:07Z

gensim/matutils.py

+    # Return clipped sparse vector if input is a sparse vector.
+    if matrix.shape[0] == 1:
+        # use np.argpartition/argsort and only form tuples that are actually returned.
+        biggest = argsort(abs(matrix).data, topn, reverse=True)


abs(matrix.data) should be much faster than abs(matrix).data (numpy vs scipy).

piskvorky · 2017-05-30T02:34:32Z

gensim/matutils.py

+        matrix_indptr = [0]
+        for v in matrix:
+            # Sort and clip each row vector first.
+            biggest = argsort(abs(v).data, topn, reverse=True)


Dtto.

Even faster, but slightly hairier, to call abs just once, on the entire matrix data, rather than once per row.

manneshiva · 2017-05-31T12:17:01Z

@tmylk cc @piskvorky Will push the required changes soon.

@manneshiva I am concerned about the call to matrix.data as it creates a numpy array in memory and thus the benefits of having sparse format disappear.
Please preserve sparsity when selecting num_best

I am not sure about what you mean. matrix.data returns all nonzero values and not the full length(#docs) array thus still preserves sparsity. Am I missing something?

menshikh-iv · 2017-06-12T08:12:56Z

Please try run code from issue and fix this problem @manneshiva

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-57-3e2df5bae673> in <module>()
      6 index = SparseMatrixSimilarity(Sparse2Corpus(X, documents_columns=False), num_features=X.shape[1], num_terms=X.shape[0], maintain_sparsity=True, num_best=100)
      7 
----> 8 for s in index:
      9     print s, s.data

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/interfaces.pyc in __iter__(self)
    269                 chunk_end = min(self.index.shape[0], chunk_start + self.chunksize)
    270                 chunk = self.index[chunk_start : chunk_end]
--> 271                 for sim in self[chunk]:
    272                     yield sim
    273         else:

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/interfaces.pyc in __getitem__(self, query)
    226         # topn and return as a scipy sparse matrix.
    227         if getattr(self, 'maintain_sparsity', False):
--> 228                 return matutils.scipy2scipy_clipped(result, self.num_best)
    229 
    230         # if the input query was a corpus (=more documents), compute the top-n

/home/ivan/.virtualenvs/cl/local/lib/python2.7/site-packages/gensim/matutils.pyc in scipy2scipy_clipped(matrix, topn, eps)
    193             # Sort and clip each row vector first.
    194             biggest = argsort(matrix_abs[matrix.indptr[i]:matrix.indptr[i + 1]], topn, reverse=True)
--> 195             indices, data = v.indices.take(biggest), v.data.take(biggest)
    196             # Store the topn indices and values of each row vector.
    197             matrix_data.append(data)

IndexError: cannot do a non-empty take from an empty axes.

manneshiva · 2017-06-14T18:11:23Z

@menshikh-iv Made required changes and tested on code from issue (maintain_sparsity = True , num_best=100). Also checked correctness by comparing output with maintain_sparsity = False , num_best=100. The issue looks resolved. Please confirm and let me know.
`

menshikh-iv · 2017-06-22T09:30:34Z

Thank you @manneshiva

…one. Fix piskvorky#1294 (piskvorky#1321) * added any2sparse_clipped() function * changed full2sparse_clipped to any2sparse_clipped in __getitem__ * added missing whitespace * return topn from any2sparse_clipped() * efficient any2sparse_clipped implementation * added unit test for any2sparse_clipped * function call corrected * removed any2sparse_clipped and added scipy2scipy_clipped * added new code path for maintain_sparsity * added unit tests for new function and issue * fixed flake8 errors * fixed matrix_indptr * added requested changes * replaced hasattr with getattr * call abs() once for entire matrix in scipy2scipy_clipped * removed matrix.sort_indices and removed indptr while calling argsort

tmylk reviewed May 15, 2017

View reviewed changes

manneshiva and others added 12 commits May 26, 2017 01:29

added any2sparse_clipped() function

7029641

changed full2sparse_clipped to any2sparse_clipped in __getitem__

50cfef0

added missing whitespace

3a0924c

return topn from any2sparse_clipped()

5c10a69

efficient any2sparse_clipped implementation

a60a5db

added unit test for any2sparse_clipped

b33a364

function call corrected

02421cb

removed any2sparse_clipped and added scipy2scipy_clipped

b868818

added new code path for maintain_sparsity

99c9695

added unit tests for new function and issue

d0d9f62

fixed flake8 errors

e0b9fac

fixed matrix_indptr

3e9966c

tmylk reviewed May 29, 2017

View reviewed changes

piskvorky reviewed May 30, 2017

View reviewed changes

added requested changes

a3c1eb4

manneshiva force-pushed the fix1294 branch from 690251e to a3c1eb4 Compare May 31, 2017 18:40

manneshiva added 2 commits June 1, 2017 02:01

replaced hasattr with getattr

f970113

call abs() once for entire matrix in scipy2scipy_clipped

c0033d6

removed matrix.sort_indices and removed indptr while calling argsort

4d0e9e8

menshikh-iv merged commit dfd7da4 into piskvorky:develop Jun 22, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sparse support when num_best not None in interfaces. Fixes #1294 #1321

Sparse support when num_best not None in interfaces. Fixes #1294 #1321

manneshiva commented May 14, 2017

manneshiva commented May 14, 2017

menshikh-iv commented May 15, 2017

manneshiva commented May 15, 2017 •

edited

Loading

tmylk May 15, 2017

tmylk commented May 15, 2017

manneshiva commented May 17, 2017

tmylk commented May 17, 2017

manneshiva commented May 17, 2017 •

edited

Loading

manneshiva commented May 17, 2017

manneshiva commented May 18, 2017

manneshiva commented May 18, 2017

manneshiva commented May 29, 2017

tmylk May 29, 2017

tmylk commented May 29, 2017

piskvorky May 30, 2017

piskvorky May 30, 2017

piskvorky May 30, 2017

piskvorky May 30, 2017 •

edited

Loading

manneshiva commented May 31, 2017

menshikh-iv commented Jun 12, 2017

manneshiva commented Jun 14, 2017

menshikh-iv commented Jun 22, 2017

Sparse support when num_best not None in interfaces. Fixes #1294 #1321

Sparse support when num_best not None in interfaces. Fixes #1294 #1321

Conversation

manneshiva commented May 14, 2017

manneshiva commented May 14, 2017

menshikh-iv commented May 15, 2017

manneshiva commented May 15, 2017 • edited Loading

tmylk May 15, 2017

Choose a reason for hiding this comment

tmylk commented May 15, 2017

manneshiva commented May 17, 2017

tmylk commented May 17, 2017

manneshiva commented May 17, 2017 • edited Loading

manneshiva commented May 17, 2017

manneshiva commented May 18, 2017

manneshiva commented May 18, 2017

manneshiva commented May 29, 2017

tmylk May 29, 2017

Choose a reason for hiding this comment

tmylk commented May 29, 2017

piskvorky May 30, 2017

Choose a reason for hiding this comment

piskvorky May 30, 2017

Choose a reason for hiding this comment

piskvorky May 30, 2017

Choose a reason for hiding this comment

piskvorky May 30, 2017 • edited Loading

Choose a reason for hiding this comment

manneshiva commented May 31, 2017

menshikh-iv commented Jun 12, 2017

manneshiva commented Jun 14, 2017

menshikh-iv commented Jun 22, 2017

manneshiva commented May 15, 2017 •

edited

Loading

manneshiva commented May 17, 2017 •

edited

Loading

piskvorky May 30, 2017 •

edited

Loading