Implement Levenshtein term similarity matrix and fast SCM between corpora #2016
Merged: menshikh-iv merged 66 commits into piskvorky:develop from Witiko:levenshtein-softcossim on Jan 14, 2019
Commits
ccadc8d Wrap docstring for WordEmbeddingsKeyedVectors.similarity_matrix (Witiko)
517bcc8 Add the gensim.models.levenshtein module (Witiko)
e71b6ff Add projected density to term similarity matrix logs (Witiko)
b8425af Add tests for the gensim.models.levenshtein.similarity_matrix function (Witiko)
80c13ef Separate similarity_matrix methods into director and builder classes. (Witiko)
6f6cdb7 Add symmetric parameter to SparseTermSimilarityMatrix (Witiko)
7274fac Add corpus support to SparseTermSimilarityMatrix.inner_product (Witiko)
27e76b8 Replace scipy.sparse.dok_matrix.has_key with the in operator (Witiko)
739383a Fix handling of unicode in Python 3 in levsim (Witiko)
9ecae3c Remove temporary method similarity of LevenshteinSimilarityIndex (Witiko)
49a2160 Move models.term_similarity, and levenshtein to similarities (Witiko)
c5669fc Make python-Levenshtein a conditional import (Witiko)
7b774dd Add default values to gensim.similarities.levenshtein.levsim arguments (Witiko)
2e8d4fa Remove extraneous addition operators from @deprecated annotations (Witiko)
a6e295f Remove @deprecated annotation from tests (Witiko)
13948dc Merge test_term_similarity, and test_levenshtein with test_similarities (Witiko)
a9706de Reword TermSimilarityIndex docstring (Witiko)
5e3e948 Consume no more than topn similarities produced by a TermSimilarityIndex (Witiko)
4b895ff Use short uints (<64b) for dok_matrix keys and num_nonzero array (Witiko)
5c100a9 Write to matrix_nonzero only when building a symmetric matrix (Witiko)
0efed5e Ensure UniformTermSimilarityIndex does not yield only topn - 1 values (Witiko)
0c3549b Document _shortest_uint_dtype (Witiko)
ee33db8 Add soft cosine measure benchmark, part 1 (Witiko)
da6e6dd Add soft cosine measure benchmark, part 2 (Witiko)
d4053b2 Make similarity_matrix support non-contiguous dictionaries (Witiko)
093d569 Support fast inner product between a document and a corpus (Witiko)
c2888b4 Support fast inner product between a document and a corpus (python 2.7) (Witiko)
32cb4d7 Add faster sparse matrix slicing (Witiko)
099d768 Make Soft Cosine Measure support non-contiguous dictionaries (Witiko)
dd4561d Merge remote-tracking branch 'upstream/develop' into levenshtein-soft… (Witiko)
c8f6ef5 Remove gensim::similarities::levenshtein::similarity_matrix facade (Witiko)
8f026cc Implement SoftCosineSimilarity using the inner_product method (Witiko)
227d09e Fix flake8 warnings (Witiko)
9f8d0e8 Make Soft Cosine Measure support non-contiguous dictionaries (cont) (Witiko)
c316b95 Remove parallelization in gensim::similarities::levenshtein (Witiko)
d6b9bd4 Document future work (Witiko)
5e52477 Update Soft Cosine Measure benchmark after commits 093d569, and c316b95 (Witiko)
4b46597 Update SCM tutorial after PR 2016 (Witiko)
ce95fd9 Add example to gensim::similarities::termsim::SparseTermSimilarityMatrix (Witiko)
f8ff4c7 Merge remote-tracking branch 'upstream/develop' into levenshtein-soft… (Witiko)
ac60615 Add max_distance kwarg to gensim::similarities::levenshtein::levsim (Witiko)
5154569 Replace max_distance kwarg in levsim with min_similarity, add tests (Witiko)
729d185 Remove conditional expression from levsim (Witiko)
155dc58 Use less confusing wording in docsting for min_similarity / max_distance (Witiko)
7e52ef8 Defer thresholding in LevenshteinSimilarityIndex.most_similar to levsim (Witiko)
3866bc9 Merge remote-tracking branch 'upstream/develop' into levenshtein-soft… (Witiko)
a7ee779 Allow None value of nonzero_limit parameter in SparseTermSimilarityMa… (Witiko)
e4395e0 Add positive_definite parameter to SparseTermSimilarityMatrix (Witiko)
98f3f3d Split test_building test into a number of atomic unit tests (Witiko)
2a55786 Presort dictionary keys in UniformTermSimilarityIndex constructor (Witiko)
4d8dc48 Make documentation of SparseTermSimilarityMatrix more accurate (Witiko)
d7fd3f1 Make SparseTermSimilarityMatrix expect negative similarities (Witiko)
46a477e Avoid expensive array copying in dot_product (Witiko)
583c9c7 Update SCM tutorial, and benchmark after PR 2016 (Witiko)
4f26de0 Merge branch 'develop' into levenshtein-softcossim (Witiko)
4d8338e Merge remote-tracking branch 'upstream/develop' into levenshtein-soft… (Witiko)
1cc4a49 Remove fluff from stderr in the SCM tutorial notebook (Witiko)
9ede310 Add a paper reference to the SCM tutorial notebook (Witiko)
c523aa5 Directly import Levenshtein package in levdist (Witiko)
e031630 Use embedded URI instead of indirect hyperlink target in documentation (Witiko)
19bedf1 Assume that max of lens is always an integer (Witiko)
83a07af Make LevenshteinSimilarityIndex.most_similar easier to read (Witiko)
f3258d9 Merge remote-tracking branch 'upstream/develop' into levenshtein-soft… (Witiko)
16ff7ef Make LevenshteinSimilarityIndex.most_similar easier to read (Witiko)
12ee910 Add an ordering test for LevenshteinSimilarityIndex.most_similar (Witiko)
3f04940 Make WordEmbeddingSimilarityIndex.most_similar easier to read (Witiko)
nitpick: no need to use `+` for concatenation if this happens in `()`.
I will fix this once we figure out what to actually deprecate.
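
For reference, the nitpick relies on Python joining adjacent string literals automatically when they sit inside parentheses. A minimal sketch of the two spellings follows; the message text is hypothetical and is not taken from this PR's diff.

```python
# Hypothetical deprecation message; the wording is illustrative only
# and does not come from the gensim code base.
with_plus = (
    "This method is deprecated, " +
    "use the replacement instead."
)

# Adjacent string literals inside parentheses are concatenated by the
# compiler, so the explicit + operator is redundant.
without_plus = (
    "This method is deprecated, "
    "use the replacement instead."
)

# Both spellings produce the same string.
assert with_plus == without_plus
print(without_plus)
```

Both forms yield the same string, so dropping the `+` only changes how the source reads.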