Implement Levenshtein term similarity matrix and fast SCM between corpora #2016

Merged (66 commits), Jan 14, 2019

Changes from 7 commits

Commits
ccadc8d
Wrap docstring for WordEmbeddingsKeyedVectors.similarity_matrix
Witiko Mar 26, 2018
517bcc8
Add the gensim.models.levenshtein module
Witiko Mar 26, 2018
e71b6ff
Add projected density to term similarity matrix logs
Witiko Mar 27, 2018
b8425af
Add tests for the gensim.models.levenshtein.similarity_matrix function
Witiko Apr 3, 2018
80c13ef
Separate similarity_matrix methods into director and builder classes.
Witiko Apr 4, 2018
6f6cdb7
Add symmetric parameter to SparseTermSimilarityMatrix
Witiko Apr 4, 2018
7274fac
Add corpus support to SparseTermSimilarityMatrix.inner_product
Witiko Apr 4, 2018
27e76b8
Replace scipy.sparse.dok_matrix.has_key with the in operator
Witiko Apr 5, 2018
739383a
Fix handling of unicode in Python 3 in levsim
Witiko Apr 5, 2018
9ecae3c
Remove temporary method similarity of LevenshteinSimilarityIndex
Witiko Apr 5, 2018
49a2160
Move models.term_similarity, and levenshtein to similarities
Witiko Apr 11, 2018
c5669fc
Make python-Levenshtein a conditional import
Witiko Apr 11, 2018
7b774dd
Add default values to gensim.similarities.levenshtein.levsim arguments
Witiko Apr 11, 2018
2e8d4fa
Remove extraneous addition operators from @deprecated annotations
Witiko Apr 11, 2018
a6e295f
Remove @deprecated annotation from tests
Witiko Apr 11, 2018
13948dc
Merge test_term_similarity, and test_levenshtein with test_similarities
Witiko Apr 11, 2018
a9706de
Reword TermSimilarityIndex docstring
Witiko Apr 11, 2018
5e3e948
Consume no more than topn similarities produced by a TermSimilarityIndex
Witiko Apr 11, 2018
4b895ff
Use short uints (<64b) for dok_matrix keys and num_nonzero array
Witiko Apr 12, 2018
5c100a9
Write to matrix_nonzero only when building a symmetric matrix
Witiko Apr 16, 2018
0efed5e
Ensure UniformTermSimilarityIndex does not yield only topn - 1 values
Witiko Apr 16, 2018
0c3549b
Document _shortest_uint_dtype
Witiko Apr 16, 2018
ee33db8
Add soft cosine measure benchmark, part 1
Witiko Apr 22, 2018
da6e6dd
Add soft cosine measure benchmark, part 2
Witiko Apr 23, 2018
d4053b2
Make similarity_matrix support non-contiguous dictionaries
Witiko May 13, 2018
093d569
Support fast inner product between a document and a corpus
Witiko May 20, 2018
c2888b4
Support fast inner product between a document and a corpus (python 2.7)
Witiko May 20, 2018
32cb4d7
Add faster sparse matrix slicing
Witiko Jul 1, 2018
099d768
Make Soft Cosine Measure support non-contiguous dictionaries
Witiko Jul 1, 2018
dd4561d
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 1, 2018
c8f6ef5
Remove gensim::similarities::levenshtein::similarity_matrix facade
Witiko Jul 1, 2018
8f026cc
Implement SoftCosineSimilarity using the inner_product method
Witiko Jul 1, 2018
227d09e
Fix flake8 warnings
Witiko Jul 1, 2018
9f8d0e8
Make Soft Cosine Measure support non-contiguous dictionaries (cont)
Witiko Jul 1, 2018
c316b95
Remove parallelization in gensim::similarities::levenshtein
Witiko Jul 2, 2018
d6b9bd4
Document future work
Witiko Jul 2, 2018
5e52477
Update Soft Cosine Measure benchmark after commits 093d569, and c316b95
Witiko Jul 12, 2018
4b46597
Update SCM tutorial after PR 2016
Witiko Jul 12, 2018
ce95fd9
Add example to gensim::similarities::termsim::SparseTermSimilarityMatrix
Witiko Jul 12, 2018
f8ff4c7
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 12, 2018
ac60615
Add max_distance kwarg to gensim::similarities::levenshtein::levsim
Witiko Jul 13, 2018
5154569
Replace max_distance kwarg in levsim with min_similarity, add tests
Witiko Jul 22, 2018
729d185
Remove conditional expression from levsim
Witiko Jul 23, 2018
155dc58
Use less confusing wording in docsting for min_similarity / max_distance
Witiko Jul 23, 2018
7e52ef8
Defer thresholding in LevenshteinSimilarityIndex.most_similar to levsim
Witiko Jul 23, 2018
3866bc9
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jul 30, 2018
a7ee779
Allow None value of nonzero_limit parameter in SparseTermSimilarityMa…
Witiko Aug 16, 2018
e4395e0
Add positive_definite parameter to SparseTermSimilarityMatrix
Witiko Aug 16, 2018
98f3f3d
Split test_building test into a number of atomic unit tests
Witiko Aug 16, 2018
2a55786
Presort dictionary keys in UniformTermSimilarityIndex constructor
Witiko Aug 17, 2018
4d8dc48
Make documentation of SparseTermSimilarityMatrix more accurate
Witiko Aug 25, 2018
d7fd3f1
Make SparseTermSimilarityMatrix expect negative similarities
Witiko Aug 25, 2018
46a477e
Avoid expensive array copying in dot_product
Witiko Sep 9, 2018
583c9c7
Update SCM tutorial, and benchmark after PR 2016
Witiko Sep 11, 2018
4f26de0
Merge branch 'develop' into levenshtein-softcossim
Witiko Sep 11, 2018
4d8338e
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jan 9, 2019
1cc4a49
Remove fluff from stderr in the SCM tutorial notebook
Witiko Jan 11, 2019
9ede310
Add a paper reference to the SCM tutorial notebook
Witiko Jan 11, 2019
c523aa5
Directly import Levenshtein package in levdist
Witiko Jan 11, 2019
e031630
Use embedded URI instead of indirect hyperlink target in documentation
Witiko Jan 11, 2019
19bedf1
Assume that max of lens is always an integer
Witiko Jan 11, 2019
83a07af
Make LevenshteinSimilarityIndex.most_similar easier to read
Witiko Jan 11, 2019
f3258d9
Merge remote-tracking branch 'upstream/develop' into levenshtein-soft…
Witiko Jan 11, 2019
16ff7ef
Make LevenshteinSimilarityIndex.most_similar easier to read
Witiko Jan 12, 2019
12ee910
Add an ordering test for LevenshteinSimilarityIndex.most_similar
Witiko Jan 12, 2019
3f04940
Make WordEmbeddingSimilarityIndex.most_similar easier to read
Witiko Jan 12, 2019
12 changes: 10 additions & 2 deletions gensim/matutils.py
@@ -14,6 +14,7 @@
import math

from gensim import utils
from gensim.utils import deprecated

import numpy as np
import scipy.sparse
@@ -775,6 +776,9 @@ def cossim(vec1, vec2):
return result


Review thread on the deprecation message below:
Contributor: nitpick: no need to use + for concatenation if this happens in ().
Contributor Author: I will fix this once we figure out what to actually deprecate.

@deprecated(
    "Function will be removed in 4.0.0, use " +
    "gensim.models.term_similarity.SparseTermSimilarityMatrix.inner_product instead")
def softcossim(vec1, vec2, similarity_matrix):
"""Get Soft Cosine Measure between two vectors given a term similarity matrix.

@@ -789,8 +793,10 @@ def softcossim(vec1, vec2, similarity_matrix):
vec2 : list of (int, float)
A document vector in the BoW format.
similarity_matrix : {:class:`scipy.sparse.csc_matrix`, :class:`scipy.sparse.csr_matrix`}
- A term similarity matrix, typically produced by
- :meth:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`.
+ A term similarity matrix. If the matrix is :class:`scipy.sparse.csr_matrix`, it is going
+ to be transposed. If you rely on the fact that there is at most a constant number of
+ non-zero elements in a single column, it is your responsibility to ensure that the matrix
+ is symmetric.

Returns
-------
@@ -806,6 +812,8 @@
--------
:meth:`gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`
A term similarity matrix produced from term embeddings.
:func:`gensim.models.levenshtein.similarity_matrix`
A term similarity matrix produced from Levenshtein distances.
:class:`gensim.similarities.docsim.SoftCosineSimilarity`
A class for performing corpus-based similarity queries with Soft Cosine Measure.

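For orientation, a minimal usage sketch of the function being deprecated here. It is not taken from the PR itself; it assumes the gensim 3.x API that this revision still exposes (Word2Vec's `size` parameter, `model.wv.similarity_matrix`, and `matutils.softcossim`), and the toy corpus is purely illustrative:

```python
from gensim import matutils
from gensim.corpora import Dictionary
from gensim.models import Word2Vec

# Toy corpus and embeddings, purely illustrative.
sentences = [
    ["government", "speaks", "to", "the", "media", "in", "illinois"],
    ["president", "greets", "the", "press", "in", "chicago"],
]
model = Word2Vec(sentences, min_count=1, size=20, seed=42)
dictionary = Dictionary(sentences)
bow1, bow2 = (dictionary.doc2bow(s) for s in sentences)

# Sparse CSC term similarity matrix M; softcossim then evaluates
# x^T M y / (sqrt(x^T M x) * sqrt(y^T M y)) for the two BoW vectors.
similarity_matrix = model.wv.similarity_matrix(dictionary)
print(matutils.softcossim(bow1, bow2, similarity_matrix))
```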
5 changes: 4 additions & 1 deletion gensim/models/__init__.py
@@ -13,15 +13,18 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
- from .keyedvectors import KeyedVectors # noqa:F401
+ from .keyedvectors import KeyedVectors, WordEmbeddingSimilarityIndex # noqa:F401
from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
from .atmodel import AuthorTopicModel # noqa:F401
from .ldaseqmodel import LdaSeqModel # noqa:F401
from .fasttext import FastText # noqa:F401
from .translation_matrix import TranslationMatrix, BackMappingTranslationMatrix # noqa:F401
from .term_similarity import TermSimilarityIndex, UniformTermSimilarityIndex, SparseTermSimilarityMatrix # noqa:F401
from .levenshtein import LevenshteinSimilarityIndex # noqa:F401

from . import levenshtein # noqa:F401
from . import wrappers # noqa:F401
from . import deprecated # noqa:F401
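
Since this revision also re-exports the new LevenshteinSimilarityIndex at the package level, here is a hedged sketch of how it plugs into the same machinery. The import paths follow this revision of the PR (later commits in the list above move the term_similarity and levenshtein modules under gensim.similarities), python-Levenshtein is assumed to be installed, and the inner_product keyword follows the docstrings elsewhere in this PR:

```python
from gensim.corpora import Dictionary
from gensim.models import LevenshteinSimilarityIndex, SparseTermSimilarityMatrix

# Toy documents; the Levenshtein index needs no trained embeddings.
documents = [["holiday", "in", "spain"], ["holidays", "in", "portugal"]]
dictionary = Dictionary(documents)

termsim_index = LevenshteinSimilarityIndex(dictionary)  # term similarity from edit distance
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

query, document = (dictionary.doc2bow(doc) for doc in documents)
print(similarity_matrix.inner_product(query, document, normalized=True))  # soft cosine similarity
```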

148 changes: 70 additions & 78 deletions gensim/models/keyedvectors.py
@@ -76,13 +76,15 @@
double, array, zeros, vstack, sqrt, newaxis, integer, \
ndarray, sum as np_sum, prod, argmax, divide as np_divide
import numpy as np

from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc
from gensim.corpora.dictionary import Dictionary
from six import string_types, integer_types
from six.moves import xrange, zip
- from scipy import sparse, stats
+ from scipy import stats
from gensim.utils import deprecated
from gensim.models.utils_any2vec import _save_word2vec_format, _load_word2vec_format, _compute_ngrams, _ft_hash
from gensim.models.term_similarity import TermSimilarityIndex, SparseTermSimilarityMatrix

logger = logging.getLogger(__name__)

@@ -497,33 +499,33 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

@deprecated(
"Method will be removed in 4.0.0, use " +
"gensim.models.keyedvectors.WordEmbeddingSimilarityIndex instead")
def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=REAL):
"""Constructs a term similarity matrix for computing Soft Cosine Measure.

- Constructs a a sparse term similarity matrix in the :class:`scipy.sparse.csc_matrix` format for computing
- Soft Cosine Measure between documents.
+ Constructs a sparse term similarity matrix in the :class:`scipy.sparse.csc_matrix` format
+ for computing Soft Cosine Measure between documents.

Parameters
----------
dictionary : :class:`~gensim.corpora.dictionary.Dictionary`
- A dictionary that specifies a mapping between words and the indices of rows and columns
- of the resulting term similarity matrix.
- tfidf : :class:`gensim.models.tfidfmodel.TfidfModel`, optional
- A model that specifies the relative importance of the terms in the dictionary. The rows
- of the term similarity matrix will be build in a decreasing order of importance of terms,
- or in the order of term identifiers if None.
+ A dictionary that specifies the considered terms.
+ tfidf : :class:`gensim.models.tfidfmodel.TfidfModel` or None, optional
+ A model that specifies the relative importance of the terms in the dictionary. The
+ columns of the term similarity matrix will be build in a decreasing order of importance
+ of terms, or in the order of term identifiers if None.
threshold : float, optional
- Only pairs of words whose embeddings are more similar than `threshold` are considered
- when building the sparse term similarity matrix.
+ Only embeddings more similar than `threshold` are considered when retrieving word
+ embeddings closest to a given word embedding.
exponent : float, optional
- The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
+ Take the word embedding similarities larger than `threshold` to the power of `exponent`.
nonzero_limit : int, optional
- The maximum number of non-zero elements outside the diagonal in a single row or column
- of the term similarity matrix. Setting `nonzero_limit` to a constant ensures that the
- time complexity of computing the Soft Cosine Measure will be linear in the document
- length rather than quadratic.
+ The maximum number of non-zero elements outside the diagonal in a single column of the
+ sparse term similarity matrix.
dtype : numpy.dtype, optional
- Data-type of the term similarity matrix.
+ Data-type of the sparse term similarity matrix.

Returns
-------
@@ -536,75 +538,22 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
The Soft Cosine Measure.
:class:`gensim.similarities.docsim.SoftCosineSimilarity`
A class for performing corpus-based similarity queries with Soft Cosine Measure.
:func:`gensim.models.levenshtein.similarity_matrix`
A term similarity matrix produced from Levenshtein distances.


Notes
-----
The constructed matrix corresponds to the matrix Mrel defined in section 2.1 of
- `Delphine Charlet and Geraldine Damnati, "SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity
- between Questions for Community Question Answering", 2017
+ `Delphine Charlet and Geraldine Damnati, "SimBow at SemEval-2017 Task 3: Soft-Cosine
+ Semantic Similarity between Questions for Community Question Answering", 2017
<http://www.aclweb.org/anthology/S/S17/S17-2051.pdf>`__.

"""
- logger.info("constructing a term similarity matrix")
- matrix_order = len(dictionary)
- matrix_nonzero = [1] * matrix_order
- matrix = sparse.identity(matrix_order, dtype=dtype, format="dok")
- num_skipped = 0
- # Decide the order of rows.
- if tfidf is None:
- word_indices = range(matrix_order)
- else:
- assert max(tfidf.idfs) < matrix_order
- word_indices = [
- index for index, _
- in sorted(tfidf.idfs.items(), key=lambda x: (x[1], -x[0]), reverse=True)
- ]
-
- # Traverse rows.
- for row_number, w1_index in enumerate(word_indices):
- if row_number % 1000 == 0:
- logger.info(
- "PROGRESS: at %.02f%% rows (%d / %d, %d skipped, %.06f%% density)",
- 100.0 * (row_number + 1) / matrix_order, row_number + 1, matrix_order,
- num_skipped, 100.0 * matrix.getnnz() / matrix_order**2)
- w1 = dictionary[w1_index]
- if w1 not in self.vocab:
- num_skipped += 1
- continue # A word from the dictionary is not present in the word2vec model.
-
- # Traverse upper triangle columns.
- if matrix_order <= nonzero_limit + 1: # Traverse all columns.
- columns = (
- (w2_index, self.similarity(w1, dictionary[w2_index]))
- for w2_index in range(w1_index + 1, matrix_order)
- if w1_index != w2_index and dictionary[w2_index] in self.vocab)
- else: # Traverse only columns corresponding to the embeddings closest to w1.
- num_nonzero = matrix_nonzero[w1_index] - 1
- columns = (
- (dictionary.token2id[w2], similarity)
- for _, (w2, similarity)
- in zip(
- range(nonzero_limit - num_nonzero),
- self.most_similar(positive=[w1], topn=nonzero_limit - num_nonzero)
- )
- if w2 in dictionary.token2id
- )
- columns = sorted(columns, key=lambda x: x[0])
-
- for w2_index, similarity in columns:
- # Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
- if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
- element = similarity**exponent
- matrix[w1_index, w2_index] = element
- matrix_nonzero[w1_index] += 1
- matrix[w2_index, w1_index] = element
- matrix_nonzero[w2_index] += 1
- logger.info(
- "constructed a term similarity matrix with %0.6f %% nonzero elements",
- 100.0 * matrix.getnnz() / matrix_order**2
- )
- return matrix.tocsc()
+ index = WordEmbeddingSimilarityIndex(self, threshold=threshold, exponent=exponent)
+ similarity_matrix = SparseTermSimilarityMatrix(
+ index, dictionary, tfidf=tfidf, nonzero_limit=nonzero_limit, dtype=dtype)
+ return similarity_matrix.matrix
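
The rewritten method above simply delegates to the two new classes and returns their .matrix attribute. Per the docstring, the off-diagonal entries of that matrix are similarity**exponent for word pairs whose similarity exceeds threshold, capped at nonzero_limit non-zero entries per column, with ones on the diagonal; the soft cosine of BoW vectors x and y is then x^T M y / (sqrt(x^T M x) * sqrt(y^T M y)). A hedged, self-contained sketch of the equivalent direct usage (import paths follow this revision of the PR; the toy corpus is illustrative only):

```python
from gensim.corpora import Dictionary
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex, SparseTermSimilarityMatrix

sentences = [["soft", "cosine", "measure"], ["cosine", "similarity", "measure"]]
model = Word2Vec(sentences, min_count=1, size=20, seed=42)   # toy embeddings only
dictionary = Dictionary(sentences)

termsim_index = WordEmbeddingSimilarityIndex(model.wv, threshold=0.0, exponent=2.0)
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, nonzero_limit=100)

# .matrix is the scipy.sparse.csc_matrix that the deprecated method returns;
# inner_product() computes the normalized soft cosine product between BoW vectors directly.
bow1, bow2 = (dictionary.doc2bow(s) for s in sentences)
print(similarity_matrix.inner_product(bow1, bow2, normalized=True))
```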

def wmdistance(self, document1, document2):
"""
@@ -1110,6 +1059,49 @@ def init_sims(self, replace=False):
self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)


class WordEmbeddingSimilarityIndex(TermSimilarityIndex):
"""
Computes cosine similarities between word embeddings and retrieves the closest word embeddings
by cosine similarity for a given word embedding.

Parameters
----------
keyedvectors : :class:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors`
The word embeddings.
threshold : float, optional
Only embeddings more similar than `threshold` are considered when retrieving word embeddings
closest to a given word embedding.
exponent : float, optional
Take the word embedding similarities larger than `threshold` to the power of `exponent`.
kwargs : dict or None
A dict with keyword arguments that will be passed to the `keyedvectors.most_similar` method
when retrieving the word embeddings closest to a given word embedding.

See Also
--------
:class:`~gensim.models.term_similarity.SparseTermSimilarityMatrix`
Build a term similarity matrix and compute the Soft Cosine Measure.

"""
def __init__(self, keyedvectors, threshold=0.0, exponent=2.0, kwargs=None):
assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
self.keyedvectors = keyedvectors
self.threshold = threshold
self.exponent = exponent
self.kwargs = kwargs or {}
super(WordEmbeddingSimilarityIndex, self).__init__()

def most_similar(self, t1, topn=10):
if t1 not in self.keyedvectors.vocab:
logger.debug('an out-of-dictionary term "%s"', t1)
else:
for _, (t2, similarity) in zip(
range(topn), self.keyedvectors.most_similar(
positive=[t1], topn=topn, **self.kwargs)):
if similarity > self.threshold:
yield (t2, similarity**self.exponent)
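
To make the generator contract above concrete, a hedged usage sketch with a toy model (not part of the diff): most_similar yields at most topn (term, similarity**exponent) pairs whose similarity exceeds threshold, and yields nothing for an out-of-vocabulary term.

```python
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex

sentences = [["cat", "dog", "mouse"], ["dog", "wolf", "fox"], ["cat", "lion", "tiger"]]
model = Word2Vec(sentences, min_count=1, size=10, seed=1)    # toy embeddings only

index = WordEmbeddingSimilarityIndex(model.wv, threshold=0.0, exponent=2.0)
print(list(index.most_similar("dog", topn=3)))      # up to 3 (term, similarity**2) pairs
print(list(index.most_similar("unicorn", topn=3)))  # [] -- out-of-vocabulary terms yield nothing
```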


class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
"""Class to contain vectors and vocab for word2vec model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.