New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary #3157

Merged
merged 39 commits into from Jun 29, 2021
Changes from 4 commits
39 commits
ab6fb90
Add KeyedVectors.vectors_for_all
Witiko May 25, 2021
98ed69d
Add examples for KeyedVectors.vectors_for_all
Witiko May 25, 2021
be1746b
Support Dictionary in KeyedVectors.vectors_for_all
Witiko May 28, 2021
d81df64
Don't sort keys in KeyedVectors.vectors_for_all, just deduplicate
Witiko May 28, 2021
ef8bea6
Use docstrings in imperative mode (PEP8)
Witiko May 28, 2021
d602018
Guard against KeyError in KeyedVectors.vectors_for_all
Witiko May 28, 2021
13a7ecd
Unit-test dictionary parameter of KeyedVectors.vectors_for_all
Witiko May 28, 2021
6a8c688
Order dictionary by decreasing cfs in KeyedVectors.vectors_for_all
Witiko May 28, 2021
9ebe808
Add allow_inference parameter to KeyedVectors.vectors_for_all
Witiko May 28, 2021
716dc32
Add copy_vecattrs parameter to KeyedVectors.vectors_for_all
Witiko May 28, 2021
77e1889
Move copy_vecattrs tests for KeyedVectors.vectors_for_all
Witiko May 28, 2021
330d5f7
Fix translation of term ids to terms in KeyedVectors.vectors_for_all
Witiko May 28, 2021
8fdda93
Fix a typo in KeyedVectors.vectors_for_all unit test
Witiko May 28, 2021
ba636a2
Do not make assumptions about fake counts in _add_word_to_kv
Witiko May 28, 2021
1a9ea9b
Document that KeyedVectors.vectors_for_all allows arbitrary keys
Witiko May 28, 2021
e5a9a31
Add notes about the behavior of KeyedVectors.vectors_for_all
Witiko May 28, 2021
5eebef0
Properly reference Dictionary in KeyedVectors.vectors_for_all docstring
Witiko May 28, 2021
26baf6d
Make deduplication in KeyedVectors.vectors_for_all a oneliner
Witiko May 31, 2021
98c070e
Remove an unnecessary temporary variable in KeyedVectors.vectors_for_all
Witiko May 31, 2021
8e4d0cf
Make deduplication in KeyedVectors.vectors_for_all a oneliner (cont.)
Witiko May 31, 2021
a4590c1
Add Dictionary.most_common
Witiko May 31, 2021
b14298b
Remove test_vectors_for_all_dictionary unit test
Witiko May 31, 2021
1cf9452
Remove a trailing bracket in an example
Witiko May 31, 2021
9c6f296
Fix unit tests for Dictionary.most_common
Witiko May 31, 2021
e78bfa3
Update an example for SparseTermSimilarityMatrix
Witiko May 31, 2021
32c14c5
Remove Gensim downloader from KeyedVectors.vectors_for_all example
Witiko Jun 22, 2021
9acbcba
Remove include_counts parameter from Dictionary.most_common
Witiko Jun 22, 2021
712ee61
Shorten the KeyedVectors.vectors_for_all example
Witiko Jun 22, 2021
b8625a5
Remove include_counts parameter from Dictionary.most_common (cont.)
Witiko Jun 22, 2021
4aacad2
Use pytest assertion syntax in unit tests
Witiko Jun 22, 2021
a86522c
Remove an unnecessary comment in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
7ea8337
Remove an unnecessary comment in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
f08c582
Remove an unnecessary variable in KeyedVectors.vectors_for_all
Witiko Jun 22, 2021
ebc276d
Make the creation of new vocab in KeyedVectors.vectors_for_all explicit
Witiko Jun 22, 2021
3bf7f33
Make AnnoyIndexer use the correct word-vectors in example
Witiko Jun 22, 2021
68b5fc1
Apply suggestions from code review
mpenkov Jun 29, 2021
52e5ee8
Apply suggestions from code review
mpenkov Jun 29, 2021
4dc3756
Update CHANGELOG.md
mpenkov Jun 29, 2021
d319144
Merge branch 'develop' into feature/vectors-for-all
mpenkov Jun 29, 2021
40 changes: 40 additions & 0 deletions gensim/models/keyedvectors.py
@@ -171,6 +171,8 @@
import itertools
import warnings
from numbers import Integral
from typing import Iterable, Union
from collections import OrderedDict

from numpy import (
dot, float32 as REAL, double, array, zeros, vstack,
@@ -1695,6 +1697,44 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
msg=f"merged {overlap_count} vectors into {self.vectors.shape} matrix from {fname}",
)

def vectors_for_all(self, keys: Union[Iterable, Dictionary]) -> 'KeyedVectors':
"""Produces vectors for all given keys.

Notes
-----
A new :class:`KeyedVectors` object will always be produced.

In subclasses such as :class:`~gensim.models.fasttext.FastTextKeyedVectors`,
vectors for out-of-vocabulary keys (words) may be inferred. In other classes
such as :class:`KeyedVectors`, out-of-vocabulary keys will be omitted
from the produced :class:`KeyedVectors` object.

Additional attributes set via the :meth:`KeyedVectors.set_vecattr` method
will not be preserved in the produced :class:`KeyedVectors` object.

Parameters
----------
keys : {iterable of str, Dictionary}
The keys that will be vectorized.

Returns
-------
keyedvectors : :class:`~gensim.models.keyedvectors.KeyedVectors`
Vectors for all the given keys.

"""
if isinstance(keys, Dictionary):
    keys = keys.token2id
# Deduplicate the keys while preserving their order, and drop any key for
# which this model cannot provide a vector. Subclasses such as
# FastTextKeyedVectors report every key as present, because they can infer
# vectors for out-of-vocabulary keys from character ngrams.
vocabulary = [key for key in OrderedDict.fromkeys(keys) if key in self]
vocab_size = len(vocabulary)
datatype = self.vectors.dtype
kv = KeyedVectors(self.vector_size, vocab_size, dtype=datatype)
for key in vocabulary:
    weights = self[key]
    _add_word_to_kv(kv, None, key, weights, vocab_size)
return kv

def _upconvert_old_d2vkv(self):
"""Convert a deserialized older Doc2VecKeyedVectors instance to latest generic KeyedVectors"""
self.vocab = self.doctags
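As an aside for readers of this diff: a minimal usage sketch of the new method (not part of the changed files, and assuming the behaviour described in the docstring above, where a plain KeyedVectors drops out-of-vocabulary keys while FastTextKeyedVectors infers them):

from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# Train a small model on gensim's bundled toy corpus.
model = Word2Vec(common_texts, vector_size=20, min_count=1)

# Request vectors for an explicit list of keys. Duplicates are removed and,
# because a plain KeyedVectors cannot infer new vectors, the made-up
# out-of-vocabulary key is dropped from the result.
keys = ['human', 'computer', 'interface', 'word-not-in-the-toy-corpus']
small_kv = model.wv.vectors_for_all(keys)

print(len(small_kv))  # 3
print((small_kv['computer'] == model.wv['computer']).all())  # True, the vector is copied unchanged

With a FastText model in place of Word2Vec, the same call would keep all four keys, inferring a vector for the out-of-vocabulary one from its character ngrams.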
25 changes: 21 additions & 4 deletions gensim/similarities/termsim.py
@@ -102,6 +102,25 @@ class WordEmbeddingSimilarityIndex(TermSimilarityIndex):
Computes cosine similarities between word embeddings and retrieves most
similar terms for a given term.

Notes
-----
By fitting the word embeddings to the vocabulary that you will be using, you
eliminate all out-of-vocabulary (OOV) words that the `most_similar` method
would otherwise return:

>>> from gensim.test.utils import common_texts, datapath
>>> from gensim.corpora import Dictionary
>>> from gensim.models import FastText
>>> from gensim.models.word2vec import LineSentence
>>> from gensim.similarities import WordEmbeddingSimilarityIndex
>>>
>>> model = FastText(common_texts, vector_size=20, min_count=1) # train word-vectors on a corpus
>>> different_corpus = LineSentence(datapath('lee_background.cor'))
>>> dictionary = Dictionary(different_corpus) # construct a vocabulary on a different corpus
>>> word_vectors = model.wv.vectors_for_all(dictionary) # remove OOV word-vectors and infer new words
>>> assert len(dictionary) == len(word_vectors) # all words from our vocabulary received their word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors)

Parameters
----------
keyedvectors : :class:`~gensim.models.keyedvectors.KeyedVectors`
@@ -409,20 +428,18 @@ class SparseTermSimilarityMatrix(SaveLoad):
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex
>>> from gensim.similarities.index import AnnoyIndexer
>>>
>>> model = Word2Vec(common_texts, vector_size=20, min_count=1)  # train word-vectors
>>> dictionary = Dictionary(common_texts)
>>> word_vectors = model.wv.vectors_for_all(dictionary)  # produce vectors for all words in the dictionary
>>> annoy = AnnoyIndexer(word_vectors, num_trees=2)  # use annoy for faster word similarity lookups
>>> termsim_index = WordEmbeddingSimilarityIndex(word_vectors, kwargs={'indexer': annoy})
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, symmetric=True, dominant=True)
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split() # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)] # calculate similarity of query to each doc from bow_corpus

Check out `the Gallery <https://radimrehurek.com/gensim/auto_examples/tutorials/run_scm.html>`_
for more examples.
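To make the out-of-vocabulary claim in the new WordEmbeddingSimilarityIndex note concrete, here is a short self-contained sketch (again not part of the diff; it reuses gensim's bundled toy corpus instead of the Lee corpus from the docstring example):

from gensim.corpora import Dictionary
from gensim.models import FastText
from gensim.similarities import WordEmbeddingSimilarityIndex
from gensim.test.utils import common_texts

model = FastText(common_texts, vector_size=20, min_count=1)  # train word-vectors
dictionary = Dictionary(common_texts)  # the vocabulary we will query with
word_vectors = model.wv.vectors_for_all(dictionary)  # restrict the vectors to that vocabulary
termsim_index = WordEmbeddingSimilarityIndex(word_vectors)

# Because the vectors were restricted first, every term the index returns is
# guaranteed to be a key of `dictionary`, so a downstream
# SparseTermSimilarityMatrix never sees out-of-dictionary terms.
for term, similarity in termsim_index.most_similar('computer', topn=5):
    assert term in dictionary.token2id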
29 changes: 29 additions & 0 deletions gensim/test/test_fasttext.py
@@ -850,6 +850,35 @@ def obsolete_testLoadOldModel(self):
self.assertEqual(model.wv.vectors_vocab.shape, (12, 100))
self.assertEqual(model.wv.vectors_ngrams.shape, (2000000, 100))

def test_vectors_for_all(self):
"""Test vectors_for_all returns expected results."""
words = [
'responding',
'approached',
'chairman',
'an out-of-vocabulary word',
'another out-of-vocabulary word',
]
vectors_for_all = self.test_model.wv.vectors_for_all(words)

expected = 5
predicted = len(vectors_for_all)
self.assertEqual(expected, predicted)

expected = self.test_model.wv['responding']
predicted = vectors_for_all['responding']
self.assertTrue(np.allclose(expected, predicted))

smaller_distance = np.linalg.norm(
vectors_for_all['an out-of-vocabulary word']
- vectors_for_all['another out-of-vocabulary word']
)
greater_distance = np.linalg.norm(
vectors_for_all['an out-of-vocabulary word']
- vectors_for_all['responding']
)
self.assertGreater(greater_distance, smaller_distance)
Reviewer suggestion (collaborator): use pytest assertion syntax, i.e. assert greater_distance > smaller_distance instead of self.assertGreater(greater_distance, smaller_distance).


with open(datapath('toy-data.txt')) as fin:
TOY_SENTENCES = [fin.read().strip().split(' ')]
19 changes: 19 additions & 0 deletions gensim/test/test_keyedvectors.py
@@ -39,6 +39,25 @@ def test_most_similar(self):
predicted = [result[0] for result in self.vectors.most_similar('war', topn=5)]
self.assertEqual(expected, predicted)

def test_vectors_for_all(self):
"""Test vectors_for_all returns expected results."""
words = [
'conflict',
'administration',
'terrorism',
'an out-of-vocabulary word',
'another out-of-vocabulary word',
]
vectors_for_all = self.vectors.vectors_for_all(words)

expected = 3
predicted = len(vectors_for_all)
self.assertEqual(expected, predicted)

expected = self.vectors['conflict']
predicted = vectors_for_all['conflict']
self.assertTrue(np.allclose(expected, predicted))
Reviewer suggestion (collaborator): use pytest assertion syntax, i.e. assert np.allclose(expected, predicted) instead of self.assertTrue(np.allclose(expected, predicted)).

def test_most_similar_topn(self):
"""Test most_similar returns correct results when `topn` is specified."""
self.assertEqual(len(self.vectors.most_similar('war', topn=5)), 5)