Commit
Update CHANGELOG.md
mpenkov authored Mar 9, 2019
1 parent cebc9db commit e69e112
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -34,6 +34,14 @@ If you want to check if a word is an in-vocabulary term, use this instead:
>>> model = FastText.load_fasttext_format(cap_path, full_model=False)
>>> 'steamtrain' in model.wv.vocab # If False, is an OOV term
False

There are several important consequences of the above change:

1. `'any_word' in model` will always return `True`. Previously, it returned `True` only if the word was in the vocabulary.
2. `model['any_word']` will always return a vector. Previously, it raised `KeyError` for OOV words when the model had no vectors for **any** ngrams of the word.
3. Higher demand on CPU and memory, because this change reverts an [optimization](https://github.com/RaRe-Technologies/gensim/pull/1916#issuecomment-369171508) that sacrificed compatibility and correctness for lower CPU and memory demand.

The main motivation behind this change was consistency with the reference implementation from Facebook.

#### Loading models in Facebook .bin format

2 comments on commit e69e112

@gojomo
Collaborator

@gojomo gojomo commented on e69e112 Mar 9, 2019

This better covers implications for users, thanks! But for clarity, I would lead with the "main motivation", and cover the workaround (for testing if a word is OOV) inline in point (1).

Separately, I'm not sure the original only-retain-seen-ngrams behavior really was a memory or speed optimization for important workloads, like larger corpuses. For example:

(1) With large corpuses, the collision-oblivious fixed-size hashtable of ngrams could have most of its slots trained by one or more ngrams. So, slimming it to just used buckets wouldn't save much memory. (It's only when the default bucket count is far out-of-proportion to the number of ngrams seen that the prior optimization would save a lot of memory.)
(2) Meanwhile, maintaining/consulting that list/slot-lookup-dict of seen-ngrams required some new memory (and computation) not needed for the FastText-conformant behavior. With a small corpus, an OOV word might have few ngram-hits, so skipping the vector-lookup/vector-averaging of all the missing potential ngrams might speed things up. In larger corpuses, though, an ever-greater proportion of ngrams will in fact be 'known'.
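
As a rough back-of-envelope illustration of point (1), here is a minimal sketch. It assumes ngrams hash uniformly into the buckets, and it assumes 2,000,000 buckets with 100-dimensional float32 vectors. The function names and the ngram counts are hypothetical, picked only to show the trend, not measured on any real corpus.

```python
import math

def expected_bucket_occupancy(distinct_ngrams, buckets=2_000_000):
    """Expected fraction of buckets hit by at least one ngram,
    assuming ngrams hash uniformly at random (an idealization)."""
    return 1.0 - math.exp(-distinct_ngrams / buckets)

def table_megabytes(rows, vector_size=100, bytes_per_float=4):
    """Size of a dense table of `rows` float vectors, in MB (10^6 bytes)."""
    return rows * vector_size * bytes_per_float / 1e6

# Hypothetical distinct-ngram counts, from toy-sized to large corpora.
for n in (50_000, 500_000, 5_000_000, 20_000_000):
    occupied = expected_bucket_occupancy(n)
    full = table_megabytes(2_000_000)
    slimmed = table_megabytes(round(occupied * 2_000_000))
    print(f"{n:>12,} ngrams: {occupied:6.1%} of buckets used, "
          f"full table {full:.0f} MB, slimmed table {slimmed:.0f} MB")
```

Under these assumptions the slimmed table only saves much memory when the number of distinct ngrams is small relative to the bucket count. The estimate also ignores the extra lookup structure that the slimmed variant needs, which is the cost described in point (2).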

So I wouldn't be surprised if that particular "optimization" helped with toy-sized data (like say 100MB text8), but used more memory and computation on other corpuses (like say 10GB+ Wikipedia dumps). Looking at last year's #1916, I think reported speedups there were from cythonization and other tricks, not retaining the seen-ngrams. I'm a bit confused because the descriptive text at #1916 suggests that PR was already, as of May 2018, removing the model's cache of seen-ngrams. Yet the persistence of `KeyError: all ngrams for word X absent from model` errors through gensim 3.7.1 suggests that PR #1916 did not have that effect.

In any case, I wouldn't characterize this fix as necessarily a memory/speed hit unless that's freshly demonstrated on real workloads.

So I'd word the change notes as:

#### Out-of-vocab (OOV) word handling

To achieve consistency with the reference implementation from Facebook,
a `FastText` model will now always report any word, out-of-vocabulary or 
not, as being in the model,  and always return some vector for any word 
looked-up. Specifically:

1. `'any_word' in ft_model` will always return `True`.  Previously, it 
returned `True` only if the full word was in the vocabulary. (To test if a 
full word is in the known vocabulary, you can consult the `wv.vocab` 
property: `'any_word' in ft_model.wv.vocab` will return `False` if the full 
word wasn't learned during model training.)
2. `ft_model['any_word']` will always return a vector.  Previously, it 
raised `KeyError` for OOV words when the model had no vectors 
for **any** ngrams of the word.
3. Models may use more memory, or take longer for word-vector
lookup, especially after training on smaller corpuses where the previous
non-compliant behavior discarded some ngrams from consideration.
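
To illustrate why every word can get a vector under the conformant behavior, here is a toy sketch in plain Python. It is not gensim's or Facebook's code: the names `bucket_vectors`, `char_ngrams`, `bucket_of` and `oov_vector` are hypothetical, `zlib.crc32` is only a stand-in for the real hash function, and the tiny table sizes are chosen for readability. The assumption it illustrates is that every bucket of the fixed-size table holds a vector, so every ngram of any word maps to something.

```python
import zlib

BUCKETS = 8        # tiny on purpose; real models use far more buckets
VECTOR_SIZE = 3    # likewise tiny, for readability

# Stand-in for the trained bucket table: every bucket holds a vector,
# whether or not any ngram landed in it during training.
bucket_vectors = [[float(b + 1)] * VECTOR_SIZE for b in range(BUCKETS)]

def char_ngrams(word, min_n=3, max_n=6):
    """Character ngrams of the word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

def bucket_of(ngram):
    """Map an ngram to a bucket index (crc32 is a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS

def oov_vector(word):
    """Average the bucket vectors of all ngrams of the word."""
    ngrams = char_ngrams(word)
    if not ngrams:
        raise ValueError("word is too short to produce any ngrams")
    total = [0.0] * VECTOR_SIZE
    for ngram in ngrams:
        for k, value in enumerate(bucket_vectors[bucket_of(ngram)]):
            total[k] += value
    return [x / len(ngrams) for x in total]

print(oov_vector("steamtrain"))  # a vector, even though the word was never "trained"
```

Because the lookup goes straight to a bucket index, there is no "seen ngrams" set to consult, which is why a `KeyError` for a word with no known ngrams cannot arise in this scheme.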

@mpenkov
Collaborator Author

@mpenkov mpenkov commented on e69e112 Mar 9, 2019

OK, I added a new commit, please have a look.