Impossible to roll up phrases of more than 2 words while lowering the threshold #1466
Labels
- difficulty medium: requires good gensim understanding & Python skills
- feature: request for a new feature
One of the word2vec papers details a method for learning phrases, and this method is implemented in models.phrases.Phrases. One aspect of the method is currently not supported:
With the current implementation of models.phrases.Phrases, lowering the threshold to find n-grams with n > 2 does not work: you will also learn every 2-gram that meets the new, lower threshold. This means you cannot learn n-grams with n > 2 unless you keep the threshold constant or raise it as you increase n.
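To see why, here is a minimal sketch of the phrase score from the word2vec paper (the same shape as gensim's default scorer, `(count(ab) - min_count) * vocab_size / (count(a) * count(b))`); the counts and thresholds below are hypothetical, not taken from any real corpus:

```python
# Sketch of the word2vec phrase score, showing why a lower threshold
# admits new 2-grams instead of only extending existing phrases.
def phrase_score(count_ab, count_a, count_b, vocab_size, min_count=5):
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

vocab_size = 10_000
# Hypothetical counts for two candidate bigrams.
strong = phrase_score(count_ab=80, count_a=100, count_b=120, vocab_size=vocab_size)
weak = phrase_score(count_ab=12, count_a=150, count_b=200, vocab_size=vocab_size)

# At threshold=10 only the strong pair is promoted ...
assert strong > 10.0 and weak < 10.0
# ... but on a second pass with threshold=2, the weak 2-gram now
# clears the bar too, alongside any genuine 3-gram candidates.
assert strong > 2.0 and weak > 2.0
```

The score has no memory of which tokens are already phrases, so any pass with a lower threshold sweeps in fresh 2-grams.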
I implemented one workaround: change the delimiter each time a new Phrases object is built to learn a larger n, and add a check to export_phrases for the previous n's delimiter, so that only phrases containing a word joined in the previous pass are exported. But this is an ugly solution.
A better solution might be to use an existing Phrases (or Phraser) object to generate an optional "whitelist" when initializing a new Phrases object. Only candidate phrases with at least one term in the whitelist would then be considered by the new object. (Generating the whitelist could perhaps just be a call to whatever addresses #1465.) But this also means keeping another potentially very large dictionary in the Phrases object, and performing many lookups against it when learning/exporting phrases.
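A hedged sketch of the proposed whitelist filter, with all names hypothetical (this is not existing gensim API):

```python
# A prior pass supplies a set of phrase tokens; the next pass only
# scores candidate pairs that touch that set.
def filter_candidates(pair_counts, whitelist):
    # Keep only pairs where at least one component is a known phrase.
    return {pair: n for pair, n in pair_counts.items()
            if pair[0] in whitelist or pair[1] in whitelist}

whitelist = {"new_york"}          # e.g. exported from a previous Phrases
pair_counts = {
    ("new_york", "city"): 40,     # extends an existing phrase -> kept
    ("city", "hall"): 55,         # plain 2-gram -> filtered out
}
survivors = filter_candidates(pair_counts, whitelist)
assert set(survivors) == {("new_york", "city")}
```

The cost noted above shows up here too: `whitelist` is a second large set held in memory, and every candidate pair incurs membership lookups against it.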
If there's some kind of consensus about the best way to add this support, I can maybe give this a try.