Impossible to roll up phrases of more than 2 words while lowering the threshold #1466
Labels
- difficulty medium: requires good gensim understanding & Python skills
- feature: request for a new feature
One of the word2vec papers details a method for learning phrases, and this method is implemented in models.phrases.Phrases. One aspect of the method is currently not supported:
With the current implementation of models.phrases.Phrases, lowering the threshold to find n-grams with n > 2 does not work: you will also learn every 2-gram that meets the new, lower threshold. This means you cannot learn n-grams with n > 2 unless you keep the threshold constant or raise it as you increase n.
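To see why, here is a minimal sketch of the phrase score from the word2vec paper (the same shape as gensim's default scorer, `(count(ab) - min_count) * vocab_size / (count(a) * count(b))`); the counts and thresholds below are hypothetical, not taken from any real corpus:

```python
# Sketch of the word2vec phrase score, showing why a lower threshold
# admits new 2-grams instead of only extending existing phrases.
def phrase_score(count_ab, count_a, count_b, vocab_size, min_count=5):
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

vocab_size = 10_000
# Hypothetical counts for two candidate bigrams.
strong = phrase_score(count_ab=80, count_a=100, count_b=120, vocab_size=vocab_size)
weak = phrase_score(count_ab=12, count_a=150, count_b=200, vocab_size=vocab_size)

# At threshold=10 only the strong pair is promoted ...
assert strong > 10.0 and weak < 10.0
# ... but on a second pass with threshold=2, the weak 2-gram now
# clears the bar too, alongside any genuine 3-gram candidates.
assert strong > 2.0 and weak > 2.0
```

The score has no memory of which tokens are already phrases, so any pass with a lower threshold sweeps in fresh 2-grams.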
I implemented one workaround: change the delimiter each time a new Phrases object is built to learn a larger n, and add a check to export_phrases for the previous n's delimiter, so that only phrases containing a word joined in the previous pass are exported. But this is an ugly solution.
A better solution might be to use an existing Phrases (or Phraser) object to generate an optional "whitelist" when initializing a new Phrases object. Only candidate phrases with at least one term in the whitelist would then be considered by the new object. (Generating the whitelist could perhaps just be a call to whatever addresses #1465.) But this also means keeping another potentially very large dictionary in the Phrases object, and performing many lookups against it when learning/exporting phrases.
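A hedged sketch of the proposed whitelist filter, with all names hypothetical (this is not existing gensim API):

```python
# A prior pass supplies a set of phrase tokens; the next pass only
# scores candidate pairs that touch that set.
def filter_candidates(pair_counts, whitelist):
    # Keep only pairs where at least one component is a known phrase.
    return {pair: n for pair, n in pair_counts.items()
            if pair[0] in whitelist or pair[1] in whitelist}

whitelist = {"new_york"}          # e.g. exported from a previous Phrases
pair_counts = {
    ("new_york", "city"): 40,     # extends an existing phrase -> kept
    ("city", "hall"): 55,         # plain 2-gram -> filtered out
}
survivors = filter_candidates(pair_counts, whitelist)
assert set(survivors) == {("new_york", "city")}
```

The cost noted above shows up here too: `whitelist` is a second large set held in memory, and every candidate pair incurs membership lookups against it.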
If there's some kind of consensus about the best way to add this support, I can maybe give this a try.