Impossible to roll up phrases of more than 2 words while lowering the threshold #1466

Open
michaelwsherman opened this issue Jul 5, 2017 · 0 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@michaelwsherman
Contributor

In one of the word2vec papers, a method for learning phrases is detailed, and this method is implemented in models.phrases.Phrases. One aspect of the method is currently not supported:

"Typically, we run 2-4 passes over the training data with decreasing threshold value, allowing longer phrases that consists of several words to be formed."

With the current implementation of models.phrases.Phrases, lowering the threshold to find n-grams with n > 2 does not work: you will also learn every 2-gram that meets the new, lower threshold. This means you cannot learn n-grams with n > 2 unless you keep the threshold constant or raise it as you increase n.
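To make the problem concrete, here is a minimal pure-Python sketch of two passes with a decreasing threshold. It is not gensim's code: `learn_bigrams`, `merge`, the toy corpus, and the threshold values are all made up for illustration, and the score is the simple form from the paper, (count(a b) - delta) / (count(a) * count(b)).

```python
from collections import Counter


def learn_bigrams(sentences, threshold, delta=0):
    """Return adjacent token pairs whose score exceeds `threshold`.

    score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    return {
        (a, b)
        for (a, b), n in bigrams.items()
        if (n - delta) / (unigrams[a] * unigrams[b]) > threshold
    }


def merge(sentences, pairs, delimiter="_"):
    """Greedily join accepted pairs, scanning each sentence left to right."""
    out = []
    for sent in sentences:
        merged, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in pairs:
                merged.append(sent[i] + delimiter + sent[i + 1])
                i += 2
            else:
                merged.append(sent[i])
                i += 1
        out.append(merged)
    return out


# Hypothetical toy corpus, one sentence per string.
corpus = [s.split() for s in [
    "new york times", "new york times", "new york city",
    "ice cream", "ice cream", "ice age", "cream cheese",
]]

# Pass 1: high threshold, only the strongest 2-gram survives.
pass1 = learn_bigrams(corpus, threshold=0.2, delta=1)
merged = merge(corpus, pass1)

# Pass 2: lower threshold, intended to pick up 3-grams only.
pass2 = learn_bigrams(merged, threshold=0.1, delta=1)

print(sorted(pass1))  # [('new', 'york')]
print(sorted(pass2))  # [('ice', 'cream'), ('new_york', 'times')]
```

The second pass finds the wanted 3-gram (`new_york` + `times`), but it also admits `ice cream`, a plain 2-gram that was rejected at the first threshold.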

I implemented one workaround for this, which involved changing the delimiter as I built successive Phrases objects to learn increasing n. I added a check to export_phrases for the presence of the previous n's delimiter, so that only phrases in which at least one word had been formed at the previous n were exported. But this is an ugly solution.
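Roughly, the delimiter check looks like the sketch below. This is a standalone illustration, not the patch I made to export_phrases: `learn_bigrams` and its `prev_delimiter` parameter are hypothetical, and the input is a toy corpus that is assumed to have already been through one pass with "_" as the delimiter.

```python
from collections import Counter


def learn_bigrams(sentences, threshold, delta=0, prev_delimiter=None):
    """Score adjacent pairs; optionally keep only pairs in which at least one
    token contains `prev_delimiter`, i.e. was formed by the previous pass."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    found = set()
    for (a, b), n in bigrams.items():
        if prev_delimiter is not None and prev_delimiter not in a and prev_delimiter not in b:
            continue  # neither word came from the previous pass
        if (n - delta) / (unigrams[a] * unigrams[b]) > threshold:
            found.add((a, b))
    return found


# Hypothetical corpus after a first pass that joined "new york" with "_".
merged = [
    ["new_york", "times"], ["new_york", "times"], ["new_york", "city"],
    ["ice", "cream"], ["ice", "cream"], ["ice", "age"], ["cream", "cheese"],
]

unfiltered = learn_bigrams(merged, threshold=0.1, delta=1)
filtered = learn_bigrams(merged, threshold=0.1, delta=1, prev_delimiter="_")

print(sorted(unfiltered))  # [('ice', 'cream'), ('new_york', 'times')]
print(sorted(filtered))    # [('new_york', 'times')]
```

The next pass would then join with a different delimiter and check for that one instead. Note that this check gives false positives if any raw token already contains the delimiter character.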

A better solution might be to use an existing Phrases (or Phraser) object to generate an optional "whitelist" when initializing a Phrases object. Then, only phrases with at least one term in the whitelist would be considered as possible phrases by the new Phrases object. (Generating the whitelist could maybe just be a call to whatever addresses #1465). But this also means another potentially very large dictionary in the Phrases object, as well as many lookups to this dictionary when learning/exporting phrases.
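A minimal sketch of the whitelist idea, assuming the whitelist is simply the set of joined tokens produced by the previous pass. Nothing here is an existing gensim API: the `whitelist` parameter, `learn_bigrams`, and the toy data are hypothetical.

```python
from collections import Counter


def learn_bigrams(sentences, threshold, delta=0, whitelist=None):
    """Score adjacent pairs; if `whitelist` is given, only consider pairs
    in which at least one token is in the whitelist."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    found = set()
    for (a, b), n in bigrams.items():
        if whitelist is not None and a not in whitelist and b not in whitelist:
            continue  # two set lookups per candidate pair
        if (n - delta) / (unigrams[a] * unigrams[b]) > threshold:
            found.add((a, b))
    return found


# Hypothetical corpus after a first pass that joined "new york".
merged = [
    ["new_york", "times"], ["new_york", "times"], ["new_york", "city"],
    ["ice", "cream"], ["ice", "cream"], ["ice", "age"], ["cream", "cheese"],
]

# Whitelist built from the phrases the previous pass exported.
whitelist = {"new_york"}

without = learn_bigrams(merged, threshold=0.1, delta=1)
with_whitelist = learn_bigrams(merged, threshold=0.1, delta=1, whitelist=whitelist)

print(sorted(without))         # [('ice', 'cream'), ('new_york', 'times')]
print(sorted(with_whitelist))  # [('new_york', 'times')]
```

Unlike the delimiter check, this does not depend on what characters appear in the raw tokens, but the whitelist has to be held in memory and consulted for every candidate pair, which is the cost mentioned above.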

If there's some kind of consensus about the best way to add this support, I can maybe give this a try.

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 3, 2017