Better support for evaluating threshold settings in models.phrases.Phrases #1465

Closed
michaelwsherman opened this issue Jul 5, 2017 · 6 comments · Fixed by #2979
Labels
difficulty medium (required good gensim understanding & python skills) · feature (issue described a new feature) · wishlist (feature request)

Comments

@michaelwsherman
Contributor

One of the challenges when learning bigrams from a new corpus is determining the right scoring threshold for accepting or rejecting a potential bigram. Standard approaches involve taking a list of gold-labeled bigrams created by humans, ranking all bigrams in your corpus by their score, and determining a score threshold based on a comparison to the gold labels. (For one example, see this paper.)

Right now, doing this with built-in functionality in models.Phrases requires running export_phrases on your whole corpus. For an especially large corpus, that could mean a lot of wasted time waiting for export_phrases to run. It can also give strange results: if you set your threshold very low, a strong bigram in your corpus may not be output by export_phrases at all if the strong bigram's first word is especially rare and is usually preceded by a limited set of words (you'd instead get a bunch of lower-scoring bigrams in which the strong bigram's first word appears as the second word).

There should be a method that only traverses the vocab dictionary and returns the scores for the bigrams in the corpus. This would be faster than export_phrases and would ensure that all bigrams (that exceed some threshold) have their score output. I have code that does something like this, and I'm happy to contribute it. (Although it might make sense to wait until after #1464 and maybe #1446 are finalized.)
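The idea can be sketched in plain Python (this is an illustration, not gensim's actual implementation): count unigrams and bigrams in one pass, score every candidate with the Mikolov-style formula gensim's Phrases uses by default, (count(a,b) − min_count) · |vocab| / (count(a) · count(b)), and return the bigrams sorted by score so a threshold can be picked against gold labels.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every candidate bigram directly from corpus counts using
    the default Mikolov-style formula:
    (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    scores = {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
        if n_ab > min_count
    }
    # Highest-scoring bigrams first, ready for comparison against gold labels.
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [["new", "york", "is", "big"], ["new", "york", "city"], ["big", "city"]]
for bigram, score in score_bigrams(corpus):
    print(bigram, round(score, 2))  # → ('new', 'york') 1.25
```

Ranking the full list like this, rather than scanning the corpus, is what makes the gold-label comparison cheap even for large vocabularies.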

@gojomo
Collaborator

gojomo commented Jul 6, 2017

Sounds useful!

@michaelwsherman
Contributor Author

@gojomo Thanks. I'll add it after the scoring PR #1464 is accepted (or rejected), since that could affect how this method works, as it depends on the scoring. Right now export_phrases is the only method in Phrases that actually calculates scores; this method will have to calculate scores as well.

@menshikh-iv added the feature, difficulty medium, and wishlist labels on Oct 2, 2017
@piskvorky
Owner

piskvorky commented Oct 10, 2020

@michaelwsherman implemented in #2979: Phrases.export_phrases() now exports all phrases (that pass the threshold), it doesn't need any corpus.

The "old" functionality of finding phrases in a corpus was renamed to Phrases.find_phrases(corpus).

@gojomo
Collaborator

gojomo commented Oct 10, 2020

Another request that comes up from time to time: add some hand-selected bigrams (or longer) that a user has independently determined they want as phrases. It might be interesting to offer some tuning tools/methods that report: "if you want X, Y, Z to be phrasified, you'd have to set the parameters to N, M, etc, but then the top-20 most-marginal phrases would be {P1, P2, ...}" (so they'd see the side effects of those settings).

Alternatively, it might be possible (and now easier/cleaner with the #2976/etc refactorings) to add an exception set of 'forced' user choices that always combine regardless of their score, or conversely an exception set of 'suppressed' phrases that never combine once the user notices they're unwanted. (Though perhaps such exception lists are a fool's errand given the inherent roughness of this technique, which in my experience often improves the raw texts passed into IR/classification steps but is rarely conformant enough to human-perceived phrases that you'd want to show the combinations to average users.)

(For people who don't need any bulk statistical phrase discovery, but just a preprocessing step that applies their hand-chosen phrases: some users don't realize that's pretty easy in Python. Adding some code that only does that might be a nice preprocessing utility as well; see for example my demo code in an SO answer: https://stackoverflow.com/questions/58839049/python-connect-composed-keywords-in-texts/58864397#58864397)
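That preprocessing-only step really is just a few lines. Here's a minimal sketch (my own illustration, not the code from the linked SO answer) that greedily joins hand-chosen phrases in a token list, preferring the longest match at each position:

```python
def apply_phrases(tokens, phrases, delimiter="_"):
    """Greedily replace hand-chosen phrases (given as token tuples)
    with single delimiter-joined tokens; longest match wins."""
    phrases = sorted(phrases, key=len, reverse=True)  # prefer longer phrases
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(delimiter.join(phrase))
                i += len(phrase)
                break
        else:  # no phrase matched at position i
            out.append(tokens[i])
            i += 1
    return out

print(apply_phrases(["i", "love", "new", "york", "city"],
                    [("new", "york"), ("new", "york", "city")]))
# → ['i', 'love', 'new_york_city']
```

For large phrase sets you'd want a trie or a dict keyed by first token instead of a linear scan, but for a handful of hand-chosen phrases this is plenty.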

@piskvorky
Owner

piskvorky commented Oct 10, 2020

Wouldn't this work?

```python
phrases = Phrases(…)
frozen_phrases = phrases.freeze()
frozen_phrases.phrasegrams['my_phrase'] = float('inf')
```

Likewise for a blacklist – remove the offending key: `del frozen_phrases.phrasegrams['not_this']`.

If that's all that is needed then I'd leave it in userland / FAQ, no need for any special API.
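To illustrate the idea without gensim (the `combine` function and dict layout below are hypothetical stand-ins, not gensim API): treat the learned scores as a plain dict keyed by adjacent word pairs, force a phrase by setting its score to infinity, suppress one by deleting its key, and join only pairs whose score clears the threshold.

```python
import math

def combine(tokens, phrasegrams, threshold, delimiter="_"):
    """Join adjacent token pairs whose score clears the threshold."""
    out, i = [], 0
    while i < len(tokens) - 1:
        score = phrasegrams.get((tokens[i], tokens[i + 1]), -math.inf)
        if score > threshold:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    if i < len(tokens):  # trailing token that wasn't part of a pair
        out.append(tokens[i])
    return out

phrasegrams = {("machine", "learning"): 0.4}
phrasegrams[("ad", "hoc")] = math.inf     # force: always combine
phrasegrams.pop(("machine", "learning"))  # suppress: never combine
print(combine(["ad", "hoc", "machine", "learning"], phrasegrams, threshold=1.0))
# → ['ad_hoc', 'machine', 'learning']
```

The infinity-score and key-deletion tricks are exactly the userland patching suggested above; the only open question is whether incremental training would overwrite them.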

@gojomo
Collaborator

gojomo commented Oct 11, 2020

Sure! That'd be great in a help page/recipe, or as convenience methods to 'force' or 'delete' specific phrases in a frozen model. (Though, it strikes me that people may want to make the decision once that phrase X should either 'always' or 'never' be created - without having to potentially re-patch the model after any incremental training that might have undone a previous forced-score or manual deletion.)
