Better support for evaluating threshold settings in models.phrases.Phrases #1465

Closed
michaelwsherman opened this issue Jul 5, 2017 · 6 comments · Fixed by #2979
Labels
difficulty medium (required good gensim understanding & python skills) · feature (issue described a new feature) · wishlist (feature request)

Comments

@michaelwsherman
Contributor

One of the challenges when learning bigrams from a new corpus is determining the right scoring threshold for accepting or rejecting a potential bigram. Standard approaches involve taking a list of gold-labeled bigrams created by humans, ranking all bigrams in your corpus by their score, and determining a score threshold based on a comparison to the gold labels. (For one example, see this paper.)

Right now, doing this with built-in functionality in models.Phrases requires running export_phrases on your whole corpus. For an especially large corpus, that could mean a lot of wasted time waiting for export_phrases to run. It can also give strange results: if you set your threshold very low, a strong bigram in your corpus may not be output by export_phrases at all if the strong bigram's first word is especially rare and is usually preceded by a limited set of words (you'd instead get a bunch of lower-scoring bigrams in which the strong bigram's first word appears as the second word).

There should be a method that only traverses the vocab dictionary and returns the scores for the bigrams in the corpus. This would be faster than export_phrases and would ensure that all bigrams (that exceed some threshold) have their score output. I have code that does something like this, and I'm happy to contribute it. (Although it might make sense to wait until after #1464 and maybe #1446 are finalized.)
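The idea can be sketched in plain Python (this is an illustration, not gensim's actual implementation): count unigrams and bigrams in one pass, score every candidate with the Mikolov-style formula gensim's Phrases uses by default, (count(a,b) − min_count) · |vocab| / (count(a) · count(b)), and return the bigrams sorted by score so a threshold can be picked against gold labels.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every candidate bigram directly from corpus counts using
    the default Mikolov-style formula:
    (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    scores = {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
        if n_ab > min_count
    }
    # Highest-scoring bigrams first, ready for comparison against gold labels.
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [["new", "york", "is", "big"], ["new", "york", "city"], ["big", "city"]]
for bigram, score in score_bigrams(corpus):
    print(bigram, round(score, 2))  # → ('new', 'york') 1.25
```

Ranking the full list like this, rather than scanning the corpus, is what makes the gold-label comparison cheap even for large vocabularies.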

@gojomo
Collaborator

gojomo commented Jul 6, 2017

Sounds useful!

@michaelwsherman
Contributor Author

@gojomo Thanks. I'll add it after the scoring PR #1464 is accepted (or rejected), since that could affect how this method works, as it depends on the scoring. Right now export_phrases is the only method in Phrases that actually calculates scores; this method will have to calculate scores as well.

@menshikh-iv added the feature, difficulty medium, and wishlist labels on Oct 2, 2017
@piskvorky
Owner

piskvorky commented Oct 10, 2020

@michaelwsherman implemented in #2979: Phrases.export_phrases() now exports all phrases (that pass the threshold), it doesn't need any corpus.

The "old" functionality of finding phrases in a corpus was renamed to Phrases.find_phrases(corpus).

@gojomo
Collaborator

gojomo commented Oct 10, 2020

Another request that comes up from time to time: add some hand-selected bigrams (or longer) that a user has independently determined they want as phrases. It might be interesting to offer some tuning tools/methods that report: "if you want X, Y, Z to be phrasified, you'd have to set the parameters to N, M, etc, but then the top-20 most-marginal phrases would be {P1, P2, ...}" (so they'd see the side effects of those settings).

Alternatively, it might be possible (and now easier/cleaner with the #2976/etc refactorings) to add an exception set of 'forced' user choices that always combine regardless of their score, or conversely an exception set of 'suppressed' phrases that never combine once the user notices they're unwanted. (Though perhaps such exception lists are a fool's errand given the inherent roughness of this technique, which in my experience often improves the raw texts passed into IR/classification steps but is rarely conformant enough to human-perceived phrases that you'd want to show the combinations to average users.)

(For people who don't need any bulk statistical phrase discovery, but just a preprocessing step that applies their hand-chosen phrases: some users don't realize that's pretty easy in Python. Adding some code that only does that might be a nice preprocessing utility as well; see for example my demo code in an SO answer: https://stackoverflow.com/questions/58839049/python-connect-composed-keywords-in-texts/58864397#58864397)
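That preprocessing-only step really is just a few lines. Here's a minimal sketch (my own illustration, not the code from the linked SO answer) that greedily joins hand-chosen phrases in a token list, preferring the longest match at each position:

```python
def apply_phrases(tokens, phrases, delimiter="_"):
    """Greedily replace hand-chosen phrases (given as token tuples)
    with single delimiter-joined tokens; longest match wins."""
    phrases = sorted(phrases, key=len, reverse=True)  # prefer longer phrases
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(delimiter.join(phrase))
                i += len(phrase)
                break
        else:  # no phrase matched at position i
            out.append(tokens[i])
            i += 1
    return out

print(apply_phrases(["i", "love", "new", "york", "city"],
                    [("new", "york"), ("new", "york", "city")]))
# → ['i', 'love', 'new_york_city']
```

For large phrase sets you'd want a trie or a dict keyed by first token instead of a linear scan, but for a handful of hand-chosen phrases this is plenty.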

@piskvorky
Owner

piskvorky commented Oct 10, 2020

Wouldn't this work?

```python
phrases = Phrases(…)
frozen_phrases = phrases.freeze()
frozen_phrases.phrasegrams['my_phrase'] = float('inf')
```

Likewise for a blacklist – remove the offending key: `del frozen_phrases.phrasegrams['not_this']`.

If that's all that is needed then I'd leave it in userland / FAQ, no need for any special API.
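To illustrate the idea without gensim (the `combine` function and dict layout below are hypothetical stand-ins, not gensim API): treat the learned scores as a plain dict keyed by adjacent word pairs, force a phrase by setting its score to infinity, suppress one by deleting its key, and join only pairs whose score clears the threshold.

```python
import math

def combine(tokens, phrasegrams, threshold, delimiter="_"):
    """Join adjacent token pairs whose score clears the threshold."""
    out, i = [], 0
    while i < len(tokens) - 1:
        score = phrasegrams.get((tokens[i], tokens[i + 1]), -math.inf)
        if score > threshold:
            out.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    if i < len(tokens):  # trailing token that wasn't part of a pair
        out.append(tokens[i])
    return out

phrasegrams = {("machine", "learning"): 0.4}
phrasegrams[("ad", "hoc")] = math.inf     # force: always combine
phrasegrams.pop(("machine", "learning"))  # suppress: never combine
print(combine(["ad", "hoc", "machine", "learning"], phrasegrams, threshold=1.0))
# → ['ad_hoc', 'machine', 'learning']
```

The infinity-score and key-deletion tricks are exactly the userland patching suggested above; the only open question is whether incremental training would overwrite them.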

@gojomo
Collaborator

gojomo commented Oct 11, 2020

Sure! That'd be great in a help page/recipe, or as convenience methods to 'force' or 'delete' specific phrases in a frozen model. (Though, it strikes me that people may want to make the decision once that phrase X should either 'always' or 'never' be created - without having to potentially re-patch the model after any incremental training that might have undone a previous forced-score or manual deletion.)
