Scoring function in Phrases model is hardcoded #1635

hristo-vrigazov · 2017-10-18T21:39:38Z

The Phrases model is based on word and bigram counts, and it scores candidate phrases with a scoring function that can be supplied via the constructor of Phrases (the scoring parameter). However, the scoring field is used only in the export_phrases method. Have a look here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L269
and here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L284

count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)

However, the __getitem__ method always uses the default scoring formula, regardless of the scoring parameter (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L334):

pa = float(vocab[word_a])
pb = float(vocab[word_b])
pab = float(vocab[bigram_word])
score = (pab - min_count) / pa / pb * len(vocab)

This looks like a bug to me (we are always using the default scoring, even if we explicitly stated npmi in the constructor). Is it okay if I open a pull request fixing this one?
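To make the expected behavior concrete, here is a minimal standalone sketch, not the gensim code itself. It assumes the scorer is a callable taking (count_a, count_b, count_ab), as in the export_phrases snippet above. The names default_score, npmi_score, and score_bigram, the "_" delimiter, and the toy counts are hypothetical and chosen only for illustration.

```python
import math
from functools import partial


def default_score(count_a, count_b, count_ab, len_vocab, min_count):
    # Same formula as the hardcoded line quoted from __getitem__.
    return (count_ab - min_count) / count_a / count_b * len_vocab


def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    # Normalized pointwise mutual information; ranges from -1 to 1.
    pa = count_a / corpus_word_count
    pb = count_b / corpus_word_count
    pab = count_ab / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)


def score_bigram(vocab, word_a, word_b, scoring_function, delimiter="_"):
    # Hypothetical helper: one code path that both export_phrases and
    # __getitem__ could share, so the configured scorer is always used.
    bigram_word = word_a + delimiter + word_b
    count_a = float(vocab[word_a])
    count_b = float(vocab[word_b])
    count_ab = float(vocab[bigram_word])
    return scoring_function(count_a, count_b, count_ab)


# Toy counts, made up for illustration only.
vocab = {"new": 10, "york": 8, "new_york": 6}

# Bind the extra parameters so both scorers share the three-argument form.
default_scorer = partial(default_score, len_vocab=len(vocab), min_count=1)
npmi_scorer = partial(npmi_score, corpus_word_count=100)

default_value = score_bigram(vocab, "new", "york", default_scorer)
npmi_value = score_bigram(vocab, "new", "york", npmi_scorer)
```

With a shared helper like this, the two scorers give different values for the same bigram, which is the difference that __getitem__ currently hides by ignoring the configured scorer.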


gojomo · 2017-10-19T04:29:20Z

@michaelwsherman thoughts?

menshikh-iv · 2017-10-19T06:58:09Z

This problem is fixed in #1573 from @michaelwsherman.
Before merging #1573, I think we should merge #1568 and then resolve the conflicts in #1573. wdyt @gojomo?

gojomo · 2017-10-19T18:09:05Z

I have no opinion on which should merge first.

michaelwsherman · 2017-10-23T18:55:29Z

I'm of the opinion that #1573 should be merged first :), but I'll wait for #1568 -- just be aware that it could easily be a few weeks after #1568 until I merge the code--looks like it will be a meaty merge. If it's important that the merge happen quickly then maybe #1573 should merge first since #1568 looks more active right now. But happy to make whatever work if y'all don't mind the wait.

menshikh-iv · 2017-10-24T09:58:13Z

Ok, let's merge #1573 first, thanks for the clarification @michaelwsherman.

@piskvorky

* initial commit of fixes in comments of #1423
* removed unnecessary space in logger
* added support for custom Phrases scorers
* fixed Phrases.__getitem__ to support pluggable scoring #1533
* travisCI style fixes
* fixed __next__() to next() for python 3 compatibility
* misc fixes
* spacing fixes for style
* custom scorer support in sklearn api
* Phrases scikit interface tests for pluggable scoring
* missing line breaks
* style, clarity, and robustness fixes requested by @piskvorky
* check in Phrases init to make sure scorer is pickleable
* backwards scoring compatibility when loading a Phrases class
* removal of pickle testing objects in Phrases init
* switched to six for python 2/3 compatibility
* fix docstring

menshikh-iv · 2017-10-24T12:33:56Z

Resolved in #1573


menshikh-iv added the "bug" and "difficulty medium" labels Oct 19, 2017

menshikh-iv closed this as completed Oct 24, 2017
