Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

models.Phrases only supports word2vec paper's method of scoring potential n-grams #1363

Closed
michaelwsherman opened this issue May 24, 2017 · 6 comments
Labels
wishlist Feature request

Comments

@michaelwsherman
Copy link
Contributor

Based on this mailing list discussion, it seems desirable if models.Phrases supported alternative methods of scoring potential n-grams instead of forcing the use of the metric in the original word2vec paper (Section 4).

I have some changes I've made to models.Phrases that support an optional scoring function as an instance variable, which is then used in place of the default scoring. This is a relatively easy change to make, since it just replaces the score calculation in models.Phrases.export_phrases with a call to the scoring function.

I've also implemented normalized pointwise mutual information as an alternative scoring method in models.Phrases.export_phrases. It does, however, require an additional instance variable to track the corpus size, which is incremented as part of add_vocab.

If these changes are desirable, let me know and I'll submit a pull request. Otherwise, thanks for reading and close away :).

@gojomo
Copy link
Collaborator

gojomo commented May 25, 2017

Pluggable scoring sounds interesting and useful. The docs and refactoring of Phrases/etc would have to be handled carefully to avoid user confusion or a drag on performance for people using the classic scoring.

I believe there may be a Google Summer-of-Code project to make improvements to gensim's Phrases support – which may be focused on optimizations but might be able to consider other features/refactoring.

A well-designed, conscientious refactoring might also help resolve the open question on #1263/#1258 – whether that functionality should be parameterized in the main class, or in a clearly distinct class. (Something like: swappable creators of candidate-phrases, and swappable scorers.)

@tmylk should comment, but I'd think a PR could be useful even if it might take a while to integrate/reconcile into the main package. (It could be a useful model to others even before any full merge.)

@piskvorky
Copy link
Owner

piskvorky commented May 25, 2017

Sounds good to me.

One thing to look out for: objects must be serializable, so the "injected function as a parameter" must be named (pickle cannot handle lambdas).

@menshikh-iv
Copy link
Contributor

Maybe create PMI & NPMI as methods in Phrases and pass a string with a name to constructor (and match it to the method in __init__). This is less flexible, but avoid serialization problems

@michaelwsherman
Copy link
Contributor Author

Right now the code is what @menshikh-iv to try--the scoring methods are specified with a text parameter that needs to be either 'mikolov' or 'npmi'. If this is a good starting point I'll make the pull request--give me a week or two to figure things out since this is the first time I'll be contributing to open source.

@piskvorky, you raise a good point. Is there any easy way to test if a function is serializable (other than to try to pickle it and load it?). If it's an easy change I can make my existing code into pluggable functions and add the check as part of the constructor.

@menshikh-iv menshikh-iv added the wishlist Feature request label May 27, 2017
@piskvorky
Copy link
Owner

piskvorky commented May 29, 2017

-1 on passing strings instead of functions. Just pass the function as a parameter, directly -- more flexible.

@michaelwsherman no such check is needed. Python is duck-typed, so it's up to the user to pass parameters of the correct "type". And I can imagine use-cases where no serialization is needed (user doesn't need to save/load), so a forceful check would be harmful.

@gojomo
Copy link
Collaborator

gojomo commented May 29, 2017

@michaelwsherman I believe functions are serializable if-and-only-if they're defined globally (at the 'top level'). In that case pickle manages to write them as their global name; then if during unpickling the same function is (already imported) as the same global name, all works as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wishlist Feature request
Projects
None yet
Development

No branches or pull requests

4 participants