-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add 'Word Mover's Distance' implementation to gensim? #482
Comments
CC @tmylk: backlog of open source tasks. |
@olavurmortensen has taken up this issue now and is working on the integration of WMD into gensim. Regarding the "unclear license": let's clarify with @mkusner. We definitely don't want to include anything without a license / LGPL non-compatible. |
This may be a preferable C codebase to build on – http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/ – as the authors of it and the related paper make a case for it (and their related "EMD-hat" metric) being faster/better than earlier methods, and it has a clear BSD license. There's also an existing python wrapper for it: https://github.com/wmayner/pyemd That
|
I found an odd thing in both implementations implementation. If I add the same word to both sentences, the distance decreases; example: "two three" and "four five" are more distant than "two three one" and "four five "one". From my understanding of EMD, this does not make sense, as the distance from "one" to "one" is zero, and thus should add nothing to the EMD. I'm working on this here at the moment. |
Hey all! I would absolutely love to include WMD in gensim. It's such a great package!! In terms of an implementation, Vlad Niculae (https://github.com/vene) at Cornell has made a very awesome scikit_learn-friendly version of the Word Mover's Distance that is much more robust and well-documented than my original code release. He just released a blog post with this code here: http://vene.ro/blog/word-movers-distance-in-python.html Maybe @olavurmortensen can merge any code with this that isn't overlapping? It's very exciting to have all of these implementations! As to the comment @olavurmortensen raised about the distance decreasing when "one" is added to both documents: this is because WMD currently normalizes the word counts by the total number of words in the document. Thus, you are right that the distance between "one" and "one" zero, but the distance between the two documents without "one" is: (where d("two") is the Euclidean distance between vector representing "two" and either the vector for "four" or "five" depending on the optimal EMD distance, and same for d("three")) And the distance with "one" is: D = 0.33_d("two") + 0.33_d("three") + 0.33*0 So the distance shrinks. For us it made sense to think of document counts as a distribution over words and the WMD as the cost of morphing the distribution of one document into another. But if this isn't what you're looking for then you can remove this normalization and you'll get the behavior you're looking for. Let me know if you have any more questions about this. Let us know about what would be next best to do to work WMD code into gensim! |
I see @vene's work also uses @wmayner's The treatment where adding 'one' to both distributions results in a smaller-distance better fits my intuitive understanding of what we'd usually like the WMD to do, when doing text-similarity. Two sentences that share more words should be closer: their meanings have greater overlap. The reason I'd like to see |
Great to have a working reference implementation! Thanks so much for all the info and tips @mkusner :) @vene are you OK with using parts of your code for gensim? I haven't checked your blog post in detail yet (in a minute!), but I'm thinking some parts may be directly transferable. I'm not excited about the duplication of effort obviously, but WMD seems like a very useful functionality. We definitely want this in "core gensim", rather than relying on external code / external lib release cycles or APIs. |
Sure, I'd be happy if my code could help, and I could also lend a hand maybe. @gojomo I really like your thoughts on extracting the actual flows/matchings! WMD, as it is, is great for KNN classification, but I tried using it for feature extraction between pairs of texts, and I didn't get results as exciting as it might have been with access to the flow. About penalty for unmatched mass, I'm not sure the intuition is, I'd love to hear your thoughts! This is all really exciting and it would be great to collaborate on this! |
Here's a plan that I worked out with @tmylk:
I will be drawing inspiration from @vene's blog during the process. |
Hey folks, Really cool to see that Getting the flows from the algorithm is something others have expressed interest in as well. Unfortunately I'm super busy at the moment and I won't be able to work on that anytime soon, but it is possible to pass a NumPy array to a C++ function via Cython so that the C++ function can then fill it with data—see, for example, this StackOverflow question—so implementing it shouldn't be too difficult. I'd be happy to help with a PR if someone wants to give it a shot. @olavurmortensen, check the docstring for the Cheers, |
Thanks @wmayner ! What are your future plans regarding |
I'm working on this on a branch of my own fork of Gensim, and have submitted a PR. |
@piskvorky, it's unlikely that our use of the EMD will change significantly in the foreseeable future, so right now I'm not planning on doing much more work on Cheers, |
Hi, |
@ddofer This is partly true. Gensim prepares the text data to be input to a solver in the To my knowledge there is no plan to implement a EMD solver directly in Gensim. Did this answer your question @ddofer? |
Noticed a small error in WMD_tutorial.ipynb: it currently suggests about the GoogleNews vectors: "It just so happens that the vectors we have downloaded are already normalized, so it won't do any difference in this case." In fact they're not natively normed – only a similarity-operation or explicit |
@RishabGoel Could you please update the ipynb tutorial with this correction? |
@tmylk Sure! |
@RishabGoel @tmylk any progress? |
By the way, @gojomo the flows are available in PyEMD 0.4.1 |
@tmylk, @gojomo, @piskvorky, I just updated the requirements and pushed a new version—PyEMD 0.4.2 now works with Python 3.3 and NumPy 1.9, so you should be able to get the flows in Genism now 👍 |
@wmayner Thanks. Confirm it works. Version unpinned. Looking forward to using the flows. |
Implemented by @olavurmortensen in #521 |
"From Word Embeddings to Document Distances" (Kusner et al 2015: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) introduces the "Word Mover's Distance" (WMD), a novel distance-between-text-documents measure. It is an adaptation of another distance metric, "Earth Mover's Distance" (EMD), introduced(?) in "A Metric for Distributions with Applications
to Image Databases" (Rubner et al 1998: http://www.cs.jhu.edu/~misha/Papers/Rubner98.pdf), that is useful for comparing images and can be calculated as a special case of much-older transportation problem optimizations. For text, the WMD leverages word2vec vectors of the documents' individual words, in a way that seems to outperform simple combinations (sum/mean) of those word vectors.
There's a blog post from OpenTable with some impressive examples of "similar sentences" using WMD on restaurant reviews: http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
The paper reports strong kNN classifier results on document classification, and intuitively, the method seems amenable to the selection (or even synthesis) of 'canonical' or 'borderline/contrastive' text segments, perhaps to assist iterative classification tasks.
The author's code is available as a ZIP bundle (http://matthewkusner.com/#page2) – but its python wrappers depend on some older C EMD code of unclear licensing status. There's a bunch of other EMD code around for other projects (esp. image similarity measures) that also might be a good starting point.
An implementation (perhaps in optimized cython) for gensim might be widely useful.
(Further out: it might be interesting to use WMD to induce a document-embedding. That is, train doc embeddings with random draws of 3 documents, nudging the embeddings so that the relative which-two-are-closest is the same in the embedding space as in the WMD metric.)
The text was updated successfully, but these errors were encountered: