Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add 'Word Mover's Distance' implementation to gensim? #482

Closed
gojomo opened this issue Oct 15, 2015 · 24 comments
Closed

add 'Word Mover's Distance' implementation to gensim? #482

gojomo opened this issue Oct 15, 2015 · 24 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@gojomo
Copy link
Collaborator

gojomo commented Oct 15, 2015

"From Word Embeddings to Document Distances" (Kusner et al 2015: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) introduces the "Word Mover's Distance" (WMD), a novel distance-between-text-documents measure. It is an adaptation of another distance metric, "Earth Mover's Distance" (EMD), introduced(?) in "A Metric for Distributions with Applications
to Image Databases" (Rubner et al 1998: http://www.cs.jhu.edu/~misha/Papers/Rubner98.pdf), that is useful for comparing images and can be calculated as a special case of much-older transportation problem optimizations. For text, the WMD leverages word2vec vectors of the documents' individual words, in a way that seems to outperform simple combinations (sum/mean) of those word vectors.

There's a blog post from OpenTable with some impressive examples of "similar sentences" using WMD on restaurant reviews: http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

The paper reports strong kNN classifier results on document classification, and intuitively, the method seems amenable to the selection (or even synthesis) of 'canonical' or 'borderline/contrastive' text segments, perhaps to assist iterative classification tasks.

The author's code is available as a ZIP bundle (http://matthewkusner.com/#page2) – but its python wrappers depend on some older C EMD code of unclear licensing status. There's a bunch of other EMD code around for other projects (esp. image similarity measures) that also might be a good starting point.

An implementation (perhaps in optimized cython) for gensim might be widely useful.

(Further out: it might be interesting to use WMD to induce a document-embedding. That is, train doc embeddings with random draws of 3 documents, nudging the embeddings so that the relative which-two-are-closest is the same in the embedding space as in the WMD metric.)

@piskvorky
Copy link
Owner

CC @tmylk: backlog of open source tasks.

@piskvorky piskvorky added feature Issue described a new feature wishlist Feature request difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 15, 2015
@tmylk
Copy link
Contributor

tmylk commented Oct 15, 2015

@piskvorky
Copy link
Owner

@olavurmortensen has taken up this issue now and is working on the integration of WMD into gensim.

Regarding the "unclear license": let's clarify with @mkusner. We definitely don't want to include anything without a license / LGPL non-compatible.

@gojomo
Copy link
Collaborator Author

gojomo commented Nov 7, 2015

This may be a preferable C codebase to build on – http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/ – as the authors of it and the related paper make a case for it (and their related "EMD-hat" metric) being faster/better than earlier methods, and it has a clear BSD license.

There's also an existing python wrapper for it: https://github.com/wmayner/pyemd

That pyemd project is missing interfaces to some of the interesting capabilities of the underlying code, but they might be pretty easy to add:

  • an adjustable penalty for excess 'unmoved' mass (there's an unmerged PR for that)
  • access to the exact best 'flows' (issue discussing: Getting the flows wmayner/pyemd#2)

@piskvorky piskvorky removed the wishlist Feature request label Nov 7, 2015
@olavurmortensen
Copy link
Contributor

pyemd is able to handle sequences of differing lengths, which mkusner's version is not. Although, at this point I'm guessing what the correct input to pyemd's emd function are, because I can't find an explanation.

I found an odd thing in both implementations implementation. If I add the same word to both sentences, the distance decreases; example: "two three" and "four five" are more distant than "two three one" and "four five "one". From my understanding of EMD, this does not make sense, as the distance from "one" to "one" is zero, and thus should add nothing to the EMD.

I'm working on this here at the moment.

@mkusner
Copy link

mkusner commented Nov 7, 2015

Hey all! I would absolutely love to include WMD in gensim. It's such a great package!! In terms of an implementation, Vlad Niculae (https://github.com/vene) at Cornell has made a very awesome scikit_learn-friendly version of the Word Mover's Distance that is much more robust and well-documented than my original code release. He just released a blog post with this code here: http://vene.ro/blog/word-movers-distance-in-python.html Maybe @olavurmortensen can merge any code with this that isn't overlapping? It's very exciting to have all of these implementations!

As to the comment @olavurmortensen raised about the distance decreasing when "one" is added to both documents: this is because WMD currently normalizes the word counts by the total number of words in the document. Thus, you are right that the distance between "one" and "one" zero, but the distance between the two documents without "one" is:
D = 0.5_d("two") + 0.5_d("three")

(where d("two") is the Euclidean distance between vector representing "two" and either the vector for "four" or "five" depending on the optimal EMD distance, and same for d("three")) And the distance with "one" is:

D = 0.33_d("two") + 0.33_d("three") + 0.33*0

So the distance shrinks. For us it made sense to think of document counts as a distribution over words and the WMD as the cost of morphing the distribution of one document into another. But if this isn't what you're looking for then you can remove this normalization and you'll get the behavior you're looking for. Let me know if you have any more questions about this.

Let us know about what would be next best to do to work WMD code into gensim!

@gojomo
Copy link
Collaborator Author

gojomo commented Nov 7, 2015

I see @vene's work also uses @wmayner's pyemd wrapper of Pele & Wermen's code.

The treatment where adding 'one' to both distributions results in a smaller-distance better fits my intuitive understanding of what we'd usually like the WMD to do, when doing text-similarity. Two sentences that share more words should be closer: their meanings have greater overlap.

The reason I'd like to see pyemd (and gensim) offer an interface into the handling of unmatched 'mass' and the net 'flows' is that I have a hunch that could offer some interesting possibilities for further characterizing the difference between texts. (For example the unmatched mass, or the flows-between-most-distant-word-pairs, may be useful as a more-than-just-scalar indication of what topics are unique to each side of a comparison.)

@piskvorky
Copy link
Owner

Great to have a working reference implementation! Thanks so much for all the info and tips @mkusner :)

@vene are you OK with using parts of your code for gensim? I haven't checked your blog post in detail yet (in a minute!), but I'm thinking some parts may be directly transferable.

I'm not excited about the duplication of effort obviously, but WMD seems like a very useful functionality. We definitely want this in "core gensim", rather than relying on external code / external lib release cycles or APIs.

@vene
Copy link

vene commented Nov 8, 2015

Sure, I'd be happy if my code could help, and I could also lend a hand maybe.

@gojomo I really like your thoughts on extracting the actual flows/matchings! WMD, as it is, is great for KNN classification, but I tried using it for feature extraction between pairs of texts, and I didn't get results as exciting as it might have been with access to the flow.

About penalty for unmatched mass, I'm not sure the intuition is, I'd love to hear your thoughts!

This is all really exciting and it would be great to collaborate on this!

@olavurmortensen
Copy link
Contributor

Here's a plan that I worked out with @tmylk:

  • Implement a method for calculating the WMD in Gensim, including the pyemd EMD wrapper in Gensim.
  • Implement a kNN classifier using WMD and document similarities in Gensim.

I will be drawing inspiration from @vene's blog during the process.

@wmayner
Copy link

wmayner commented Nov 9, 2015

Hey folks,

Really cool to see that pyemd might be useful for your project! Because of the interest, I finally got around to merging the PR that @vbraun submitted to expose the extra_mass_penalty argument. I just pushed v0.2.0 with that addition.

Getting the flows from the algorithm is something others have expressed interest in as well. Unfortunately I'm super busy at the moment and I won't be able to work on that anytime soon, but it is possible to pass a NumPy array to a C++ function via Cython so that the C++ function can then fill it with data—see, for example, this StackOverflow question—so implementing it shouldn't be too difficult. I'd be happy to help with a PR if someone wants to give it a shot.

@olavurmortensen, check the docstring for the emd function—it tells you what the correct input is (you can use help(emd), or if you're using ipython, you can do emd?).

Cheers,
Will

@piskvorky
Copy link
Owner

Thanks @wmayner !

What are your future plans regarding pyemd? @tmylk is considering subsuming (some of) pyemd code inside gensim. It will make the release cycles and custom adjustments and maintenance easier for us, but adds extra burden on "syncing" potential changes from your upstream repo. So I'm not sure.

@olavurmortensen
Copy link
Contributor

I'm working on this on a branch of my own fork of Gensim, and have submitted a PR.

@wmayner
Copy link

wmayner commented Nov 10, 2015

@piskvorky, it's unlikely that our use of the EMD will change significantly in the foreseeable future, so right now I'm not planning on doing much more work on pyemd. I'd say forking it and merging with your codebase so you have more control makes more sense, since there won't be many upstream changes to incorporate, if any.

Cheers,
Will

tmylk added a commit that referenced this issue Apr 5, 2016
@ddofer
Copy link

ddofer commented Jul 27, 2016

Hi,
I heard yesterday from Gensim's community manager in Israel that you guys have added an implementation of WMD (atop Word Vectors) in Gensim; From what I see here it seems that it's not actually implemented yet in gensim itself?

@olavurmortensen
Copy link
Contributor

@ddofer This is partly true. Gensim prepares the text data to be input to a solver in the pyemd library. This solver finds the EMD distance (which inspired WMD), and essentially solves an optimization problem known as the "transportation problem".

To my knowledge there is no plan to implement a EMD solver directly in Gensim.

Did this answer your question @ddofer?

@gojomo
Copy link
Collaborator Author

gojomo commented Aug 3, 2016

Noticed a small error in WMD_tutorial.ipynb: it currently suggests about the GoogleNews vectors: "It just so happens that the vectors we have downloaded are already normalized, so it won't do any difference in this case." In fact they're not natively normed – only a similarity-operation or explicit init_sims() creates the normed version.

@tmylk
Copy link
Contributor

tmylk commented Oct 5, 2016

@RishabGoel Could you please update the ipynb tutorial with this correction?

@RishabGoel
Copy link
Contributor

@tmylk Sure!

@piskvorky
Copy link
Owner

@RishabGoel @tmylk any progress?

tmylk referenced this issue in wmayner/pyemd Jan 6, 2017
@tmylk
Copy link
Contributor

tmylk commented Jan 6, 2017

By the way, @gojomo the flows are available in PyEMD 0.4.1
(Reply to #482 (comment))

@wmayner
Copy link

wmayner commented Jan 12, 2017

@tmylk, @gojomo, @piskvorky, I just updated the requirements and pushed a new version—PyEMD 0.4.2 now works with Python 3.3 and NumPy 1.9, so you should be able to get the flows in Genism now 👍

@tmylk
Copy link
Contributor

tmylk commented Jan 13, 2017

@wmayner Thanks. Confirm it works. Version unpinned. Looking forward to using the flows.

@menshikh-iv
Copy link
Contributor

Implemented by @olavurmortensen in #521

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature
Projects
None yet
Development

No branches or pull requests

10 participants