add 'Word Mover's Distance' implementation to gensim? #482

gojomo · 2015-10-15T03:44:48Z

"From Word Embeddings to Document Distances" (Kusner et al 2015: http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf) introduces the "Word Mover's Distance" (WMD), a novel distance-between-text-documents measure. It is an adaptation of another distance metric, "Earth Mover's Distance" (EMD), introduced(?) in "A Metric for Distributions with Applications
to Image Databases" (Rubner et al 1998: http://www.cs.jhu.edu/~misha/Papers/Rubner98.pdf), that is useful for comparing images and can be calculated as a special case of much-older transportation problem optimizations. For text, the WMD leverages word2vec vectors of the documents' individual words, in a way that seems to outperform simple combinations (sum/mean) of those word vectors.

There's a blog post from OpenTable with some impressive examples of "similar sentences" using WMD on restaurant reviews: http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

The paper reports strong kNN classifier results on document classification, and intuitively, the method seems amenable to the selection (or even synthesis) of 'canonical' or 'borderline/contrastive' text segments, perhaps to assist iterative classification tasks.

The author's code is available as a ZIP bundle (http://matthewkusner.com/#page2) – but its python wrappers depend on some older C EMD code of unclear licensing status. There's a bunch of other EMD code around for other projects (esp. image similarity measures) that also might be a good starting point.

An implementation (perhaps in optimized cython) for gensim might be widely useful.

(Further out: it might be interesting to use WMD to induce a document-embedding. That is, train doc embeddings with random draws of 3 documents, nudging the embeddings so that the relative which-two-are-closest is the same in the embedding space as in the WMD metric.)

piskvorky · 2015-10-15T07:35:19Z

CC @tmylk: backlog of open source tasks.

tmylk · 2015-10-15T08:13:03Z

Added to https://github.com/piskvorky/gensim/wiki/Ideas-&-Feature-proposals

piskvorky · 2015-11-07T02:56:52Z

@olavurmortensen has taken up this issue now and is working on the integration of WMD into gensim.

Regarding the "unclear license": let's clarify with @mkusner. We definitely don't want to include anything without a license / LGPL non-compatible.

gojomo · 2015-11-07T05:50:35Z

This may be a preferable C codebase to build on – http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/ – as the authors of it and the related paper make a case for it (and their related "EMD-hat" metric) being faster/better than earlier methods, and it has a clear BSD license.

There's also an existing python wrapper for it: https://github.com/wmayner/pyemd

That pyemd project is missing interfaces to some of the interesting capabilities of the underlying code, but they might be pretty easy to add:

an adjustable penalty for excess 'unmoved' mass (there's an unmerged PR for that)
access to the exact best 'flows' (issue discussing: Getting the flows wmayner/pyemd#2)

olavurmortensen · 2015-11-07T19:39:33Z

pyemd is able to handle sequences of differing lengths, which mkusner's version is not. Although, at this point I'm guessing what the correct input to pyemd's emd function are, because I can't find an explanation.

I found an odd thing in both implementations implementation. If I add the same word to both sentences, the distance decreases; example: "two three" and "four five" are more distant than "two three one" and "four five "one". From my understanding of EMD, this does not make sense, as the distance from "one" to "one" is zero, and thus should add nothing to the EMD.

I'm working on this here at the moment.

mkusner · 2015-11-07T20:31:57Z

Hey all! I would absolutely love to include WMD in gensim. It's such a great package!! In terms of an implementation, Vlad Niculae (https://github.com/vene) at Cornell has made a very awesome scikit_learn-friendly version of the Word Mover's Distance that is much more robust and well-documented than my original code release. He just released a blog post with this code here: http://vene.ro/blog/word-movers-distance-in-python.html Maybe @olavurmortensen can merge any code with this that isn't overlapping? It's very exciting to have all of these implementations!

As to the comment @olavurmortensen raised about the distance decreasing when "one" is added to both documents: this is because WMD currently normalizes the word counts by the total number of words in the document. Thus, you are right that the distance between "one" and "one" zero, but the distance between the two documents without "one" is:
D = 0.5_d("two") + 0.5_d("three")

(where d("two") is the Euclidean distance between vector representing "two" and either the vector for "four" or "five" depending on the optimal EMD distance, and same for d("three")) And the distance with "one" is:

D = 0.33_d("two") + 0.33_d("three") + 0.33*0

So the distance shrinks. For us it made sense to think of document counts as a distribution over words and the WMD as the cost of morphing the distribution of one document into another. But if this isn't what you're looking for then you can remove this normalization and you'll get the behavior you're looking for. Let me know if you have any more questions about this.

Let us know about what would be next best to do to work WMD code into gensim!

gojomo · 2015-11-07T21:24:34Z

I see @vene's work also uses @wmayner's pyemd wrapper of Pele & Wermen's code.

The treatment where adding 'one' to both distributions results in a smaller-distance better fits my intuitive understanding of what we'd usually like the WMD to do, when doing text-similarity. Two sentences that share more words should be closer: their meanings have greater overlap.

The reason I'd like to see pyemd (and gensim) offer an interface into the handling of unmatched 'mass' and the net 'flows' is that I have a hunch that could offer some interesting possibilities for further characterizing the difference between texts. (For example the unmatched mass, or the flows-between-most-distant-word-pairs, may be useful as a more-than-just-scalar indication of what topics are unique to each side of a comparison.)

piskvorky · 2015-11-08T04:27:10Z

Great to have a working reference implementation! Thanks so much for all the info and tips @mkusner :)

@vene are you OK with using parts of your code for gensim? I haven't checked your blog post in detail yet (in a minute!), but I'm thinking some parts may be directly transferable.

I'm not excited about the duplication of effort obviously, but WMD seems like a very useful functionality. We definitely want this in "core gensim", rather than relying on external code / external lib release cycles or APIs.

vene · 2015-11-08T15:59:40Z

Sure, I'd be happy if my code could help, and I could also lend a hand maybe.

@gojomo I really like your thoughts on extracting the actual flows/matchings! WMD, as it is, is great for KNN classification, but I tried using it for feature extraction between pairs of texts, and I didn't get results as exciting as it might have been with access to the flow.

About penalty for unmatched mass, I'm not sure the intuition is, I'd love to hear your thoughts!

This is all really exciting and it would be great to collaborate on this!

olavurmortensen · 2015-11-09T17:35:40Z

Here's a plan that I worked out with @tmylk:

Implement a method for calculating the WMD in Gensim, including the pyemd EMD wrapper in Gensim.
Implement a kNN classifier using WMD and document similarities in Gensim.

I will be drawing inspiration from @vene's blog during the process.

wmayner · 2015-11-09T21:17:00Z

Hey folks,

Really cool to see that pyemd might be useful for your project! Because of the interest, I finally got around to merging the PR that @vbraun submitted to expose the extra_mass_penalty argument. I just pushed v0.2.0 with that addition.

Getting the flows from the algorithm is something others have expressed interest in as well. Unfortunately I'm super busy at the moment and I won't be able to work on that anytime soon, but it is possible to pass a NumPy array to a C++ function via Cython so that the C++ function can then fill it with data—see, for example, this StackOverflow question—so implementing it shouldn't be too difficult. I'd be happy to help with a PR if someone wants to give it a shot.

@olavurmortensen, check the docstring for the emd function—it tells you what the correct input is (you can use help(emd), or if you're using ipython, you can do emd?).

Cheers,
Will

piskvorky · 2015-11-10T06:22:45Z

Thanks @wmayner !

What are your future plans regarding pyemd? @tmylk is considering subsuming (some of) pyemd code inside gensim. It will make the release cycles and custom adjustments and maintenance easier for us, but adds extra burden on "syncing" potential changes from your upstream repo. So I'm not sure.

olavurmortensen · 2015-11-10T16:07:05Z

I'm working on this on a branch of my own fork of Gensim, and have submitted a PR.

wmayner · 2015-11-10T18:24:45Z

@piskvorky, it's unlikely that our use of the EMD will change significantly in the foreseeable future, so right now I'm not planning on doing much more work on pyemd. I'd say forking it and merging with your codebase so you have more control makes more sense, since there won't be many upstream changes to incorporate, if any.

Cheers,
Will

ddofer · 2016-07-27T09:07:31Z

Hi,
I heard yesterday from Gensim's community manager in Israel that you guys have added an implementation of WMD (atop Word Vectors) in Gensim; From what I see here it seems that it's not actually implemented yet in gensim itself?

olavurmortensen · 2016-07-27T09:37:26Z

@ddofer This is partly true. Gensim prepares the text data to be input to a solver in the pyemd library. This solver finds the EMD distance (which inspired WMD), and essentially solves an optimization problem known as the "transportation problem".

To my knowledge there is no plan to implement a EMD solver directly in Gensim.

Did this answer your question @ddofer?

gojomo · 2016-08-03T13:13:36Z

Noticed a small error in WMD_tutorial.ipynb: it currently suggests about the GoogleNews vectors: "It just so happens that the vectors we have downloaded are already normalized, so it won't do any difference in this case." In fact they're not natively normed – only a similarity-operation or explicit init_sims() creates the normed version.

tmylk · 2016-10-05T08:52:29Z

@RishabGoel Could you please update the ipynb tutorial with this correction?

RishabGoel · 2016-10-05T20:12:51Z

@tmylk Sure!

piskvorky · 2016-11-04T05:35:32Z

@RishabGoel @tmylk any progress?

tmylk · 2017-01-06T23:55:45Z

By the way, @gojomo the flows are available in PyEMD 0.4.1
(Reply to #482 (comment))

wmayner · 2017-01-12T23:34:53Z

@tmylk, @gojomo, @piskvorky, I just updated the requirements and pushed a new version—PyEMD 0.4.2 now works with Python 3.3 and NumPy 1.9, so you should be able to get the flows in Genism now 👍

tmylk · 2017-01-13T02:21:28Z

@wmayner Thanks. Confirm it works. Version unpinned. Looking forward to using the flows.

menshikh-iv · 2017-10-03T06:48:32Z

Implemented by @olavurmortensen in #521

piskvorky added feature Issue described a new feature wishlist Feature request difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 15, 2015

piskvorky removed the wishlist Feature request label Nov 7, 2015

olavurmortensen mentioned this issue Nov 10, 2015

[WIP] Implement Word Mover's Distance (WMD) in Gensim #521

Closed

tmylk added a commit that referenced this issue Apr 5, 2016

Word Movers Distance #482 , #521 and #619

a355c69

piskvorky assigned tmylk Jul 27, 2016

tmylk referenced this issue in wmayner/pyemd Jan 6, 2017

Update required numpy version

e3bf5a7

tmylk mentioned this issue Jan 13, 2017

Remove pin for PyEMD #1089

Merged

menshikh-iv closed this as completed Oct 3, 2017

menshikh-iv unassigned tmylk Oct 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add 'Word Mover's Distance' implementation to gensim? #482

add 'Word Mover's Distance' implementation to gensim? #482

gojomo commented Oct 15, 2015

piskvorky commented Oct 15, 2015

tmylk commented Oct 15, 2015

piskvorky commented Nov 7, 2015

gojomo commented Nov 7, 2015

olavurmortensen commented Nov 7, 2015

mkusner commented Nov 7, 2015

gojomo commented Nov 7, 2015

piskvorky commented Nov 8, 2015

vene commented Nov 8, 2015

olavurmortensen commented Nov 9, 2015

wmayner commented Nov 9, 2015

piskvorky commented Nov 10, 2015

olavurmortensen commented Nov 10, 2015

wmayner commented Nov 10, 2015

ddofer commented Jul 27, 2016

olavurmortensen commented Jul 27, 2016

gojomo commented Aug 3, 2016

tmylk commented Oct 5, 2016

RishabGoel commented Oct 5, 2016

piskvorky commented Nov 4, 2016

tmylk commented Jan 6, 2017 •

edited

Loading

wmayner commented Jan 12, 2017

tmylk commented Jan 13, 2017 •

edited

Loading

menshikh-iv commented Oct 3, 2017

add 'Word Mover's Distance' implementation to gensim? #482

add 'Word Mover's Distance' implementation to gensim? #482

Comments

gojomo commented Oct 15, 2015

piskvorky commented Oct 15, 2015

tmylk commented Oct 15, 2015

piskvorky commented Nov 7, 2015

gojomo commented Nov 7, 2015

olavurmortensen commented Nov 7, 2015

mkusner commented Nov 7, 2015

gojomo commented Nov 7, 2015

piskvorky commented Nov 8, 2015

vene commented Nov 8, 2015

olavurmortensen commented Nov 9, 2015

wmayner commented Nov 9, 2015

piskvorky commented Nov 10, 2015

olavurmortensen commented Nov 10, 2015

wmayner commented Nov 10, 2015

ddofer commented Jul 27, 2016

olavurmortensen commented Jul 27, 2016

gojomo commented Aug 3, 2016

tmylk commented Oct 5, 2016

RishabGoel commented Oct 5, 2016

piskvorky commented Nov 4, 2016

tmylk commented Jan 6, 2017 • edited Loading

wmayner commented Jan 12, 2017

tmylk commented Jan 13, 2017 • edited Loading

menshikh-iv commented Oct 3, 2017

tmylk commented Jan 6, 2017 •

edited

Loading

tmylk commented Jan 13, 2017 •

edited

Loading