
Add common terms phrases model #1263

Closed
wants to merge 2 commits

Conversation

alexgarel (Contributor)

This follows feature request #1258.

This adds a new model, CommonTermsPhrases, and its companion CommonTermsPhraser.
These classes make it possible to take common terms (a.k.a. stop words) into account when searching for bigrams. Common terms are not counted in the frequency computation, but they are kept in the final bigram.

This makes it possible to catch expressions like "eye of the beholder" or "lack of interest" (in languages like French, common terms inside expressions are quite frequent).

@alexgarel (Contributor, Author)

@tmylk @gojomo here is my proposal. Feel free to comment.

@tmylk (Contributor) commented May 2, 2017

@alexgarel Thanks for the new feature and the tests. Apologies for the late review.
It seems that a lot of code is duplicated from the usual Phrases, which is a maintenance issue. Would it be possible to add `common_terms` as an optional parameter to the existing `Phraser`?

@gojomo (Collaborator) commented May 14, 2017

I'm concerned that this isn't quite the ElasticSearch "common grams" approach, but something else, only described/demonstrated/named in the related issue and source code. I can believe it'd be useful in some cases, but have no idea what they are. Without careful/detailed docs of why it does what it does, my thought is it'd be even better further separated from the existing Phrases functionality – for example, marked as experimental in a separate file. Also, I'm still wondering about what does (or should) happen in alternating common-uncommon-common-uncommon-etc cases like the example in my question on the issue.

@alexgarel (Contributor, Author)

@gojomo you're right, it's not "common grams"; it only takes inspiration from that approach.
Since it's not clear, here is another way to see it: it stops common terms (a.k.a. stop words) from getting in the way of building bigrams from the other words. In the current state of Phrases it is easy to capture "school bus", but it is impossible (even with more than one pass of the bigram model) to capture "bank of america" or "way of life".
This implementation makes that possible: stop words between other words are retained in the final expression, but not counted in the statistics.

While handling stop words may matter less in English, because most expressions do not contain them, it is a greater concern in French, for example, where many more expressions use them ("bain de bouche", "don de sang", "prêt à porter", etc.).
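The counting idea can be sketched in a few lines of plain Python. This is a toy illustration only, not the code from this PR; `count_candidates` and its signature are invented for the example:

```python
# Toy sketch of the counting pass: a run of common terms between two
# content words does not break the pair; the commons are kept inside
# the n-gram key, but are never counted as unigrams themselves.
from collections import Counter

def count_candidates(tokens, common_terms):
    unigrams, bigrams = Counter(), Counter()
    prev, between = None, []   # last uncommon word, commons seen since
    for tok in tokens:
        if tok in common_terms:
            if prev is not None:
                between.append(tok)
            continue
        unigrams[tok] += 1
        if prev is not None:
            bigrams["_".join([prev] + between + [tok])] += 1
        prev, between = tok, []
    return unigrams, bigrams

unis, bis = count_candidates("bank of america and bank of england".split(),
                             {"of", "and"})
```

Here `bis` ends up containing `bank_of_america` and `bank_of_england`, while `of` never appears among the unigram counts.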

@alexgarel (Contributor, Author)

@tmylk we have to make a decision, and it's not an easy one:

  1. add the common terms handling to the main Phrases and Phraser
  2. refactor Phrases and Phraser so that a separate common-terms class can share more of their code
  3. keep things separate

Options 1 and 2 may incur a performance penalty, while option 3 carries a higher maintenance cost.

We could benchmark solutions 1 and 2 to measure the cost. Do you have a preferred dataset for running a benchmark?

I'm happy to work more on this, and to add more explanation of the behaviour, as @gojomo requested.

@gojomo (Collaborator) commented May 15, 2017

Thanks for the clarification & offer of more integration/doc work!

Is there an article or paper on this technique anywhere? Is it a common/popular technique in other languages, with any existing name?

I'm still a bit unclear about what could possibly happen with certain patterns, like UNCOMMON1 COMMON2 UNCOMMON3 COMMON4 UNCOMMON5. Can this become UNCOMMON1_COMMON2_UNCOMMON3_COMMON4_UNCOMMON5 in single pass? Or are trigrams the largest combinations that come from a single pass?

If a trigram like UNCOMMON1_COMMON2_UNCOMMON3 is creatable after a single pass, is that because the co-occurrence of UNCOMMON1 & UNCOMMON3 (skipping over the COMMON2 term) was learned in the 1st pass?

Will this approach do any of the automatic COMMON_UNCOMMON bigram combining, as in the example on the Elasticsearch Common Grams TokenFilter page? Or is it just a matter of making the COMMON words invisible to phrase-analysis, then attaching them internally (never on the ends) in any phrases still detected?

@alexgarel (Contributor, Author)

@gojomo

  1. There is no publication or existing implementation that I'm aware of, but it's really just an adaptation of the existing bigram algorithm.

  2. Yes, UNCOMMON1 COMMON2 UNCOMMON3 COMMON4 UNCOMMON5 would become UNCOMMON1_COMMON2_UNCOMMON3_COMMON4_UNCOMMON5 in a single pass, if the statistics support it.

  3. UNCOMMON1_COMMON2_UNCOMMON3 is creatable because the co-occurrence of UNCOMMON1 & UNCOMMON3 (with COMMON2 in between) was learned. That is, the count is of UNCOMMON1 co-occurring with COMMON2 UNCOMMON3, but COMMON2's frequency is not used in the retention formula.

  4. Yes, it's really about combining COMMON_UNCOMMON.
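A single-pass chain like the one in answer 2 can be sketched as follows. This is toy code, not the PR's implementation; `accepted` stands in for whatever set of n-grams passed the score threshold:

```python
# Toy sketch of a greedy single-pass joiner: the current phrase keeps
# growing as long as each (uncommon, commons..., uncommon) step is in
# the accepted set, so chains longer than a trigram can form in one pass.
def join_in_one_pass(tokens, common_terms, accepted):
    out, cur, pending = [], [], []
    for tok in tokens:
        if tok in common_terms:
            (pending if cur else out).append(tok)
            continue
        if not cur:
            cur = [tok]
            continue
        step = "_".join([cur[-1]] + pending + [tok])  # cur[-1] is uncommon
        if step in accepted:
            cur += pending + [tok]
        else:
            out.append("_".join(cur))
            out.extend(pending)
            cur = [tok]
        pending = []
    if cur:
        out.append("_".join(cur))
        out.extend(pending)
    return out

merged = join_in_one_pass(
    ["u1", "c2", "u3", "c4", "u5"],
    {"c2", "c4"},
    {"u1_c2_u3", "u3_c4_u5"},
)
```

With both steps accepted, `merged` is the single token `u1_c2_u3_c4_u5`; if only the first step were accepted, the pass would stop at `u1_c2_u3`.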

@gojomo (Collaborator) commented May 15, 2017

I think I'm starting to get the gist of it. Regarding the interaction of your answers (3) & (4):

(5) Are any of the co-occurrence counts tracked with regard to combined common-words – for example, is the creation of UNCOMMON1_COMMON2_UNCOMMON3 because of UNCOMMON1 co-occurrence count with COMMON2_UNCOMMON3? Or is only the UNCOMMON1 & UNCOMMON3 co-occurrence really counted (without any regard a common-word between)?

(5a) If the former – common words that are auto-bigrammed become part of the co-occurrence stat keys – will an original text like UNCOMMON1 COMMON2 UNCOMMON3 generate counts for both the UNCOMMON1+COMMON2_UNCOMMON3 co-occurrence and the UNCOMMON1_COMMON2+UNCOMMON3 co-occurrence? Or is there either some greedy left-to-right, or prefers-common-on-left, practice that only pairs the COMMON with one or the other of its neighbors?

(5b) If the latter – only UNCOMMON to UNCOMMON co-occurrences are truly tracked – it seems that even if UNCOMMON1 COMMON8 UNCOMMON3 never occurs in the training data, it could be phrase-combined in later provided text. Is that correct? Are multiple COMMON words in a row ignored, so that UNCOMMON1 COMMON2 COMMON4 UNCOMMON3 still counts as an UNCOMMON1+UNCOMMON3 co-occurrence, and potentially rolled-up into UNCOMMON1_COMMON2_COMMON4_UNCOMMON3?

@alexgarel (Contributor, Author)

@gojomo

  5. "UNCOMMON1 COMMON2 UNCOMMON3" makes the vocabulary track "UNCOMMON1_COMMON2_UNCOMMON3" as well as UNCOMMON1 and UNCOMMON3. At resolution time, the score is (ignoring min count):
    freq("UNCOMMON1_COMMON2_UNCOMMON3") / ( freq(UNCOMMON1) * freq(UNCOMMON3) )
    So the common words really do participate in co-occurrence resolution, and it is not equivalent to removing them (that's why this approach is akin to common terms in Elasticsearch, not to stop-word removal).

    5a. I really just need to keep track of the common words between uncommon ones, as already stated above.

    5b. If only the UNCOMMON1 & UNCOMMON3 co-occurrence were counted, it would be closer to removing stop words and re-introducing them at n-gram generation time, and I don't think that would give consistent results. The goal is really to find that "point of view" is a frequent expression relative to the frequencies of "point" and "view" (just disregarding the frequency of "of" in the computation because it is classified as too common).
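That score can be written out as a small sketch. The counts are invented and `score` is a hypothetical helper; the real Phrases scorer also applies a min count and a threshold:

```python
# Sketch of the retention score: the joined n-gram's frequency is
# normalised only by the two uncommon endpoints; "of" never appears
# in the denominator even though it is part of the n-gram key.
def score(ngram, unigrams, ngram_counts, common_terms):
    parts = [w for w in ngram.split("_") if w not in common_terms]
    first, last = parts[0], parts[-1]
    return ngram_counts[ngram] / (unigrams[first] * unigrams[last])

unigrams = {"point": 100, "view": 80, "of": 5000}  # invented counts
ngram_counts = {"point_of_view": 60}
s = score("point_of_view", unigrams, ngram_counts, {"of"})
```

Here `s` is 60 / (100 * 80); the very high count of "of" plays no role, which is exactly the "not equivalent to removal" point above.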

@gojomo (Collaborator) commented May 16, 2017

OK, I think I've almost got it. Regarding (5b), does freq('UNCOMMON1_COMMON2_COMMON4_UNCOMMON3') get counted? (That is, with multiple common words in between?)

(6) Are simple bigrams of one common and one uncommon word ever combined? (From what I now understand of what stats are kept, it seems this would have to be an 'always' or 'never' decision.)

@alexgarel (Contributor, Author) commented May 17, 2017

@gojomo, first of all thanks for your questions; this conversation will help produce clear documentation.

  5b. As you guessed, freq('UNCOMMON1_COMMON2_COMMON4_UNCOMMON3') does get counted. In fact the algorithm does not consider just one common word but any run of common words between two (uncommon) words. In French you will encounter expressions like "président de la république" (trendy these days ;-)), and anyway it does not cost much more to handle that.

  6. I made the implementation choice to never try to combine a 'COMMON_UNCOMMON' or 'UNCOMMON_COMMON' bigram. It does not really make sense: in the equation you have freq(word1_word2) / (freq(word1) * freq(word2)); if, e.g., word1 is common, freq(word1) will be very high (language follows a Zipf law), so there is no chance of the score exceeding the threshold.
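The Zipf argument in answer 6 can be checked with toy numbers (all counts invented for this sketch):

```python
# With a common word as an endpoint, its huge unigram frequency in the
# denominator crushes the score far below that of a content-word pair.
def pair_score(pair_count, freq1, freq2):
    return pair_count / (freq1 * freq2)

content_pair = pair_score(60, 100, 80)      # e.g. a "school bus" style pair
common_endpoint = pair_score(60, 5000, 80)  # e.g. an "of view" style pair
```

Even with identical pair counts, `common_endpoint` comes out dozens of times smaller than `content_pair`, so a fixed threshold would effectively always reject it.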

@alexgarel (Contributor, Author) commented Jun 7, 2017

Hello @gojomo and @tmylk, I have finally continued my work on this.

I created a branch implementing the common terms capability directly in the Phrases / Phraser classes.

I then ran some tests to see how using common terms (or not) would affect performance.
Notebooks here:

According to my non-scientific test, the performance impact on bigrams if you use the new code without common terms is:

  • on creation: no visible impact
  • on usage (getitem calls): between 1% and 2% overhead

This branch is not meant to be merged, of course, but just to serve the discussion. If the performance impact seems OK, I will open a new pull request with this implementation.

@menshikh-iv (Contributor)

Ping @gojomo @alexgarel, what's the status of this PR?

@alexgarel (Contributor, Author)

Hello @menshikh-iv, I will come back soon with a new PR, so I am closing this one.

@alexgarel alexgarel closed this Jul 11, 2017
@piskvorky (Owner)

@alexgarel this sounds like a really great feature -- please let me know if you need any assistance with the new PR.

5 participants