
Add common terms phrases model #1263

Closed
wants to merge 2 commits

Conversation

alexgarel (Contributor)

This follows feature request #1258.

This adds a new model, CommonTermsPhrases, and its companion CommonTermsPhraser.
These classes make it possible to take common terms (a.k.a. stop words) into account when searching for bigrams. Common terms are not counted in the frequency computation, but they are kept in the final bigram.

This makes it possible to catch expressions like "eye of the beholder" or "lack of interest" (in languages like French, common terms inside expressions are quite frequent).

@alexgarel (Contributor, Author)

@tmylk @gojomo here is my proposal. Feel free to comment.

@tmylk (Contributor) commented May 2, 2017

@alexgarel Thanks for the new feature and the tests. Apologies for the late review.
It seems that a lot of code is duplicated from the usual Phrases, which is a maintenance issue. Would it be possible to add `common_terms` as an optional parameter to the existing `Phraser`?

@gojomo (Collaborator) commented May 14, 2017

I'm concerned that this isn't quite the ElasticSearch "common grams" approach, but something else, only described/demonstrated/named in the related issue and source code. I can believe it'd be useful in some cases, but have no idea what they are. Without careful/detailed docs of why it does what it does, my thought is it'd be even better further separated from the existing Phrases functionality – for example, marked as experimental in a separate file. Also, I'm still wondering about what does (or should) happen in alternating common-uncommon-common-uncommon-etc cases like the example in my question on the issue.

@alexgarel (Contributor, Author)

@gojomo you're right, it's not "common grams"; it only takes inspiration from that approach.
Since it's not clear, here is another way to see it: it stops common terms (a.k.a. stop words) from getting in the way of building bigrams from the other words. In the current state of Phrases it is easy to capture "school bus", but it is impossible (even with more than one pass of the bigram model) to capture "bank of america" or "way of life".
This implementation makes that possible: stop words between other words are retained in the final expression, but not counted in the statistics.

While handling stop words may matter less in English, because most expressions do not contain them, it is a greater concern in French, for example, where many more expressions use them ("bain de bouche", "don de sang", "prêt à porter", etc.).
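The counting idea can be sketched in a few lines of plain Python. This is a toy illustration only, not the code from this PR; `count_candidates` and its signature are invented for the example:

```python
# Toy sketch of the counting pass: a run of common terms between two
# content words does not break the pair; the commons are kept inside
# the n-gram key, but are never counted as unigrams themselves.
from collections import Counter

def count_candidates(tokens, common_terms):
    unigrams, bigrams = Counter(), Counter()
    prev, between = None, []   # last uncommon word, commons seen since
    for tok in tokens:
        if tok in common_terms:
            if prev is not None:
                between.append(tok)
            continue
        unigrams[tok] += 1
        if prev is not None:
            bigrams["_".join([prev] + between + [tok])] += 1
        prev, between = tok, []
    return unigrams, bigrams

unis, bis = count_candidates("bank of america and bank of england".split(),
                             {"of", "and"})
```

Here `bis` ends up containing `bank_of_america` and `bank_of_england`, while `of` never appears among the unigram counts.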

@alexgarel (Contributor, Author)

@tmylk we have to make a decision, and it's not an easy one:

  1. add the common terms handling to the main Phrases and Phraser
  2. refactor Phrases and Phraser so that a separate common-terms class can share more of their code
  3. keep things separate

Options 1 and 2 may incur a performance penalty, while option 3 carries a higher maintenance cost.

We could benchmark solutions 1 and 2 to measure the cost. Do you have a preferred dataset for running a benchmark?

I'm happy to work more on this, and to add more explanation of the behaviour, as @gojomo requested.

@gojomo (Collaborator) commented May 15, 2017

Thanks for the clarification & offer of more integration/doc work!

Is there an article or paper on this technique anywhere? Is it a common/popular technique in other languages, with any existing name?

I'm still a bit unclear about what could possibly happen with certain patterns, like UNCOMMON1 COMMON2 UNCOMMON3 COMMON4 UNCOMMON5. Can this become UNCOMMON1_COMMON2_UNCOMMON3_COMMON4_UNCOMMON5 in single pass? Or are trigrams the largest combinations that come from a single pass?

If a trigram like UNCOMMON1_COMMON2_UNCOMMON3 is creatable after a single pass, is that because the co-occurrence of UNCOMMON1 & UNCOMMON3 (skipping over the COMMON2 term) was learned in the 1st pass?

Will this approach do any of the automatic COMMON_UNCOMMON bigram combining, as in the example on the Elasticsearch Common Grams TokenFilter page? Or is it just a matter of making the COMMON words invisible to phrase-analysis, then attaching them internally (never on the ends) in any phrases still detected?

@alexgarel (Contributor, Author)

@gojomo

  1. There is no publication or existing implementation that I'm aware of, but it's really just an adaptation of the existing bigram algorithm.

  2. Yes, UNCOMMON1 COMMON2 UNCOMMON3 COMMON4 UNCOMMON5 would become UNCOMMON1_COMMON2_UNCOMMON3_COMMON4_UNCOMMON5 in a single pass, if the statistics support it.

  3. UNCOMMON1_COMMON2_UNCOMMON3 is creatable because the co-occurrence of UNCOMMON1 & UNCOMMON3 (with COMMON2 in between) was learned. That is, the count is of UNCOMMON1 co-occurring with COMMON2 UNCOMMON3, but COMMON2's frequency is not used in the retention formula.

  4. Yes, it's really about combining COMMON_UNCOMMON.
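A single-pass chain like the one in answer 2 can be sketched as follows. This is toy code, not the PR's implementation; `accepted` stands in for whatever set of n-grams passed the score threshold:

```python
# Toy sketch of a greedy single-pass joiner: the current phrase keeps
# growing as long as each (uncommon, commons..., uncommon) step is in
# the accepted set, so chains longer than a trigram can form in one pass.
def join_in_one_pass(tokens, common_terms, accepted):
    out, cur, pending = [], [], []
    for tok in tokens:
        if tok in common_terms:
            (pending if cur else out).append(tok)
            continue
        if not cur:
            cur = [tok]
            continue
        step = "_".join([cur[-1]] + pending + [tok])  # cur[-1] is uncommon
        if step in accepted:
            cur += pending + [tok]
        else:
            out.append("_".join(cur))
            out.extend(pending)
            cur = [tok]
        pending = []
    if cur:
        out.append("_".join(cur))
        out.extend(pending)
    return out

merged = join_in_one_pass(
    ["u1", "c2", "u3", "c4", "u5"],
    {"c2", "c4"},
    {"u1_c2_u3", "u3_c4_u5"},
)
```

With both steps accepted, `merged` is the single token `u1_c2_u3_c4_u5`; if only the first step were accepted, the pass would stop at `u1_c2_u3`.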

@gojomo (Collaborator) commented May 15, 2017

I think I'm starting to get the gist of it. Regarding the interaction of your answers (3) & (4):

(5) Are any of the co-occurrence counts tracked with regard to combined common-words – for example, is the creation of UNCOMMON1_COMMON2_UNCOMMON3 because of UNCOMMON1 co-occurrence count with COMMON2_UNCOMMON3? Or is only the UNCOMMON1 & UNCOMMON3 co-occurrence really counted (without any regard a common-word between)?

(5a) If the former – common words that are auto-bigrammed become part of the co-occurrence stat keys – will an original text like UNCOMMON1 COMMON2 UNCOMMON3 generate counts for both the UNCOMMON1+COMMON2_UNCOMMON3 co-occurrence and the UNCOMMON1_COMMON2+UNCOMMON3 co-occurrence? Or is there either some greedy left-to-right, or prefers-common-on-left, practice that only pairs the COMMON with one or the other of its neighbors?

(5b) If the latter – only UNCOMMON to UNCOMMON co-occurrences are truly tracked – it seems that even if UNCOMMON1 COMMON8 UNCOMMON3 never occurs in the training data, it could be phrase-combined in later provided text. Is that correct? Are multiple COMMON words in a row ignored, so that UNCOMMON1 COMMON2 COMMON4 UNCOMMON3 still counts as an UNCOMMON1+UNCOMMON3 co-occurrence, and potentially rolled-up into UNCOMMON1_COMMON2_COMMON4_UNCOMMON3?

@alexgarel (Contributor, Author)

@gojomo

  5. "UNCOMMON1 COMMON2 UNCOMMON3" makes the vocabulary track "UNCOMMON1_COMMON2_UNCOMMON3" as well as UNCOMMON1 and UNCOMMON3. At resolution time, the score is (ignoring min count):
    freq("UNCOMMON1_COMMON2_UNCOMMON3") / ( freq(UNCOMMON1) * freq(UNCOMMON3) )
    So the common words really do participate in co-occurrence resolution, and it is not equivalent to removing them (that's why this approach is akin to common terms in Elasticsearch, not to stop-word removal).

    5a. I really just need to keep track of the common words between uncommon ones, as already stated above.

    5b. If only the UNCOMMON1 & UNCOMMON3 co-occurrence were counted, it would be closer to removing stop words and re-introducing them at n-gram generation time, and I don't think that would give consistent results. The goal is really to find that "point of view" is a frequent expression relative to the frequencies of "point" and "view" (just disregarding the frequency of "of" in the computation because it is classified as too common).
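That score can be written out as a small sketch. The counts are invented and `score` is a hypothetical helper; the real Phrases scorer also applies a min count and a threshold:

```python
# Sketch of the retention score: the joined n-gram's frequency is
# normalised only by the two uncommon endpoints; "of" never appears
# in the denominator even though it is part of the n-gram key.
def score(ngram, unigrams, ngram_counts, common_terms):
    parts = [w for w in ngram.split("_") if w not in common_terms]
    first, last = parts[0], parts[-1]
    return ngram_counts[ngram] / (unigrams[first] * unigrams[last])

unigrams = {"point": 100, "view": 80, "of": 5000}  # invented counts
ngram_counts = {"point_of_view": 60}
s = score("point_of_view", unigrams, ngram_counts, {"of"})
```

Here `s` is 60 / (100 * 80); the very high count of "of" plays no role, which is exactly the "not equivalent to removal" point above.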

@gojomo (Collaborator) commented May 16, 2017

OK, I think I've almost got it. Regarding (5b), does freq('UNCOMMON1_COMMON2_COMMON4_UNCOMMON3') get counted? (That is, with multiple common words in between?)

(6) Are simple bigrams of one common and one uncommon word ever combined? (From what I now understand of what stats are kept, it seems this would have to be an 'always' or 'never' decision.)

@alexgarel (Contributor, Author) commented May 17, 2017

@gojomo, first of all thanks for your questions; this conversation will help produce clear documentation.

  5b. As you guessed, freq('UNCOMMON1_COMMON2_COMMON4_UNCOMMON3') does get counted. In fact the algorithm does not consider just one common word but any run of common words between two (uncommon) words. In French you will encounter expressions like "président de la république" (trendy these days ;-)), and anyway it does not cost much more to handle that.

  6. I made the implementation choice to never try to combine a 'COMMON_UNCOMMON' or 'UNCOMMON_COMMON' bigram. It does not really make sense: in the equation you have freq(word1_word2) / (freq(word1) * freq(word2)); if, e.g., word1 is common, freq(word1) will be very high (language follows a Zipf law), so there is no chance of the score exceeding the threshold.
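The Zipf argument in answer 6 can be checked with toy numbers (all counts invented for this sketch):

```python
# With a common word as an endpoint, its huge unigram frequency in the
# denominator crushes the score far below that of a content-word pair.
def pair_score(pair_count, freq1, freq2):
    return pair_count / (freq1 * freq2)

content_pair = pair_score(60, 100, 80)      # e.g. a "school bus" style pair
common_endpoint = pair_score(60, 5000, 80)  # e.g. an "of view" style pair
```

Even with identical pair counts, `common_endpoint` comes out dozens of times smaller than `content_pair`, so a fixed threshold would effectively always reject it.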

@alexgarel (Contributor, Author) commented Jun 7, 2017

Hello @gojomo and @tmylk, I have finally continued my work on this.

I created a branch implementing the common terms capability directly in the Phrases / Phraser classes.

I then ran some tests to see how using common terms (or not) would affect performance.
Notebooks here:

According to my non-scientific test, the performance impact on bigrams if you use the new code without common terms is:

  • on creation: no visible impact
  • on usage (getitem calls): between 1% and 2% overhead

This branch is not meant to be merged, of course, but just to serve the discussion. If the performance impact seems OK, I will open a new pull request with this implementation.

@menshikh-iv (Contributor)

Ping @gojomo @alexgarel, what's the status of this PR?

@alexgarel (Contributor, Author)

Hello @menshikh-iv, I will come back soon with a new PR, so I am closing this one.

@alexgarel alexgarel closed this Jul 11, 2017
@piskvorky (Owner)

@alexgarel this sounds like a really great feature -- please let me know if you need any assistance with the new PR.

5 participants