Adding common-words to Phrases #1258

Closed
alexgarel opened this issue Apr 4, 2017 · 8 comments
Labels
difficulty medium (Medium issue: required good gensim understanding & python skills), feature (Issue described a new feature)

Comments

@alexgarel
Contributor

I have a proof of concept of a Phrases class that handles stop words, but before opening a pull request, I would like to know whether there is interest and how best to integrate it.

Currently, if you want to detect n-grams like "car with driver" and "car without driver", you have two options. You can remove stop words before processing, in which case you will only find "car driver". Or you can keep them, in which case you won't find any of those forms, both because they have three words and because the high frequency of "with" prevents them from being scored correctly.

Taking inspiration from Elasticsearch and its common grams filter, I have an implementation that can handle stop words to find those expressions. It does so by registering "car_with_driver" in the vocab instead of "car_with", and by taking this into account when tokenizing phrases. I've made a gist of a draft implementation (not all functions are implemented).
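To illustrate the vocab idea, here is a minimal sketch. It is not the gist itself: `count_vocab` and its parameters are hypothetical names, and the sketch assumes that common words are never counted on their own and are only kept when they sit between two normal words.

```python
from collections import Counter


def count_vocab(sentences, common_words, delimiter="_"):
    """Count normal words and phrase candidates (hypothetical helper).

    Common words found between two normal words are folded into the
    candidate, so "car with driver" is counted as "car_with_driver"
    rather than as "car_with" and "with_driver".
    """
    vocab = Counter()
    for sentence in sentences:
        last_normal = None  # most recent non-common word in this sentence
        pending = []        # common words seen since last_normal
        for word in sentence:
            if word in common_words:
                # Buffer the common word only if a normal word precedes it.
                if last_normal is not None:
                    pending.append(word)
                continue
            vocab[word] += 1
            if last_normal is not None:
                vocab[delimiter.join([last_normal, *pending, word])] += 1
            last_normal = word
            pending = []
    return vocab


vocab = count_vocab(
    [["we", "provide", "car", "with", "driver"],
     ["car", "with", "driver", "available"]],
    common_words={"with"},
)
print(vocab["car_with_driver"])  # 2
print("car_with" in vocab)       # False
```

The candidate counts could then be scored the same way ordinary bigram counts are, which is why the change to the existing class might stay small.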

  1. Is there interest in such a solution?
  2. Should I provide a new class to handle this, or modify the existing class to accept a stopwords parameter (empty by default)?

If there is interest I will do a PR.

@gojomo
Collaborator

gojomo commented Apr 4, 2017

Sounds useful, so I would welcome it as an option.

If adding to the existing class could be a small change, and it leaves behavior of 'classic' mode unchanged when not activated, that seems an OK way to add it. But I know there may be some other efforts in progress to optimize (or Cythonize) Phrases – so @tmylk may have other preferences about a 1st implementation. (I suppose the other alternative would be a separate class, eg PhrasesWithCommonWords, that starts as a direct copy but then adds the new functionality – which could also help make the changes clear for a later merge.)

Looking at the description at the ElasticSearch link, I wonder:

  • should the words-handled-specially be called common_words instead of stop_words, to match their practice?
  • does the postprocessed text then include the same word in both its combined and uncombined forms, as in the examples there? (That'd be an important behavior to make clear to users, as it would change context-window-sensitive analysis that could come later, as in Word2Vec/Doc2Vec.)

@tmylk
Contributor

tmylk commented Apr 4, 2017

It is a needed functionality and a Pure Python implementation is a good place to start. Please make it optional though.

@alexgarel
Contributor Author

Hello, thanks for the comments. I will begin by providing a pure Python stand-alone implementation.

@gojomo, +1 to using common_words as the name.
As for your second point, I'm not sure I understand the question!
phrase[["we", "provide", "car", "with", "driver"]] would yield:
["we", "provide", "car with driver"]

@gojomo changed the title from "Adding stopwords to Phrases" to "Adding common-words to Phrases" on Apr 5, 2017
@gojomo
Collaborator

gojomo commented Apr 5, 2017

Per the example on the elasticsearch page (about "the quick brown is a fox"), I would expect:

input: ["we", "provide", "car", "with", "driver"]

…to yield…

output: ["we", "provide", "car", "car_with", "with", "with_driver", "driver"]

That might be ideal for search-indexing, and some gensim users, but would be unexpected (with unclear implications) for something like Word2Vec neighboring-word context-windows.
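For concreteness, a rough sketch of that overlapping style of output. `common_grams` is a hypothetical helper, not an Elasticsearch or gensim API, and it assumes a bigram is emitted whenever either of two adjacent tokens is a common word:

```python
def common_grams(tokens, common_words, delimiter="_"):
    """Emit every token, plus a bigram for each adjacent pair in which
    at least one token is a common word (hypothetical helper)."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                out.append(token + delimiter + following)
    return out


print(common_grams(["we", "provide", "car", "with", "driver"], {"with"}))
# ['we', 'provide', 'car', 'car_with', 'with', 'with_driver', 'driver']
```

The output is longer than the input and repeats "with" in three tokens, which is what would shift the neighbors seen by a fixed-size context window.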

@alexgarel
Copy link
Contributor Author

@gojomo - to be clear, I only drew inspiration from the common grams filter; my implementation is an adaptation that yields only car_with_driver.

@gojomo
Collaborator

gojomo commented Apr 6, 2017

Despite the difficulties for later order-sensitive windows, the elasticsearch approach seems potentially more valuable to me, in that all possibilities are generated, and then only some might survive a later frequency or salience check. Always combining the common word with the words on both sides seems likely to create overlong phrases.

For example, what would (or should) happen in longer common-uncommon-common-uncommon patterns? Assuming each word marked with * is 'common', does...

 We're having *a sale *on *the hats *with orange *and green spots.

...become just the 3 tokens...

We're having_a_sale_on_the_hats_with_orange_and_green spots.

?
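A small sketch of the unconditional merging being asked about. `merge_through_common` is a hypothetical helper, not the behavior of any gensim class, and it assumes every run of common words between two normal words is always joined, with no score threshold:

```python
def merge_through_common(tokens, common_words, delimiter="_"):
    """Join normal words that are linked by runs of common words,
    always and without any scoring (hypothetical helper)."""
    out = []
    chain = []    # phrase under construction, starting with a normal word
    pending = []  # common words seen since the last normal word in chain
    for token in tokens:
        if token in common_words:
            if chain:
                pending.append(token)
            else:
                out.append(token)  # leading common word, nothing to attach to
            continue
        if chain and pending:
            # Normal word reached through common words: extend the chain.
            chain.extend(pending)
            chain.append(token)
            pending = []
        else:
            # No common word in between: flush the chain and start a new one.
            if chain:
                out.append(delimiter.join(chain))
            chain = [token]
    if chain:
        out.append(delimiter.join(chain))
    out.extend(pending)  # trailing common words stay separate
    return out


tokens = "We're having a sale on the hats with orange and green spots".split()
print(merge_through_common(tokens, {"a", "on", "the", "with", "and"}))
# ["We're", 'having_a_sale_on_the_hats_with_orange_and_green', 'spots']
```

Under that assumption the sentence does collapse to 3 tokens, so some scoring or length limit on each link would be needed to avoid it.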

@rpedela

rpedela commented Apr 11, 2017

What about bigrams in which one of the words is a stop word but the pair is a genuine phrase? This happens in legal documents, with terms such as "Side A" or "Exhibit A". Would the CommonGrams Phraser help with that case?

@alexgarel
Contributor Author

@rpedela no, not in the implementation I proposed in #1263. Common grams are only considered between two normal words.

@menshikh-iv added the "feature" (Issue described a new feature) and "difficulty medium" (Medium issue: required good gensim understanding & python skills) labels on Oct 2, 2017
5 participants