
[WIP] Replace frequency counting using dict by a combination of hyperloglog and CountMinSketch #508

Closed
wants to merge 6 commits into from

Conversation


@janrygl janrygl commented Nov 3, 2015

Re. #400 and related to #406.

Changes:

  • Pruning of words with lower frequencies is removed.
  • Approximate algorithms are used (a minimal sketch of this counting flow follows this list):
    • hyperloglog for counting the vocabulary size (default vocabulary-size error 1%)
    • CountMinSketch for frequency counts (if F is the true frequency, the result lies in the range [F, F * (1 + e)], default e = 0.01)
  • defaultdicts are used for chunk processing (this can be rewritten to build chunks in parallel)
  • Constants may need updating:
    • the threshold can be lower with a higher vocabulary size
    • chunks can be bigger (a sentence is expected to contain 20 words; it can be 10)
    • the allowed errors can be lower
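For illustration, here is a minimal, self-contained sketch of the counting flow described above: exact counts are buffered per chunk in a defaultdict and then flushed into a Count-Min Sketch, while a separate structure tracks the number of distinct tokens. The names below (CountMinSketch, count_corpus, chunk_size) are illustrative, not the identifiers used in this PR, and a plain set stands in for the hyperloglog estimator.

```python
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Tiny Count-Min Sketch: the estimated count is always >= the true count."""

    def __init__(self, width=2_000_000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, token):
        for row in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (row, token)).encode("utf8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def increment(self, token, count=1):
        for row, col in self._cells(token):
            self.table[row][col] += count

    def get(self, token):
        return min(self.table[row][col] for row, col in self._cells(token))


def count_corpus(sentences, chunk_size=2_000_000):
    """Buffer exact counts per chunk in a defaultdict, then flush the whole
    chunk into the sketch with one increment() per distinct token."""
    sketch, distinct = CountMinSketch(), set()  # the PR uses hyperloglog, not a set
    buffer, buffered = defaultdict(int), 0
    for sentence in sentences:
        for token in sentence:
            buffer[token] += 1
            distinct.add(token)
            buffered += 1
        if buffered >= chunk_size:
            for token, freq in buffer.items():
                sketch.increment(token, freq)
            buffer, buffered = defaultdict(int), 0
    for token, freq in buffer.items():  # flush the final partial chunk
        sketch.increment(token, freq)
    return sketch, len(distinct)
```

For example, `count_corpus([["fast", "food"], ["fast", "car"]])` returns a sketch with `sketch.get("fast") == 2` and a distinct-token count of 3.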

Comparison of old and new implementation:

  • 100 000 000 randomly generated sentences (10 % stop words, 90 % random)
  • max_vocab_size=40000000 (ignored in new implementation)

| Implementation  | Old        | New         |
| --------------- | ---------- | ----------- |
| Time (h)        | 6 h 9 min  | 28 h 26 min |
| Time (ratio)    | 100 %      | 462 %       |
| Vocabulary size | 29 700 928 | 761 363 290 |

  • 20 000 000 randomly generated sentences (10 % stop words, 90 % random)
  • max_vocab_size=40000000 (ignored in new implementation)

| Implementation  | Old        | New         |
| --------------- | ---------- | ----------- |
| Time (h)        | 2 h 26 min | 11 h 26 min |
| Time (ratio)    | 100 %      | 469 %       |
| Vocabulary size | 12 692 683 | 309 776 447 |

  • 10 000 000 randomly generated sentences (10 % stop words, 90 % random)
  • max_vocab_size=40000000 (ignored in new implementation)

| Implementation  | Old       | New        |
| --------------- | --------- | ---------- |
| Time (h)        | 36 min    | 2 h 53 min |
| Time (ratio)    | 100 %     | 480 %      |
| Vocabulary size | 4 077 048 | 81 486 060 |

  • from nltk.corpus import gutenberg
  • gutenberg.sents() (98551 sentences)

| Implementation  | Old     | New     |
| --------------- | ------- | ------- |
| Time (s)        | 14 s    | 34 s    |
| Time (ratio)    | 100 %   | 240 %   |
| Vocabulary size | 618 421 | 619 564 |

  • from nltk.corpus import brown
  • brown.sents() (57340 sentences)

| Implementation  | Old     | New     |
| --------------- | ------- | ------- |
| Time (s)        | 7 s     | 24 s    |
| Time (ratio)    | 100 %   | 340 %   |
| Vocabulary size | 503 281 | 504 632 |

```diff
@@ -242,7 +363,7 @@ def __getitem__(self, sentence):
         return [utils.to_unicode(w) for w in new_s]


-if __name__ == '__main__':
+if __name__ == '__main__' and 0:
```
Owner

What is this for?


```diff
-        raise ValueError("min_count should be 1")
+        if min_count > 1:
+            logger.warning("min_count should be 1")
```
Collaborator

Why retain min_count if it now must be a constant 1?

Author

The only reason I left it is to avoid breaking the API (it retains limited functionality: this count is subtracted from all frequencies before computing the final bigram score).
I would be grateful for an example of how to handle the parameter removal elegantly.

Collaborator

In my opinion, when the underlying behavior changes, it can be safer/more honest to break the API (raise a "no such parameter min_count" error) than to maintain a superficial compatibility that no longer has the original effect (silently altering the API), or that indeed fails (with a thrown error) in most situations where the old values would have been supported.

Perhaps here it's better for both implementations to co-exist, at least for a while, as different classes with slightly-different options? That may also make it easier to compare their performance, and allow any project that prefers the old precision, or needs reproducibility of prior runs, to stay with the original implementation unless/until they want the benefits of the new.

Owner

Agreed with @gojomo -- we don't want to keep dead code around.

And in case of Phrases, we don't need the "old" behaviour either. The Phrases functionality is new and not widely used yet, so a clear release note saying things have changed seems enough here.

@piskvorky
Owner

Starting to look up, good progress!

Next steps:

  1. separate the CountMinSketch/HyperLogLog logic into a separate module, decoupled from Phrases
  2. make Phrases flexible in what counting logic it uses. I think the best way is a constructor injection param that accepts a counter object, with an API that supports CountMinSketch / plain dict with pruning / whatever else conforms to the counter API (a sketch of such an interface follows this list)
  3. use the same counter-object injection in word2vec / doc2vec / tfidf...
  4. look into optimizations. The 5x perf hit is actually not that bad, I was expecting worse :) But it seems with a bit of C we could make this much faster still, perhaps even comparable. See e.g. pybloomfiltermmap for a set of fast hashing routines doing a very similar thing.
  5. look into parallelization. The way it's written now, we'd have to use multithreading => C-level again to avoid the GIL. With multiprocessing, sending the large dictionaries around would likely kill the performance, so I'm not sure that's a viable path.
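To make step 2 concrete, here is a hedged sketch of what counter injection could look like; `BigramCounter`, `DictCounter` and `PhraseDetector` are hypothetical names for illustration, not code from this PR or from gensim.

```python
from collections import defaultdict
from typing import Iterable, List, Protocol


class BigramCounter(Protocol):
    """Everything the phrase detector needs from a counter."""

    def increment(self, key: str, count: int = 1) -> None: ...
    def get(self, key: str) -> int: ...


class DictCounter:
    """Plain-dict counter; a CountMinSketch-backed class could expose the same two methods."""

    def __init__(self):
        self._counts = defaultdict(int)

    def increment(self, key, count=1):
        self._counts[key] += count

    def get(self, key):
        return self._counts[key]


class PhraseDetector:
    """Phrases-like class that takes its counting logic as a constructor argument."""

    def __init__(self, sentences: Iterable[List[str]], counter: BigramCounter):
        self.counter = counter
        for sentence in sentences:
            for first, second in zip(sentence, sentence[1:]):
                counter.increment(first)
                counter.increment("%s_%s" % (first, second))
            if sentence:
                counter.increment(sentence[-1])


detector = PhraseDetector([["new", "york"], ["new", "york"]], counter=DictCounter())
print(detector.counter.get("new_york"))  # 2
```

Swapping in an approximate counter would then only require passing a different object with the same increment/get surface.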

@gojomo
Collaborator

gojomo commented Nov 3, 2015

I'm pretty excited about an improved phrases-detection, so here are a bunch of random thoughts/comments:

Re: benchmarks

Can peak memory usage be captured, as well? (Might any of the up-to-5x-slowdowns be caused by swapping?)

Are the timings just for an initial 'survey' pass, or do they also include one 'convert-to-phrases' pass?

Would be very useful to see differences in vocab/bigram count, inferred-phrases, and speed/memory performance on a real corpus, like Wikipedia. (I could possibly try this sometime next week.)

Re: possible optimizations

It seems the 'step' (40,000,000 / 20 = 2,000,000) chunks of sentences are used to minimize calls to the (expensive?) increment(). Two possible alternate optimizations to minimize redundant slot-hashing come to mind:

(1) caching the list of column indexes for the last-N/most-common tokens;

(2) rather than using 2M-sentence batches for precise counts, doing increment() for all tokens at the end of each batch, and then starting with a fresh precise count, use something a bit like the old capped-size dictionary, but when hitting the max size, have the 'purge' do increment()s for the N purged keys. (I'm not quite sure whether such a purge would best be prioritized by lowest counts, highest counts, or oldest keys, in order to minimize the number of times the same key gets increment()ed.)
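A hedged sketch of alternative (2), assuming the approximate counter exposes an increment(key, count) callable: keep a bounded exact-count buffer and, on overflow, flush the rarest keys into the sketch (one of several possible purge priorities); the function and parameter names are illustrative.

```python
from collections import defaultdict


def capped_buffer_count(tokens, increment, max_size=100_000, purge_fraction=0.5):
    """Keep exact counts in a bounded dict; on overflow, flush the lowest-count
    keys into the approximate counter via increment() and drop them."""
    buffer = defaultdict(int)
    for token in tokens:
        buffer[token] += 1
        if len(buffer) > max_size:
            # purge the rarest keys first (other orderings are possible)
            victims = sorted(buffer, key=buffer.get)[: int(max_size * purge_fraction)]
            for key in victims:
                increment(key, buffer.pop(key))
    for key, count in buffer.items():  # final flush
        increment(key, count)
```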

Can CountMinSketch objects of the same shape be added together? If so that's one plausible path to parallelization. (More generally, here and other parts of gensim may benefit from the idea of a corpus that can be read from many files, or many start-points in a file, by separate processes, to get away from the one-linear-reader-handing-items-to-many-workers pattern that doesn't work well with the Python GIL & cross-process serialization overhead.)
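On the merge question: Count-Min Sketch counts are linear, so two sketches built with identical width, depth and hash functions can be combined by elementwise addition of their tables, and the result answers queries as if it had seen both streams. A minimal numpy-based sketch of that idea (illustrative, not code from this PR):

```python
import numpy as np


def merge_sketches(table_a: np.ndarray, table_b: np.ndarray) -> np.ndarray:
    """Combine two Count-Min Sketch tables built with the same dimensions
    and hash functions by adding their cells elementwise."""
    if table_a.shape != table_b.shape:
        raise ValueError("sketches must share width, depth and hash functions")
    return table_a + table_b


# Two workers count disjoint corpus shards into same-shape (depth=2, width=8) tables.
worker_a = np.zeros((2, 8), dtype=np.int64)
worker_b = np.zeros((2, 8), dtype=np.int64)
worker_a[:, 3] += 5  # pretend one token hashed to column 3 in every row, seen 5 times
worker_b[:, 3] += 2  # the same token, 2 more occurrences on the other shard
merged = merge_sketches(worker_a, worker_b)
print(merged[:, 3].min())  # 7 -- a query takes the minimum over the token's cells
```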

Re: misc

It might be interesting to store the unigram and bigram counts (and overall unique tallies) in separate structures, for more visibility into what's happening and to give them separate precisions.

It seems the tunable count parameters (or documentation of same) should include some relationship for 'expected unique inserts'. (Don't the actual error margins depend on how saturated the structure gets, like a Bloom Filter, and beyond a certain chosen load factor the errors go beyond target levels?)
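For the standard Count-Min Sketch parameterization the answer to the parenthetical question is yes: the guarantee is additive in the total number of insertions rather than a fixed relative error. With width ceil(e / epsilon) and depth ceil(ln(1 / delta)), a query overcounts by at most epsilon * N with probability at least 1 - delta, where N is the total number of tokens inserted. A small illustrative calculation (the numbers below are not this PR's constants):

```python
import math


def cms_dimensions(epsilon, delta):
    """Standard Count-Min Sketch sizing: width from the additive error factor
    epsilon, depth from the allowed failure probability delta."""
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth


width, depth = cms_dimensions(epsilon=1e-7, delta=0.01)
print(width, depth)  # 27182819 columns, 5 rows

# The error budget grows with the stream length: estimate <= true + epsilon * N.
for n_tokens in (10**8, 10**9):
    print(n_tokens, "tokens -> overcount of at most", 1e-7 * n_tokens)
```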

@piskvorky piskvorky changed the title Phrases: replace default dict by combination of hyperloglog and CountMinSketch alg. WIP: replace frequency counting using dict by a combination of hyperloglog and CountMinSketch Nov 4, 2015
@piskvorky
Owner

I'll also push in other improvements to phrase detection (not related to counting, mostly perf)... probably next weekend. Since this is a major revamp, it's probably best to keep things in one place, to avoid git conflicts.

@mfcabrera
Contributor

Hi, nice to see someone continued the work. Sadly, due to time/personal issues I couldn't continue. I just wanted to share that I did profile my original code, and I found that one of the things hurting performance was the calculation of the hash functions. I believe some Cython magic might come in handy. Is @janrygl still working on this PR? Let me know if I can help somehow.

@piskvorky
Owner

@janrygl has other duties now, so this PR is not "active" at the moment :) Any help welcome!

Btw there are good, fast hash functions in pybloomfiltermmap (item 4 in my list above). There's a lot of overlap here with that project, both conceptually and implementation-wise.

@piskvorky piskvorky changed the title WIP: replace frequency counting using dict by a combination of hyperloglog and CountMinSketch [WIP] Replace frequency counting using dict by a combination of hyperloglog and CountMinSketch Dec 6, 2015
@tmylk
Contributor

tmylk commented Jan 10, 2016

Pinging @mfcabrera @janrygl - do you think it can be part of the gensim January release?

@janrygl
Author

janrygl commented Jan 11, 2016

@tmylk It depends on the priorities defined by @piskvorky. I hurt my right hand at the beginning of January and I am behind schedule with all my projects.

@piskvorky
Owner

@janrygl won't have time for this in January for sure; don't know about @mfcabrera .

It's a great little algorithmic project though, very pleasant. I wish I had time to tinker with this myself, would make for an exciting blog post / series!

@tmylk tmylk mentioned this pull request Jan 23, 2016
@piskvorky
Owner

Relevant read: Extension to hyperloglog as used by Google. CC @tmylk

@piskvorky
Owner

piskvorky commented May 2, 2016

@thescopan

Any updates on this branch? The Phrases implementation is so slow that it is making me switch to a different library for doc2vec. Any update will be helpful.

@piskvorky
Owner

piskvorky commented Mar 3, 2017

@thescopan I don't think so -- feel free to contribute.

Also, a re-implementation of the existing Phrases in C/Cython would be appreciated too. It's a really small and trivial change, just one loop, but nobody has done it yet. CC @tmylk.

@tmylk
Contributor

tmylk commented May 2, 2017

Cythonising Phrases will be done this summer as part of GSoC.

@piskvorky
Owner

piskvorky commented May 14, 2017

Awesome! This is much-needed functionality.

Fast & scalable collocation (phrase) detection is sorely missing -- even in our own non-open-source projects.

@tmylk
Contributor

tmylk commented May 15, 2017

More specifically, it's on @prakhar2b's timeline for the end of June.

@piskvorky
Owner

More discussion on hyperloglog as used inside reddit:
https://redditblog.com/2017/05/24/view-counting-at-reddit/

@menshikh-iv
Contributor

menshikh-iv commented Jun 13, 2017

Ping @janrygl, what is the status of this PR? Will you finish it soon?

@piskvorky
Owner

@menshikh-iv isn't this one of our Google Summer of Code projects this year?

@menshikh-iv
Contributor

menshikh-iv commented Jun 13, 2017

@piskvorky cythonising Phrases is a project of the current GSoC, but, as I understand it, this is not exactly the same as the topic of this PR.

@piskvorky
Owner

piskvorky commented Jun 13, 2017

Nearly the same -- efficient counting is the biggest challenge there. Phrases should definitely use this new counting functionality, among other modules.

It's an extremely common task, widely useful, and that's why I'd like this to be an independent library.

@menshikh-iv
Contributor

Connected with #1446

@piskvorky piskvorky mentioned this pull request Jul 11, 2017