
Fast object counting + Phrases #1446
Closed – piskvorky wants to merge 7 commits from the fast_counter branch
Conversation

@piskvorky (Owner) commented Jun 23, 2017:

Create a new class for fast, memory-efficient object counting (bounded RAM footprint, even for super-large corpora).

To be used in Phrases, Dictionary, Word2vec etc. – basically everywhere a vocabulary larger than RAM can be expected (long tail, Zipf's law). A rough usage sketch is included below the task list.

Perhaps we can release this as a separate, stand-alone Python library (it would be widely useful even outside of gensim)?

  • design APIs
  • choose efficient algorithms and data structures
  • write a reference implementation in raw Python (slow but correct)
  • optimize hotspots in C/Cython
  • parallelize, if necessary
  • implement Dictionary, Phrases, Word2vec, Doc2vec using this new counter
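
For illustration, a rough usage sketch of the counter this PR adds. The import path and constructor arguments below are assumptions based on the diff, not a final API:

# Sketch only: FastCounter's module location and constructor signature are assumptions.
from gensim.models.phrases import FastCounter

documents = [["machine", "learning", "rocks"], ["machine", "learning", "scales"]]
counter = FastCounter(max_size=1000000)  # cap on retained keys (assumed parameter)
counter.update(documents)                # stream over documents; RAM stays bounded
print(counter.get("machine", 0), len(counter))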

@piskvorky force-pushed the fast_counter branch 3 times, most recently from 18ff52b to 2f69e10, on June 23, 2017, 14:50.
@piskvorky changed the title from "New object counting + Phrases" to "Fast object counting + Phrases" on Jun 23, 2017.
@piskvorky requested a review from jayantj on June 23, 2017, 14:52.
@piskvorky (Owner Author) commented:

@gojomo @menshikh-iv @jayantj WDYT?

This could be a starting point for the new API, to be optimized and then used in Phrases and Dictionary.

@piskvorky (Owner Author) commented Jun 23, 2017:

Using preshed seems to bring a ~2x speedup over a plain defaultdict backend (55 s for the pure-Python FastCounter vs 11 s for the cythonized FastCounter vs 5 s for the cythonized preshed variant, all on the text8 corpus).

"""
for document in documents:
for item in self.doc2items(document):
self.hash2cnt[self.hash(item)] += 1
Collaborator:

The fact that this implementation ignores hash collisions is an important detail to document.

for idx in range(1, l):
    h2 = chash(document[idx])
    hash2cnt[h2] += 1
    hash2cnt[h1 + h2] += 1
Collaborator:

Using simple addition to compute the bigram hash gives (A, B) and (B, A) the same hash value – problematic for phrase promotion.
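
A toy illustration of that symmetry (tokens here are just examples):

h1, h2 = hash("new"), hash("york")
# Additive combination is order-insensitive, so ("new", "york") and ("york", "new")
# land in the same bucket:
print(h1 + h2 == h2 + h1)  # True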

@piskvorky (Owner Author) commented Jun 24, 2017:

Yeah, I had XOR here before, but then A ^ A == 0 == B ^ B, which is also not good. The hashing requires a little more thought; it's too hackish as-is.

Contributor:

We could use h1 ** h2, but that will lead to overflow.

@piskvorky (Owner Author) commented Jun 26, 2017:

That's not a native operation (instruction), so probably not a good idea.

Google shows me this for boost::hash_combine:

size_t hash_combine( size_t lhs, size_t rhs ) {
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

That looks fancy, but probably anything that breaks the symmetry would be enough, such as 3*lhs + rhs.
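
A Python version of that idea, as a sketch only (the 64-bit mask just stands in for C's size_t wrap-around):

MASK64 = (1 << 64) - 1  # stand-in for size_t overflow semantics

def hash_combine(lhs, rhs):
    # Order-sensitive mix modeled on the boost::hash_combine snippet above.
    lhs &= MASK64
    rhs &= MASK64
    return (lhs ^ ((rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2)) & MASK64)) & MASK64

h1, h2 = hash("new") & MASK64, hash("york") & MASK64
print(hash_combine(h1, h2) != hash_combine(h2, h1))  # True (almost surely): bigram order now matters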

@piskvorky (Owner Author) commented:

Here's a repo with a bunch of useful hashing algorithms & their comparison: https://github.com/rurban/smhasher

@piskvorky (Owner Author) commented:

One more interesting resource on hashing (in the context of Bloom filters / count-min sketch counting in #508): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
CC @isamaru
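
For anyone unfamiliar with the term: a count-min sketch keeps a few fixed-size hash-indexed rows and answers frequency queries with the row-wise minimum, so memory stays constant at the cost of (bounded) over-counting. A toy version, purely for illustration – not what this PR implements:

import random

class CountMinSketch(object):
    """Toy count-min sketch: fixed memory, may overestimate counts, never underestimates."""

    def __init__(self, width=1000, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def increment(self, item, count=1):
        for seed, row in zip(self.seeds, self.rows):
            row[hash((seed, item)) % self.width] += count

    def count(self, item):
        return min(row[hash((seed, item)) % self.width]
                   for seed, row in zip(self.seeds, self.rows))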

@gojomo (Collaborator) commented Jun 23, 2017:

Worth testing against collections.Counter, too, for reference.

Certainly worth trying as a drop-in for Phrases – and if/when cythonized or preshed-ized code gives identical output on representative inputs, in less time or memory, replace the existing steps.

@piskvorky (Owner Author) commented Jun 24, 2017:

I'm more interested in feedback on the API right now, not performance.

Is the interface complete – is there anything else we should support? Do you envision any issues when plugging this into Dictionary, or into other use cases that need fast & memory-efficient counting?

Btw I tried Counter too (see commit history), but it's quite a bit slower than defaultdict.
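
For reference, the kind of micro-benchmark I mean – item-by-item increments, as in the FastCounter update loop (numbers are machine-dependent; this is illustrative, not the commit-history measurement):

import timeit

setup = ("from collections import Counter, defaultdict\n"
         "tokens = ['token%d' % (i % 1000) for i in range(100000)]")

print(timeit.timeit("c = Counter()\nfor t in tokens: c[t] += 1", setup=setup, number=10))
print(timeit.timeit("d = defaultdict(int)\nfor t in tokens: d[t] += 1", setup=setup, number=10))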

@piskvorky (Owner Author) commented Jun 24, 2017:

On second thought, how about simply mimicking the API of the stdlib Counter?

Just call it BoundedCounter, add one more constructor parameter (max_size, either #items or #RAM bytes) and one more method (update_from_documents), and keep the rest standard?

Some Counter methods might not be supported in BoundedCounter, depending on what operations the underlying data structure allows, but it may be the path of least surprise for users.
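
Roughly the shape I have in mind – a sketch only; the pruning policy and whether max_size means items or bytes are exactly the open questions:

from collections import Counter

class BoundedCounter(Counter):
    """Sketch: Counter semantics plus an optional cap on the number of retained items."""

    def __init__(self, *args, **kwargs):
        self.max_size = kwargs.pop('max_size', None)  # #items (or RAM bytes, TBD)
        super(BoundedCounter, self).__init__(*args, **kwargs)

    def update_from_documents(self, documents):
        for document in documents:
            self.update(document)  # document = iterable of hashable items
            self.prune()

    def prune(self):
        if self.max_size and len(self) > self.max_size:
            # One possible policy: keep only the max_size most common items.
            for key, _ in self.most_common()[self.max_size:]:
                del self[key]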

@jayantj (Contributor) commented Jun 25, 2017:

About the API, I had the same thing in mind – using the same API as Counter is the most intuitive, I think. (Though I'm not sure what "stdlib Counter" means, I can't find anything about it either – is it the Counter from collections?)
Simply updating counts from an iterable of hashables is the most common use case IMO; update_from_documents is more specific to Gensim/Phrases. It might even be a good idea to keep update_from_documents out of FastCounter completely, and add a simple wrapper/subclass over FastCounter for Gensim with that method. The notion of "documents" is, again, specific to the Gensim use case, I think.

I'm also not sure about the name BoundedCounter, since max_size should only be an optional param. The current FastCounter sounds good to me.

Also, it looks like you were really on fire! :)

# XXX: Or use a fixed-size data structure to start with (hyperloglog?)
while self.max_size and len(self) > self.max_size:
    self.min_reduce += 1
    utils.prune_vocab(self.hash2cnt, self.min_reduce)
Contributor:

This can potentially be very slow, as every utils.prune_vocab call is going to iterate over the entire vocabulary. It could be worthwhile to store a min_count property; depending on the data structures used, this might not be trivial though.
Anyway, the discussion right now is about API/design and not optimization, I suppose.
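
For context, the prune amounts to a full sweep over the vocabulary, roughly like this simplified stand-in (the exact threshold handling in gensim.utils.prune_vocab may differ):

def prune_vocab_sketch(vocab, min_reduce):
    # Simplified stand-in: every call walks *all* keys, dropping the rare ones,
    # so each prune costs O(len(vocab)) regardless of how many keys it removes.
    for key in list(vocab):  # copy the keys, since we delete while iterating
        if vocab[key] < min_reduce:
            del vocab[key]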

@piskvorky (Owner Author) commented:

What would this min_count property do?

Collaborator:

I'd want to see profiling that indicates this survey is a major time cost before spending much time optimizing. If it did need to be optimized, it might be possible to retain a min_reduce/min_count threshold and, during tallying, track when a key's tally exceeds this count, perhaps removing it from a 'most-eligible-for-pruning' list.

With the existing algorithm, where at each prune everything under the count reached last time is automatically discarded, this hypothetical 'most eligible' list could be discarded first – without even looking at the existing counts again. Maybe that'd be enough... though if keeping this 'escalating min_reduce', you'd still need to iterate over all the remaining keys. (It seems a lot like GC, with a 'tenured' set that might be able to skip some collections.)

But as I've mentioned in previous discussions around Phrases: I suspect this 'escalating floor prune' is a somewhat undesirable algorithm, injecting more imprecision/bias into the final counts than necessary, and (maybe) losing a lot of keys that would survive a truly unbounded counting method. (TL;DR: because of typical frequency distributions plus the always-escalating floor, every prune may wind up eliminating 90%+ of all existing words, starving some words of the chance to ever survive a prune, or in some cases inefficiently leaving the final dict at a small fraction of the desired max_size.) Maybe it's necessary to prevent too-frequent prunes, but I'd prefer an option to be more precise via less frequent prunes – especially if I'd hypothetically minimized the full time cost of Phrases analysis via other process improvements or effective parallelism.

@piskvorky (Owner Author) commented Jun 26, 2017:

It looks like we're on to optimizations and choosing algorithms now, so let me tick off the "API design" checkbox.

API conclusion: mimic the Counter API, except add a parameter that allows restricting its maximum size, ideally in bytes. Make it clear that approximations were necessary to achieve this, and what the trade-offs are (Counter methods we cannot support, performance/memory balance implications).

@jayantj @menshikh-iv do you agree?

Contributor:

@piskvorky I agree with you.
Also, maybe we need to add a static method for size calculation (for example, if we store X tokens, how many megabytes of RAM we'd need for that). It could be useful for choosing the max_size parameter.

@piskvorky (Owner Author) commented Jun 30, 2017:

Yes, though the best place for that is probably the docstring. Something like: max_size == 1 GB gives you room for approximately 100,000,000 items, and the number scales linearly with more/less RAM (or whatever the case may be).
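
If we did want it as a helper rather than a docstring note, it would be a one-liner anyway (the ~10 bytes per item below is just the ballpark implied above, not a measured constant):

def estimate_max_items(max_bytes, bytes_per_item=10):
    # Hypothetical helper: 1 GB at ~10 bytes per entry is roughly 100,000,000 items,
    # scaling linearly with available RAM. bytes_per_item is an assumption.
    return max_bytes // bytes_per_item

print(estimate_max_items(2 ** 30))  # ~100 million items per GB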

        self.min_reduce += 1
        utils.prune_vocab(self.hash2cnt, self.min_reduce)

    def get(self, item, default=None):
Contributor:

__getitem__ would be good too.
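
i.e. delegating to the same hashed lookup as get(); a self-contained sketch (FastCounterSketch stands in for the PR's FastCounter):

from collections import defaultdict

class FastCounterSketch(object):
    """Stand-in for the PR's FastCounter, just showing get()/__getitem__ sharing one lookup."""

    def __init__(self):
        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)

    def get(self, item, default=None):
        return self.hash2cnt.get(self.hash(item), default)

    def __getitem__(self, item):
        # Missing keys count as 0, matching collections.Counter behaviour.
        return self.hash2cnt.get(self.hash(item), 0)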

@piskvorky (Owner Author) commented Jun 25, 2017:

@jayantj Without max_size, what would be the difference to Counter? Why not just use Counter or defaultdict (which is already written in optimized C)?

I think the point of this class is that it can process very large iterables (corpora), so the Bounded part is what actually differentiates it.

        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)
Contributor:

Maybe use mmh3 for this purpose?
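
e.g., a sketch (assumes the mmh3 package; mmh3.hash64 returns a pair of 64-bit ints, of which we'd keep one):

import mmh3  # pip install mmh3

def hash_item(item):
    # 64-bit MurmurHash3 instead of Python's builtin hash().
    return mmh3.hash64(item)[0]

print(hash_item("machine learning"))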

@piskvorky (Owner Author) commented Jun 30, 2017:

Additional experiments and timings from @jayantj, building on top of this PR: https://github.com/jayantj/gensim/blob/fast_counter_modifications/docs/notebooks/counter_benchmarks.ipynb

In general, we want to be benchmarking on (much) larger corpora, such as the EN Wikipedia, so that the vocabulary pruning and memory limiting functionality have a chance to kick in.

@menshikh-iv (Contributor) commented Aug 31, 2017:

Continued by @isamaru in a dedicated repo: https://github.com/RaRe-Technologies/bounded_counter

@menshikh-iv deleted the fast_counter branch on October 6, 2017, 10:53.