
Fast object counting + Phrases #1446
Closed – piskvorky wants to merge 7 commits from the fast_counter branch
Conversation

@piskvorky (Owner) commented Jun 23, 2017:

Create a new class for fast, memory-efficient object counting (bounded RAM footprint, even for super-large corpora).

To be used in Phrases, Dictionary, Word2vec etc. – basically everywhere a vocabulary larger than RAM can be expected (long tail, Zipf's law). A rough usage sketch is included below the task list.

Perhaps we can release this as a separate, stand-alone Python library (it would be widely useful even outside of gensim)?

  • design APIs
  • choose efficient algorithms and data structures
  • write a reference implementation in raw Python (slow but correct)
  • optimize hotspots in C/Cython
  • parallelize, if necessary
  • implement Dictionary, Phrases, Word2vec, Doc2vec using this new counter
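
For illustration, a rough usage sketch of the counter this PR adds. The import path and constructor arguments below are assumptions based on the diff, not a final API:

# Sketch only: FastCounter's module location and constructor signature are assumptions.
from gensim.models.phrases import FastCounter

documents = [["machine", "learning", "rocks"], ["machine", "learning", "scales"]]
counter = FastCounter(max_size=1000000)  # cap on retained keys (assumed parameter)
counter.update(documents)                # stream over documents; RAM stays bounded
print(counter.get("machine", 0), len(counter))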

@piskvorky force-pushed the fast_counter branch 3 times, most recently from 18ff52b to 2f69e10, on June 23, 2017, 14:50.
@piskvorky changed the title from "New object counting + Phrases" to "Fast object counting + Phrases" on Jun 23, 2017.
@piskvorky requested a review from jayantj on June 23, 2017, 14:52.
@piskvorky (Owner Author) commented:

@gojomo @menshikh-iv @jayantj WDYT?

This could be a starting point for the new API, to be optimized and then used in Phrases and Dictionary.

@piskvorky (Owner Author) commented Jun 23, 2017:

Using preshed seems to bring a ~2x speedup over a plain defaultdict backend (55 s for the pure-Python FastCounter vs 11 s for the cythonized FastCounter vs 5 s for the cythonized preshed variant, all on the text8 corpus).

"""
for document in documents:
for item in self.doc2items(document):
self.hash2cnt[self.hash(item)] += 1
Collaborator:

The fact that this implementation ignores hash collisions is an important detail to document.

for idx in range(1, l):
    h2 = chash(document[idx])
    hash2cnt[h2] += 1
    hash2cnt[h1 + h2] += 1
Collaborator:

Using simple addition to compute the bigram hash gives (A, B) and (B, A) the same hash value – problematic for phrase promotion.
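
A toy illustration of that symmetry (tokens here are just examples):

h1, h2 = hash("new"), hash("york")
# Additive combination is order-insensitive, so ("new", "york") and ("york", "new")
# land in the same bucket:
print(h1 + h2 == h2 + h1)  # True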

@piskvorky (Owner Author) commented Jun 24, 2017:

Yeah, I had XOR here before, but then A ^ A == 0 == B ^ B, which is also not good. The hashing requires a little more thought; it's too hackish as-is.

Contributor:

We could use h1 ** h2, but that will lead to overflow.

@piskvorky (Owner Author) commented Jun 26, 2017:

That's not a native operation (instruction), so probably not a good idea.

Google shows me this for boost::hash_combine:

size_t hash_combine( size_t lhs, size_t rhs ) {
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

That looks fancy, but probably anything that breaks the symmetry would be enough, such as 3*lhs + rhs.
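
A Python version of that idea, as a sketch only (the 64-bit mask just stands in for C's size_t wrap-around):

MASK64 = (1 << 64) - 1  # stand-in for size_t overflow semantics

def hash_combine(lhs, rhs):
    # Order-sensitive mix modeled on the boost::hash_combine snippet above.
    lhs &= MASK64
    rhs &= MASK64
    return (lhs ^ ((rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2)) & MASK64)) & MASK64

h1, h2 = hash("new") & MASK64, hash("york") & MASK64
print(hash_combine(h1, h2) != hash_combine(h2, h1))  # True (almost surely): bigram order now matters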

@piskvorky (Owner Author) commented:

Here's a repo with a bunch of useful hashing algorithms & their comparison: https://github.com/rurban/smhasher

@piskvorky (Owner Author) commented:

One more interesting resource on hashing (in the context of Bloom filters / count-min sketch counting in #508): https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
CC @isamaru
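
For anyone unfamiliar with the term: a count-min sketch keeps a few fixed-size hash-indexed rows and answers frequency queries with the row-wise minimum, so memory stays constant at the cost of (bounded) over-counting. A toy version, purely for illustration – not what this PR implements:

import random

class CountMinSketch(object):
    """Toy count-min sketch: fixed memory, may overestimate counts, never underestimates."""

    def __init__(self, width=1000, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def increment(self, item, count=1):
        for seed, row in zip(self.seeds, self.rows):
            row[hash((seed, item)) % self.width] += count

    def count(self, item):
        return min(row[hash((seed, item)) % self.width]
                   for seed, row in zip(self.seeds, self.rows))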

@gojomo (Collaborator) commented Jun 23, 2017:

Worth testing against collections.Counter, too, for reference.

Certainly worth trying as a drop-in for Phrases – and if/when cythonized or preshed-ized code gives identical output on representative inputs, in less time or memory, replace the existing steps.

@piskvorky (Owner Author) commented Jun 24, 2017:

I'm more interested in feedback on the API right now, not performance.

Is the interface complete – is there anything else we should support? Do you envision any issues when plugging this into Dictionary, or into other use cases that need fast & memory-efficient counting?

Btw I tried Counter too (see commit history), but it's quite a bit slower than defaultdict.
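
For reference, the kind of micro-benchmark I mean – item-by-item increments, as in the FastCounter update loop (numbers are machine-dependent; this is illustrative, not the commit-history measurement):

import timeit

setup = ("from collections import Counter, defaultdict\n"
         "tokens = ['token%d' % (i % 1000) for i in range(100000)]")

print(timeit.timeit("c = Counter()\nfor t in tokens: c[t] += 1", setup=setup, number=10))
print(timeit.timeit("d = defaultdict(int)\nfor t in tokens: d[t] += 1", setup=setup, number=10))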

@piskvorky (Owner Author) commented Jun 24, 2017:

On second thought, how about simply mimicking the API of the stdlib Counter?

Just call it BoundedCounter, add one more constructor parameter (max_size, either #items or #RAM bytes) and one more method (update_from_documents), and keep the rest standard?

Some Counter methods might not be supported in BoundedCounter, depending on what operations the underlying data structure allows, but it may be the path of least surprise for users.
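
Roughly the shape I have in mind – a sketch only; the pruning policy and whether max_size means items or bytes are exactly the open questions:

from collections import Counter

class BoundedCounter(Counter):
    """Sketch: Counter semantics plus an optional cap on the number of retained items."""

    def __init__(self, *args, **kwargs):
        self.max_size = kwargs.pop('max_size', None)  # #items (or RAM bytes, TBD)
        super(BoundedCounter, self).__init__(*args, **kwargs)

    def update_from_documents(self, documents):
        for document in documents:
            self.update(document)  # document = iterable of hashable items
            self.prune()

    def prune(self):
        if self.max_size and len(self) > self.max_size:
            # One possible policy: keep only the max_size most common items.
            for key, _ in self.most_common()[self.max_size:]:
                del self[key]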

@jayantj (Contributor) commented Jun 25, 2017:

About the API, I had the same thing in mind – using the same API as Counter is the most intuitive, I think. (Though I'm not sure what "stdlib Counter" means, I can't find anything about it either – is it the Counter from collections?)
Simply updating counts from an iterable of hashables is the most common use case IMO; update_from_documents is more specific to Gensim/Phrases. It might even be a good idea to keep update_from_documents out of FastCounter completely, and add a simple wrapper/subclass over FastCounter for Gensim with that method. The notion of "documents" is, again, specific to the Gensim use case, I think.

I'm also not sure about the name BoundedCounter, since max_size should only be an optional param. The current FastCounter sounds good to me.

Also, it looks like you were really on fire! :)

# XXX: Or use a fixed-size data structure to start with (hyperloglog?)
while self.max_size and len(self) > self.max_size:
    self.min_reduce += 1
    utils.prune_vocab(self.hash2cnt, self.min_reduce)
Contributor:

This can potentially be very slow, as every utils.prune_vocab call is going to iterate over the entire vocabulary. It could be worthwhile to store a min_count property; depending on the data structures used, this might not be trivial though.
Anyway, the discussion right now is about API/design and not optimization, I suppose.
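
For context, the prune amounts to a full sweep over the vocabulary, roughly like this simplified stand-in (the exact threshold handling in gensim.utils.prune_vocab may differ):

def prune_vocab_sketch(vocab, min_reduce):
    # Simplified stand-in: every call walks *all* keys, dropping the rare ones,
    # so each prune costs O(len(vocab)) regardless of how many keys it removes.
    for key in list(vocab):  # copy the keys, since we delete while iterating
        if vocab[key] < min_reduce:
            del vocab[key]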

@piskvorky (Owner Author) commented:

What would this min_count property do?

Collaborator:

I'd want to see profiling that indicates this survey is a major time cost before spending much time optimizing. If it did need to be optimized, it might be possible to retain a min_reduce/min_count threshold and, during tallying, track when a key's tally exceeds this count, perhaps removing it from a 'most-eligible-for-pruning' list.

With the existing algorithm, where at each prune everything under the count reached last time is automatically discarded, this hypothetical 'most eligible' list could be discarded first – without even looking at the existing counts again. Maybe that'd be enough... though if keeping this 'escalating min_reduce', you'd still need to iterate over all the remaining keys. (It seems a lot like GC, with a 'tenured' set that might be able to skip some collections.)

But as I've mentioned in previous discussions around Phrases: I suspect this 'escalating floor prune' is a somewhat undesirable algorithm, injecting more imprecision/bias into the final counts than necessary, and (maybe) losing a lot of keys that would survive a truly unbounded counting method. (TL;DR: because of typical frequency distributions plus the always-escalating floor, every prune may wind up eliminating 90%+ of all existing words, starving some words of the chance to ever survive a prune, or in some cases inefficiently leaving the final dict at a small fraction of the desired max_size.) Maybe it's necessary to prevent too-frequent prunes, but I'd prefer an option to be more precise via less frequent prunes – especially if I'd hypothetically minimized the full time cost of Phrases analysis via other process improvements or effective parallelism.

@piskvorky (Owner Author) commented Jun 26, 2017:

It looks like we're on to optimizations and choosing algorithms now, so let me tick off the "API design" checkbox.

API conclusion: mimic the Counter API, except add a parameter that allows restricting its maximum size, ideally in bytes. Make it clear that approximations were necessary to achieve this, and what the trade-offs are (Counter methods we cannot support, performance/memory balance implications).

@jayantj @menshikh-iv do you agree?

Contributor:

@piskvorky I agree with you.
Also, maybe we need to add a static method for size calculation (for example, if we store X tokens, how many megabytes of RAM we'd need for that). It could be useful for choosing the max_size parameter.

@piskvorky (Owner Author) commented Jun 30, 2017:

Yes, though the best place for that is probably the docstring. Something like: max_size == 1 GB gives you room for approximately 100,000,000 items, and the number scales linearly with more/less RAM (or whatever the case may be).
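
If we did want it as a helper rather than a docstring note, it would be a one-liner anyway (the ~10 bytes per item below is just the ballpark implied above, not a measured constant):

def estimate_max_items(max_bytes, bytes_per_item=10):
    # Hypothetical helper: 1 GB at ~10 bytes per entry is roughly 100,000,000 items,
    # scaling linearly with available RAM. bytes_per_item is an assumption.
    return max_bytes // bytes_per_item

print(estimate_max_items(2 ** 30))  # ~100 million items per GB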

        self.min_reduce += 1
        utils.prune_vocab(self.hash2cnt, self.min_reduce)

    def get(self, item, default=None):
Contributor:

__getitem__ would be good too.
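
i.e. delegating to the same hashed lookup as get(); a self-contained sketch (FastCounterSketch stands in for the PR's FastCounter):

from collections import defaultdict

class FastCounterSketch(object):
    """Stand-in for the PR's FastCounter, just showing get()/__getitem__ sharing one lookup."""

    def __init__(self):
        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)

    def get(self, item, default=None):
        return self.hash2cnt.get(self.hash(item), default)

    def __getitem__(self, item):
        # Missing keys count as 0, matching collections.Counter behaviour.
        return self.hash2cnt.get(self.hash(item), 0)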

@piskvorky (Owner Author) commented Jun 25, 2017:

@jayantj Without max_size, what would be the difference to Counter? Why not just use Counter or defaultdict (which is already written in optimized C)?

I think the point of this class is that it can process very large iterables (corpora), so the Bounded part is what actually differentiates it.

        self.hash2cnt = defaultdict(int)

    def hash(self, item):
        return hash(item)
Contributor:

Maybe use mmh3 for this purpose?
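
e.g., a sketch (assumes the mmh3 package; mmh3.hash64 returns a pair of 64-bit ints, of which we'd keep one):

import mmh3  # pip install mmh3

def hash_item(item):
    # 64-bit MurmurHash3 instead of Python's builtin hash().
    return mmh3.hash64(item)[0]

print(hash_item("machine learning"))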

@piskvorky (Owner Author) commented Jun 30, 2017:

Additional experiments and timings from @jayantj, building on top of this PR: https://github.com/jayantj/gensim/blob/fast_counter_modifications/docs/notebooks/counter_benchmarks.ipynb

In general, we want to be benchmarking on (much) larger corpora, such as the EN Wikipedia, so that the vocabulary pruning and memory limiting functionality have a chance to kick in.

@menshikh-iv (Contributor) commented Aug 31, 2017:

Continued by @isamaru in a dedicated repo: https://github.com/RaRe-Technologies/bounded_counter

@menshikh-iv deleted the fast_counter branch on October 6, 2017, 10:53.