Approx min sketch #270

mfcabrera · 2014-12-18T16:39:02Z

First attempt at CMSketch. There are still many things to fix:

Parameter settings: there is no clear way to choose them, and my tests have not produced good defaults.
It is too slow compared to the "big memory" version. Some optimization is probably necessary.
There is no clear way to compute the bigram score for the phrase detection part. The score calculation currently uses the length of the vocab, but CMSketch has no length, not even an approximate one.
More testing is necessary.
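
For reference on the parameter question, a minimal Count-Min sketch in pure Python might look like the following. This is an illustrative sketch, not the code from this PR: the class name `CountMinSketch`, the `epsilon`/`delta` parameters, and the salted `blake2b` hashing are all assumptions. It uses the standard sizing width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), under which an estimate overshoots the true count by at most epsilon times the total count with probability at least 1 - delta.

```python
import hashlib
import math


class CountMinSketch:
    """Hypothetical minimal Count-Min sketch (not the PR's implementation)."""

    def __init__(self, epsilon=0.001, delta=0.01):
        # Standard sizing: width controls the error, depth the failure probability.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # sum of all counts added so far

    def _cells(self, key):
        # One salted hash per row; the row index serves as the salt.
        data = key.encode("utf8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row][col] += count
        self.total += count

    def query(self, key):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(key))


cms = CountMinSketch(epsilon=0.01, delta=0.01)
for token in ["new_york", "new_york", "new_york", "san_francisco"]:
    cms.add(token)
estimate = cms.query("new_york")  # never below the true count of 3
```

The table stores only counters, so there is no cheap `len()`: the number of distinct keys has to be tracked or estimated separately.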

piskvorky · 2014-12-20T14:19:13Z

Thanks Miguel!

I'll be on holiday for the rest of the year, but will review once I get back. Let's ping Mr. @larsmans too.

Also, I came across another "online streamed counting" article, specifically for PMI: http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallNIPS09.pdf .

I only skimmed it quickly -- seems they are solving the task "get me the top-k pairs with the highest PMI score", which is maybe not so useful/applicable for us. What do you think?

piskvorky · 2015-01-11T22:27:19Z

I had a look; the code looks OK, thanks. What should we do about that len == 1, though?

I guess it doesn't affect bigram ranking (it's a constant multiplier), but the absolute magnitudes will be off.
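
To illustrate why the constant only shifts magnitudes, here is a small sketch. It assumes a word2vec-style phrase score of the form (count(a, b) - min_count) / (count(a) * count(b)) * vocab_size; the function name `bigram_score` and the toy counts are made up for illustration and are not taken from the PR.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    # Assumed scoring formula; vocab_size enters only as a multiplier.
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Toy counts: bigram -> (count of first word, count of second word, count of pair)
counts = {
    ("new", "york"): (120, 100, 90),
    ("of", "the"): (5000, 8000, 900),
    ("machine", "learning"): (60, 75, 40),
}


def rank(vocab_size):
    # Sort bigrams from highest to lowest score for a given vocab size.
    return sorted(
        counts,
        key=lambda bigram: bigram_score(*counts[bigram], vocab_size),
        reverse=True,
    )


ranking_with_len_1 = rank(1)          # what a sketch reporting len == 1 would give
ranking_with_real_len = rank(50000)   # hypothetical true vocabulary size
```

Both rankings come out identical, but any fixed score threshold tuned against the real vocabulary size would have to be rescaled when the length is reported as 1.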

There are standard streamed algos for counting the number of distinct elements too:
http://www.cs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming.pdf
https://research.facebook.com/publications/760790850639219/streamed-approximate-counting-of-distinct-elements/
http://antirez.com/news/75 (HyperLogLog)

I guess it's almost exactly the same problem as what you already solved with this PR, in fact a little easier :)
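
As a rough illustration of the distinct-count idea, here is a compact HyperLogLog sketch in pure Python. It is not an implementation discussed in this thread; the class name, the `p` parameter (number of index bits), and the use of SHA-1 as the hash are assumptions, and it applies only the small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    """Hypothetical minimal HyperLogLog over string items."""

    def __init__(self, p=10):
        if p < 7:
            raise ValueError("the alpha formula used here assumes p >= 7")
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        raw = self.alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(20000):
    hll.add("word%d" % i)
    hll.add("word%d" % i)   # duplicates do not change the estimate
approx_vocab_size = hll.estimate()
```

With 1024 registers the typical relative error is around 3%, using a fixed amount of memory regardless of how many distinct tokens are seen.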

mfcabrera · 2015-05-29T09:03:31Z

Thanks! After life got in the way for some months, I am going to try to finally get this implemented :). I will take a look at the links. Just for reference:

https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
http://www.davidwind.dk/?p=178

dav009 · 2015-06-21T10:58:17Z

Definitely needed for the word2vec vocabulary count.
@mfcabrera :D let me know if I can be of help testing/reviewing this PR.

mfcabrera · 2015-08-03T19:24:05Z

I started using an existing pure-Python implementation of HyperLogLog to fix the cardinality counting issue. It is too slow, but I am going to profile/debug it to see where we can improve performance, maybe with the help of Cython. So right now I am in the "make it run" step, with "make it good" and "make it fast" to follow ;). See: https://github.com/svpcom/hyperloglog

tmylk · 2016-01-23T21:17:10Z

Continued in #508

Miguel Cabrera added 4 commits December 2, 2014 21:01

First buggy version of count min sketch

70fcbba

Second buggy version count min sketch. Too slow and it is not quite working yet.

67a0b5c

Return 1 instead of 0 for the length of the CMSketch. Bit of documentation

2064bb3

Use exact counting by default

56362ee

piskvorky mentioned this pull request Jun 19, 2015

Building word2vec vocabulary fails #361

Closed

gojomo mentioned this pull request Jul 6, 2015

Prune vocab #385

Merged

Merged last changes from develop into this feature branch

c4971c6

tmylk closed this Jan 23, 2016
