Approx min sketch #270

Closed
wants to merge 5 commits into from


mfcabrera
Contributor

First try of CMSketch. Still many things to fix:

  1. Parameter settings: there is no clear way to choose them, and my tests did not point to good defaults.
  2. It is too slow compared to the big-memory version. Some optimization is probably necessary.
  3. There is no clear way to compute the bigram score for the phrase detection part. The score calculation currently uses the length of the vocab, but CMSketch has neither a length nor an approximation of one.
  4. More testing is necessary.
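
For readers unfamiliar with the structure, here is a minimal pure-Python sketch of a Count-Min Sketch. It is not the code from this PR: the class name, the `from_error` helper, and the choice of `blake2b` as the hash are all assumptions made for illustration. The sizing rule in `from_error` is the standard bound for this structure (width about e/epsilon, depth about ln(1/delta)), which may be one starting point for the parameter question in item 1.

```python
import hashlib
import math


class CountMinSketch:
    """Fixed-memory approximate counter (hypothetical; not the PR's class)."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    @classmethod
    def from_error(cls, epsilon, delta):
        # Standard bound: overcount <= epsilon * total_count
        # with probability >= 1 - delta.
        width = math.ceil(math.e / epsilon)
        depth = math.ceil(math.log(1 / delta))
        return cls(width=width, depth=depth)

    def _buckets(self, key):
        # One salted hash per row, so each row uses a different hash function.
        data = key.encode("utf8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=bytes([row % 256])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def __getitem__(self, key):
        # Collisions can only inflate a cell, so the minimum over the rows
        # is the tightest estimate and never undercounts.
        return min(self.table[row][col] for row, col in self._buckets(key))


sketch = CountMinSketch(width=2000, depth=5)
for word in "the cat sat on the mat the end".split():
    sketch.add(word)
print(sketch["the"], sketch["cat"], sketch["zebra"])
```

Note that the table stores counts only, never keys, which is why the structure cannot report how many distinct items it has seen (item 3).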

@piskvorky
Owner

Thanks Miguel!

I'll be on holiday for the rest of the year, but will review once I get back. Let's ping Mr. @larsmans too.

Also, I came across another "online streamed counting" article, specifically for PMI: http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallNIPS09.pdf .

I only skimmed it quickly -- seems they are solving the task "get me the top-k pairs with the highest PMI score", which is maybe not so useful/applicable for us. What do you think?

@piskvorky
Owner

I had a look, code looks OK, thanks. What to do with that len == 1 though?

I guess it doesn't affect bigram ranking (it's a constant multiplier), but the absolute magnitudes will be off.
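
To illustrate the point about the constant multiplier, here is a small sketch. It assumes the score has the form `(count_ab - min_count) / (count_a * count_b) * vocab_size`; that formula, the function name, and the toy counts are assumptions for illustration, not taken from the PR.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    # Assumed scoring form; vocab_size enters only as a multiplier.
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Made-up counts: (count of first word, count of second word, count of pair).
counts = {
    ("new", "york"): (120, 90, 80),
    ("the", "cat"): (5000, 300, 40),
    ("machine", "learning"): (200, 150, 100),
}


def ranking(vocab_size):
    return sorted(
        counts,
        key=lambda pair: bigram_score(*counts[pair], vocab_size),
        reverse=True,
    )


def accepted(vocab_size, threshold=10.0):
    return [
        pair for pair in counts
        if bigram_score(*counts[pair], vocab_size) > threshold
    ]


# Same order whether the length is the placeholder 1 or a realistic value.
print(ranking(1) == ranking(50000))
# A fixed threshold, however, behaves very differently.
print(accepted(1), accepted(50000))
```

So the ranking survives, but any fixed score threshold would have to be rescaled if the length is stuck at 1.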

There are standard streamed algos for counting the number of distinct elements too: http://www.cs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming.pdf , or https://research.facebook.com/publications/760790850639219/streamed-approximate-counting-of-distinct-elements/
http://antirez.com/news/75 (HyperLogLog)

I guess it's almost exactly the same problem as what you already solved with this PR, in fact a little easier :)
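
As a rough illustration of streamed distinct counting, here is a K-minimum-values estimator. This is a simpler stand-in for HyperLogLog, not HyperLogLog itself, and the class name, the value of `k`, and the hash choice are assumptions. The idea: hash every item to a uniform value, keep only the `k` smallest hashes, and estimate the cardinality from how small the k-th smallest one is.

```python
import hashlib
import heapq


class KMVCounter:
    """Approximate distinct counter using the k smallest hash values."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept
        self._members = set()  # hashes currently kept, to skip duplicates

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def __len__(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0] / 2.0 ** 64  # normalized to (0, 1)
        return int((self.k - 1) / kth_smallest)


vocab_size = KMVCounter(k=256)
for i in range(10000):
    vocab_size.add("word%d" % (i % 5000))  # 5000 distinct items, each twice
print(len(vocab_size))
```

Memory stays bounded by `k` regardless of the stream length, and the relative error shrinks roughly as 1/sqrt(k). An estimate like this could stand in for the missing vocabulary length in the score.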

@mfcabrera
Contributor Author

Thanks! Life got in the way for some months, but I am going to try to finally get this implemented :). I will take a look at the links. Just for reference:

https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
http://www.davidwind.dk/?p=178

@dav009

dav009 commented Jun 21, 2015

Definitely needed for the word2vec vocabulary count.
@mfcabrera :D let me know if I can help with testing/reviewing this PR.

@gojomo gojomo mentioned this pull request Jul 6, 2015
@mfcabrera
Contributor Author

I started using an existing pure-Python implementation of HyperLogLog to fix the cardinality counting issue. It is too slow, but I am going to profile/debug it to see where we can improve performance, maybe with the help of Cython. So right now I am in the "make it run" step, with "make it good" and "make it fast" to follow ;). See: https://github.com/svpcom/hyperloglog

@tmylk
Contributor

tmylk commented Jan 23, 2016

Continued in #508

@tmylk tmylk closed this Jan 23, 2016
4 participants