Approx min sketch #270
Thanks Miguel! I'll be on holiday for the rest of the year, but will review once I get back. Let's ping Mr. @larsmans too. Also, I came across another "online streamed counting" article, specifically for PMI (http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallNIPS09.pdf). I only skimmed it quickly, but it seems they are solving the task "get me the top-k pairs with the highest PMI score", which is maybe not so useful/applicable for us. What do you think?
I had a look, code looks OK, thanks. What to do with that? I guess it doesn't affect bigram ranking (it's a constant multiplier), but the absolute magnitudes will be off. There are standard streamed algos for counting the number of distinct elements too: http://www.cs.berkeley.edu/~satishr/cs270/sp11/rough-notes/Streaming.pdf or https://research.facebook.com/publications/760790850639219/streamed-approximate-counting-of-distinct-elements/ I guess it's almost exactly the same problem as what you already solved with this PR, in fact a little easier :)
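For readers following the thread, the kind of streamed approximate counting discussed here can be illustrated with a Count-Min sketch. This is a minimal pure-Python sketch, not the code from this PR: the class name, the `width`/`depth` defaults, and the SHA-1 based row hashing are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not the PR's code)."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # Derive one bucket per row by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true count that is as tight as possible.
        return min(self.table[row][col] for row, col in self._buckets(item))


cms = CountMinSketch()
for bigram, n in [("new york", 5), ("york city", 2)]:
    cms.add(bigram, n)
print(cms.estimate("new york"))
```

The estimate never undercounts; how much it can overcount depends on `width` and `depth` relative to the total number of items added.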
Thanks! Life got in the way for some months, but I am going to try to get this finally implemented :) I will take a look at the links. Just for reference: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Definitely needed for the word2vec vocabulary count.
I started using an existing pure-Python implementation of HyperLogLog to fix the cardinality counting issue. It is too slow, but I am going to profile/debug to see where we can improve performance, maybe with the help of Cython. So right now I am at the "make it run" step, with "make it good" and "make it fast" to follow ;) See: https://github.com/svpcom/hyperloglog
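As a rough illustration of the cardinality counting mentioned above, here is a minimal HyperLogLog in pure Python. It is a sketch under stated assumptions, not the svpcom/hyperloglog code: the class name, the precision parameter `p`, and the use of the first 64 bits of SHA-1 as the hash are all illustrative choices.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog (hypothetical helper, not the linked library)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for token in ["the", "quick", "brown", "fox", "the"]:
    hll.add(token)
print(round(hll.count()))
```

Memory stays fixed at `2**p` small registers regardless of vocabulary size, and the expected relative error is roughly `1.04 / sqrt(2**p)`. Hashing every token in pure Python dominates the running time, which is the sort of hot loop Cython would target.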
Continued in #508
First try of CMSketch. Still many things to fix: