
Fix scan vocab speed issue, build vocab from provided word frequencies #1599

Merged: 10 commits merged into piskvorky:develop on Oct 19, 2017

Conversation

@jodevak (Contributor) commented Sep 25, 2017

This request has two parts:

1- There was a noticeable speed issue with scan_vocab, and it turned out to be the `sum(itervalues(vocab))` call: it iterates through the whole vocab once every `progress_per` sentences, which carries a high cost for a big vocab. With my modification it took only 45 minutes to iterate and build the whole vocab over 57 GB of production-ready word co-occurrences (window=1), versus 270 minutes with the old implementation.

2- Since building the vocab is a single-threaded operation, it would be very helpful to have a function that builds a word vocabulary from pre-computed word frequencies (the build_vocab_from_freq function). For example, one could use Spark to count the words in a distributed way and then pipe the word frequencies into gensim's word2vec, as in the sketch below.
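
For concreteness, a minimal usage sketch of the proposed API (the frequency values and model hyperparameters are made up for illustration):

```python
from gensim.models import Word2Vec

# word frequencies pre-counted elsewhere, e.g. by a distributed Spark job
word_freq = {"system": 120, "graph": 45, "trees": 30}

model = Word2Vec(size=100, min_count=5)  # hypothetical hyperparameters
model.build_vocab_from_freq(word_freq)   # skips the single-threaded corpus scan
```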

@menshikh-iv (Contributor) left a comment:

Thanks for your PR @jodevak, please make small fixes and I'll merge your PR.

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


@menshikh-iv (Contributor):

Too many blank lines

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
"""
@menshikh-iv (Contributor):

Please add documentation in numpy-style

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
@menshikh-iv (Contributor):

Add test for this method
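
For reference, a hypothetical sketch of such a test (not the one that eventually landed), assuming the gensim 3.x vocabulary API (`model.wv.vocab`):

```python
import unittest

from gensim.models import Word2Vec


class TestBuildVocabFromFreq(unittest.TestCase):
    def test_counts_preserved(self):
        freq = {"graph": 10, "system": 8, "trees": 6}
        model = Word2Vec(min_count=5)
        model.build_vocab_from_freq(freq)
        # every word meeting min_count should be in the vocabulary,
        # with its pre-counted frequency intact
        self.assertEqual(set(model.wv.vocab), set(freq))
        self.assertEqual(model.wv.vocab["graph"].count, 10)


if __name__ == "__main__":
    unittest.main()
```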

"PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
sentence_no, sum(itervalues(vocab)) + total_words, len(vocab)
)
logger.info("PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
@menshikh-iv (Contributor):

Please use hanging indents

@piskvorky (Owner) commented Oct 16, 2017

@jodevak I find 1) weird. Summing a bunch of values should be very fast, no matter the dictionary size.

What was your progress_per, how often did this summation occur (once every X seconds)?

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky
1- My progress_per was 10000, the vocab size was almost 2,700,000, and the total word co-occurrences were about two and a half billion. I haven't measured the time for this operation alone; I just compared the total run time of both implementations.

2- Summing the values requires iterating over all the dictionary values, in other words over all the stored word counts, which is definitely slower than incrementing a single counter.

@menshikh-iv
1- In progress 👍

jodevak added a commit: Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab
@piskvorky (Owner):
Seems the progress_per is too low; that's not its intended use case. What is the reason for this?

Btw, we'll be replacing all the counting stuff with Bounter, so this will be moot.

Only needs some code style fixes (vertical indent), otherwise LGTM 👍

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky I edited the comment; progress_per is 10000, which is the default value. I hope you give it a try on some randomly generated word co-occurrences. Anyway, thank you :)

@piskvorky (Owner) commented Oct 16, 2017

A sum of 2,700,000 dict values shouldn't take more than a few dozen milliseconds, and it's done only once every few seconds. Weird... but a timing is a timing!

In any case, Bounter keeps a .total() tally for free, so this will be irrelevant.
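
For context, a sketch of that tally, assuming Bounter's published API:

```python
from bounter import bounter  # https://github.com/RaRe-Technologies/bounter

counts = bounter(size_mb=128)  # bounded-memory counter
counts.update(["graph", "graph", "trees"])
print(counts.total())  # 3: maintained incrementally, no re-summation of values
```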

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky Yes, using Bounter would be more elegant. To make the speed issue clearer, consider this code. Thanks.

```python
from time import time
from collections import defaultdict
from six import itervalues

TOTAL_WORDS = 2500000000  # ~2.5 billion word occurrences in the corpus
PROGRESS_PER = 10000      # default progress_per

# cost of re-summing the whole vocab at every progress report,
# extrapolated over TOTAL_WORDS / PROGRESS_PER = 250,000 reports
g = defaultdict(int, dict(enumerate(range(2700000))))
stime = time()
sum(itervalues(g))
ftime = time()
print(((ftime - stime) / 60) * (TOTAL_WORDS / PROGRESS_PER))  # projected minutes

# cost of a single counter increment, extrapolated the same way
counter = 0
stime = time()
counter += 1
ftime = time()
print(((ftime - stime) / 60) * (TOTAL_WORDS / PROGRESS_PER))  # projected minutes
```

@@ -329,7 +333,8 @@ def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_h
     return neu1e


-def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True, compute_loss=False,
+def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True,
@menshikh-iv (Contributor):

Please use hanging indents (instead of vertical), here and everywhere.

@jodevak (Contributor, Author):

"When using a hanging indent the following should be considered; there should be no arguments on the first line and further indentation should be used to clearly distinguish itself as a continuation line."

Is this what you need? If yes, do you recommend any tool other than autopep8 that would auto-format the file?

@menshikh-iv (Contributor):

Sorry, wrong line.
Vertical indents are OK for function/method definitions, that's all. In other situations, in gensim, we use hanging indents.

Unfortunately, I can't recommend any tool for this (we have no tool to check the condition yet); it has to be done manually.
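
For illustration, a minimal sketch of the two styles in question (examples made up, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


# vertical indent: arguments aligned under the opening parenthesis;
# in gensim this is accepted for function/method definitions only
def train_cbow_pair(model, word, input_word_indices, l1, alpha,
                    learn_vectors=True, learn_hidden=True):
    pass


# hanging indent: no arguments on the first line, continuation lines
# indented one level deeper; preferred everywhere else, e.g. in calls
logger.info(
    "PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
    1, 100, 10
)
```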

@jodevak (Contributor, Author) commented Oct 17, 2017

@menshikh-iv, is the indentation acceptable now? If yes, I think only a test for build_vocab_from_freq is left to add. Many thanks.

@menshikh-iv (Contributor) commented Oct 17, 2017

@jodevak yes, please add the needed test and that's all 👍

@menshikh-iv (Contributor):

Congrats on your first contribution, @jodevak 🥇

@menshikh-iv menshikh-iv merged commit e92b45d into piskvorky:develop Oct 19, 2017
@jodevak (Contributor, Author) commented Oct 19, 2017

@menshikh-iv Thanks 👍

@piskvorky (Owner) left a comment:

Already merged, but some changes needed.

for word in sentence:
    vocab[word] += 1
    total_words += 1
@piskvorky (Owner) commented Oct 19, 2017:

This is not a good idea; it may be (unnecessarily) slow. Why not add the entire len(sentence) at once?

@jodevak (Contributor, Author):

Hmm, although it won't noticeably affect the speed, yes, it should be incremented all at once 👍
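
For reference, a toy sketch of the suggested change to the counting loop:

```python
from collections import defaultdict

sentences = [["graph", "trees"], ["graph"]]  # toy corpus

vocab, total_words = defaultdict(int), 0
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1
    total_words += len(sentence)  # one addition per sentence, not one per word
```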


 if self.max_vocab_size and len(vocab) > self.max_vocab_size:
-    total_words += utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
+    utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
@piskvorky (Owner):

I don't see any tests for this change to pruning; seems risky. Does it really work?

@jodevak (Contributor, Author):

Hmm, do you really think it needs a new test? prune_vocab has not been touched, only the counter.

@piskvorky (Owner) commented Oct 24, 2017:

Yes, definitely. You changed the semantics of how total_words works; for example, the return value of utils.prune_vocab is ignored now.

It may be correct, but is not obvious to me and deserves an explicit check.
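
A self-contained sketch of the invariant such a check could pin down: total_words must equal the raw number of tokens scanned, no matter how many entries pruning discards (the prune stand-in below is hypothetical, not utils.prune_vocab):

```python
from collections import defaultdict

def scan(sentences, max_vocab_size=2):
    # mirrors the PR's counting: total_words tracks every token seen,
    # independent of how many vocab entries get pruned
    vocab, total_words = defaultdict(int), 0
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        total_words += len(sentence)
        if len(vocab) > max_vocab_size:
            vocab.pop(min(vocab, key=vocab.get))  # crude stand-in for pruning
    return vocab, total_words

sentences = [["a", "b", "c"], ["a", "b"], ["a"]]
_, total = scan(sentences)
assert total == sum(len(s) for s in sentences)  # 6, pruning notwithstanding
```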


Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
@piskvorky (Owner):

Code style: PEP8. Also, this is an instance method (cannot be called without an object).
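
A sketch of what a PEP8-compliant, instance-bound version of the example might look like (update=True dropped, since a fresh model has no existing vocabulary to update):

```python
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(min_count=5)
>>> model.build_vocab_from_freq({"word1": 15, "word2": 20})
```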

--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
"""
logger.info("Processing provided word frequencies")
@piskvorky (Owner) commented Oct 19, 2017:

Be more concrete in the log: what was provided to what? (how many entries, total frequencies?) Logs at INFO level are important, we want to make them count.
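
For example, a more informative INFO line might look like this (wording hypothetical, assuming word_freq is the dict passed to build_vocab_from_freq):

```python
logger.info(
    "collected %i unique words from the provided frequency dict (%i total occurrences)",
    len(word_freq), sum(word_freq.values())
)
```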

>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
"""
logger.info("Processing provided word frequencies")
vocab = defaultdict(int, word_freq)
@piskvorky (Owner):

Won't this duplicate (double) the entire dictionary? Is it backward compatible in the sense that this refactoring won't consume much more memory?

@jodevak (Contributor, Author):

Duplicating the entire vocab? It's just assigning a ready raw-vocab (word-count) dictionary. Is there a part I'm not getting?

@piskvorky (Owner) commented Oct 24, 2017:

I don't think so. The defaultdict constructor will copy the entire contents of word_freq, which may be memory intensive for large vocabularies.
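
A quick demonstration that the defaultdict constructor copies the source mapping's entries rather than wrapping them:

```python
from collections import defaultdict

word_freq = {"graph": 10}
vocab = defaultdict(int, word_freq)

vocab["graph"] += 5
print(word_freq["graph"])  # still 10: vocab holds its own copy of the entries
```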

self.corpus_count = corpus_count if corpus_count else 0
self.raw_vocab = vocab

self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
@piskvorky (Owner):

This function could use some comments and invariants: what's the relationship between vocab vs raw_vocab vs word_freq?

@jodevak (Contributor, Author):

word_freq is the same as raw_vocab, and vocab is the same as word_freq, so yes, I think I should use different naming.

@menshikh-iv (Contributor):

Can you create a new PR and fix all the comments, @jodevak?

horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
Fix scan vocab speed issue, build vocab from provided word frequencies (#1599)

* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

* Fixing Indentation

* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

* Remove trailing white spaces

* Adding test

* fix spaces
@jodevak (Contributor, Author) commented Oct 29, 2017

@menshikh-iv sure

@menshikh-iv (Contributor):

Continued in #1695.
