
Fix scan vocab speed issue, build vocab from provided word frequencies #1599

Merged: 10 commits merged into piskvorky:develop on Oct 19, 2017

Conversation

@jodevak (Contributor) commented Sep 25, 2017

This request has two parts:

1- There was a noticeable speed issue with scan_vocab, and it turned out to be the `sum(itervalues(vocab))` call: it iterates through the whole vocab once every `progress_per` sentences, which carries a high cost for a big vocab. With my modification it took only 45 minutes to iterate and build the whole vocab over 57 GB of production-ready word co-occurrences (window=1), versus 270 minutes with the old implementation.

2- Since building the vocab is a single-threaded operation, it would be very helpful to have a function that builds a word vocabulary from pre-computed word frequencies (the build_vocab_from_freq function). For example, one could use Spark to count the words in a distributed way and then pipe the word frequencies into gensim's word2vec, as in the sketch below.
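
For concreteness, a minimal usage sketch of the proposed API (the frequency values and model hyperparameters are made up for illustration):

```python
from gensim.models import Word2Vec

# word frequencies pre-counted elsewhere, e.g. by a distributed Spark job
word_freq = {"system": 120, "graph": 45, "trees": 30}

model = Word2Vec(size=100, min_count=5)  # hypothetical hyperparameters
model.build_vocab_from_freq(word_freq)   # skips the single-threaded corpus scan
```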

@menshikh-iv (Contributor) left a comment:

Thanks for your PR @jodevak, please make small fixes and I'll merge your PR.

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


@menshikh-iv (Contributor):

Too many blank lines

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
"""
@menshikh-iv (Contributor):

Please add documentation in numpy-style

"""
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.finalize_vocab(update=update) # build tables & arrays


def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=None, trim_rule=None, update=False):
@menshikh-iv (Contributor):

Add test for this method
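
For reference, a hypothetical sketch of such a test (not the one that eventually landed), assuming the gensim 3.x vocabulary API (`model.wv.vocab`):

```python
import unittest

from gensim.models import Word2Vec


class TestBuildVocabFromFreq(unittest.TestCase):
    def test_counts_preserved(self):
        freq = {"graph": 10, "system": 8, "trees": 6}
        model = Word2Vec(min_count=5)
        model.build_vocab_from_freq(freq)
        # every word meeting min_count should be in the vocabulary,
        # with its pre-counted frequency intact
        self.assertEqual(set(model.wv.vocab), set(freq))
        self.assertEqual(model.wv.vocab["graph"].count, 10)


if __name__ == "__main__":
    unittest.main()
```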

"PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
sentence_no, sum(itervalues(vocab)) + total_words, len(vocab)
)
logger.info("PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
@menshikh-iv (Contributor):

Please use hanging indents

@piskvorky (Owner) commented Oct 16, 2017

@jodevak I find 1) weird. Summing a bunch of values should be very fast, no matter the dictionary size.

What was your progress_per, how often did this summation occur (once every X seconds)?

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky
1- My progress_per was 10000, the vocab size was almost 2,700,000, and the total word co-occurrences were about two and a half billion. I haven't measured the time for this operation alone; I just compared the total run time of both implementations.

2- Summing the values requires iterating over all the dictionary values, in other words over all the stored word counts, which is definitely slower than incrementing a single counter.

@menshikh-iv
1- In progress 👍

jodevak added a commit: Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab
@piskvorky (Owner):
Seems the progress_per is too low; that's not its intended use case. What is the reason for this?

Btw, we'll be replacing all the counting stuff with Bounter, so this will be moot.

Only needs some code style fixes (vertical indent), otherwise LGTM 👍

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky I edited the comment; progress_per is 10000, which is the default value. I hope you give it a try on some randomly generated word co-occurrences. Anyway, thank you :)

@piskvorky (Owner) commented Oct 16, 2017

A sum of 2,700,000 dict values shouldn't take more than a few dozen milliseconds, and it's done only once every few seconds. Weird... but a timing is a timing!

In any case, Bounter keeps a .total() tally for free, so this will be irrelevant.
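
For context, a sketch of that tally, assuming Bounter's published API:

```python
from bounter import bounter  # https://github.com/RaRe-Technologies/bounter

counts = bounter(size_mb=128)  # bounded-memory counter
counts.update(["graph", "graph", "trees"])
print(counts.total())  # 3: maintained incrementally, no re-summation of values
```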

@jodevak (Contributor, Author) commented Oct 16, 2017

@piskvorky Yes, using Bounter would be more elegant. To make the speed issue clearer, consider this code. Thanks.

```python
from time import time
from collections import defaultdict
from six import itervalues

TOTAL_WORDS = 2500000000  # ~2.5 billion word occurrences in the corpus
PROGRESS_PER = 10000      # default progress_per

# cost of re-summing the whole vocab at every progress report,
# extrapolated over TOTAL_WORDS / PROGRESS_PER = 250,000 reports
g = defaultdict(int, dict(enumerate(range(2700000))))
stime = time()
sum(itervalues(g))
ftime = time()
print(((ftime - stime) / 60) * (TOTAL_WORDS / PROGRESS_PER))  # projected minutes

# cost of a single counter increment, extrapolated the same way
counter = 0
stime = time()
counter += 1
ftime = time()
print(((ftime - stime) / 60) * (TOTAL_WORDS / PROGRESS_PER))  # projected minutes
```

@@ -329,7 +333,8 @@ def train_sg_pair(model, word, context_index, alpha, learn_vectors=True, learn_h
     return neu1e


-def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True, compute_loss=False,
+def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True,
@menshikh-iv (Contributor):

Please use hanging indents (instead of vertical), here and everywhere.

@jodevak (Contributor, Author):

"When using a hanging indent the following should be considered; there should be no arguments on the first line and further indentation should be used to clearly distinguish itself as a continuation line."

Is this what you need? If yes, do you recommend any tool other than autopep8 that would auto-format the file?

@menshikh-iv (Contributor):

Sorry, wrong line.
Vertical indents are OK for function/method definitions, that's all. In other situations, in gensim, we use hanging indents.

Unfortunately, I can't recommend any tool for this (we have no tool to check the condition yet); it has to be done manually.
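
For illustration, a minimal sketch of the two styles in question (examples made up, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


# vertical indent: arguments aligned under the opening parenthesis;
# in gensim this is accepted for function/method definitions only
def train_cbow_pair(model, word, input_word_indices, l1, alpha,
                    learn_vectors=True, learn_hidden=True):
    pass


# hanging indent: no arguments on the first line, continuation lines
# indented one level deeper; preferred everywhere else, e.g. in calls
logger.info(
    "PROGRESS: at sentence #%i, processed %i words, keeping %i word types",
    1, 100, 10
)
```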

@jodevak (Contributor, Author) commented Oct 17, 2017

@menshikh-iv, is the indentation acceptable now? If yes, I think only a test for build_vocab_from_freq is left to add. Many thanks.

@menshikh-iv (Contributor) commented Oct 17, 2017

@jodevak yes, please add the needed test and that's all 👍

@menshikh-iv (Contributor):

Congrats on your first contribution, @jodevak 🥇

@menshikh-iv menshikh-iv merged commit e92b45d into piskvorky:develop Oct 19, 2017
@jodevak (Contributor, Author) commented Oct 19, 2017

@menshikh-iv Thanks 👍

@piskvorky (Owner) left a comment:

Already merged, but some changes needed.

for word in sentence:
    vocab[word] += 1
    total_words += 1
@piskvorky (Owner) commented Oct 19, 2017:

This is not a good idea; it may be (unnecessarily) slow. Why not add the entire len(sentence) at once?

@jodevak (Contributor, Author):

Hmm, although it won't noticeably affect the speed, yes, it should be incremented all at once 👍
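
For reference, a toy sketch of the suggested change to the counting loop:

```python
from collections import defaultdict

sentences = [["graph", "trees"], ["graph"]]  # toy corpus

vocab, total_words = defaultdict(int), 0
for sentence in sentences:
    for word in sentence:
        vocab[word] += 1
    total_words += len(sentence)  # one addition per sentence, not one per word
```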


 if self.max_vocab_size and len(vocab) > self.max_vocab_size:
-    total_words += utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
+    utils.prune_vocab(vocab, min_reduce, trim_rule=trim_rule)
@piskvorky (Owner):

I don't see any tests for this change to pruning; seems risky. Does it really work?

@jodevak (Contributor, Author):

Hmm, do you really think it needs a new test? prune_vocab has not been touched, only the counter.

@piskvorky (Owner) commented Oct 24, 2017:

Yes, definitely. You changed the semantics of how total_words works; for example, the return value of utils.prune_vocab is ignored now.

It may be correct, but is not obvious to me and deserves an explicit check.
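
A self-contained sketch of the invariant such a check could pin down: total_words must equal the raw number of tokens scanned, no matter how many entries pruning discards (the prune stand-in below is hypothetical, not utils.prune_vocab):

```python
from collections import defaultdict

def scan(sentences, max_vocab_size=2):
    # mirrors the PR's counting: total_words tracks every token seen,
    # independent of how many vocab entries get pruned
    vocab, total_words = defaultdict(int), 0
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        total_words += len(sentence)
        if len(vocab) > max_vocab_size:
            vocab.pop(min(vocab, key=vocab.get))  # crude stand-in for pruning
    return vocab, total_words

sentences = [["a", "b", "c"], ["a", "b"], ["a"]]
_, total = scan(sentences)
assert total == sum(len(s) for s in sentences)  # 6, pruning notwithstanding
```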


Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
@piskvorky (Owner):

Code style: PEP8. Also, this is an instance method (cannot be called without an object).
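
A sketch of what a PEP8-compliant, instance-bound version of the example might look like (update=True dropped, since a fresh model has no existing vocabulary to update):

```python
>>> from gensim.models import Word2Vec
>>> model = Word2Vec(min_count=5)
>>> model.build_vocab_from_freq({"word1": 15, "word2": 20})
```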

--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
"""
logger.info("Processing provided word frequencies")
@piskvorky (Owner) commented Oct 19, 2017:

Be more concrete in the log: what was provided to what? (how many entries, total frequencies?) Logs at INFO level are important, we want to make them count.
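
For example, a more informative INFO line might look like this (wording hypothetical, assuming word_freq is the dict passed to build_vocab_from_freq):

```python
logger.info(
    "collected %i unique words from the provided frequency dict (%i total occurrences)",
    len(word_freq), sum(word_freq.values())
)
```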

>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
"""
logger.info("Processing provided word frequencies")
vocab = defaultdict(int, word_freq)
@piskvorky (Owner):

Won't this duplicate (double) the entire dictionary? Is it backward compatible in the sense that this refactoring won't consume much more memory?

@jodevak (Contributor, Author):

Duplicating the entire vocab? It's just assigning a ready raw-vocab (word-count) dictionary. Is there a part I'm not getting?

@piskvorky (Owner) commented Oct 24, 2017:

I don't think so. The defaultdict constructor will copy the entire contents of word_freq, which may be memory intensive for large vocabularies.
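
A quick demonstration that the defaultdict constructor copies the source mapping's entries rather than wrapping them:

```python
from collections import defaultdict

word_freq = {"graph": 10}
vocab = defaultdict(int, word_freq)

vocab["graph"] += 5
print(word_freq["graph"])  # still 10: vocab holds its own copy of the entries
```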

self.corpus_count = corpus_count if corpus_count else 0
self.raw_vocab = vocab

self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
@piskvorky (Owner):

This function could use some comments and invariants: what's the relationship between vocab vs raw_vocab vs word_freq?

@jodevak (Contributor, Author):

word_freq is the same as raw_vocab, and vocab is the same as word_freq, so yes, I think I should use different naming.

@menshikh-iv (Contributor):

Can you create a new PR and fix all the comments, @jodevak?

horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
Fix scan vocab speed issue, build vocab from provided word frequencies (#1599)

* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

* Fixing Indentation

* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

* Remove trailing white spaces

* Adding test

* fix spaces
@jodevak (Contributor, Author) commented Oct 29, 2017

@menshikh-iv sure

@menshikh-iv (Contributor):

Continued in #1695.
