
Building word2vec vocabulary fails #361

Closed
shkarupa-alex opened this issue Jun 19, 2015 · 11 comments

@shkarupa-alex

I'm trying to create a word2vec model from a 77 GB text corpus with 8 workers on a MacBook with 16 GB of RAM.

After 3 or 4 hours it prints in the console:

2015-06-19 01:22:54,870 : INFO : PROGRESS: at sentence #1133940000, processed 5642208802 words and 61886452 word types
Killed: 9

I tried reducing the corpus size to ~50 GB, but it dies anyway:

2015-06-19 15:31:56,104 : INFO : collected 58126692 word types from a corpus of 4508742931 words and 1052839646 sentences
Killed: 9

Only when I use a small corpus (~12 GB) does it work normally and proceed from building the dictionary to calculating vectors:

...
collected 7104901 word types from a corpus of 468066895 words and 33723072 sentences
total 1516862 word types after removing those with count<5
...

How can I determine what's wrong and where the error occurs?

@piskvorky
Owner

You're almost certainly running out of RAM. There's a code section in word2vec here that goes through your input and counts how many times each word appears in the whole corpus.

It then removes the not-so-frequent words. But if there are too many different words in total, it will run out of memory and never get to the removing stage.

A simple solution is to remove the infrequent words more often, not only at the end, the same way it's done in Phrases here. We lose exactness (some counts may be off), but that probably doesn't matter much in practice.

The code is very simple -- can you do the same thing for word2vec and open a pull request? Or maybe even abstract both into a single method in gensim.utils, to avoid code duplication...

Also related: PR #270 for approximate counting. CC @mfcabrera @gojomo
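
A minimal sketch of that prune-while-counting idea (not the actual gensim or Phrases code; `prune_at` and `min_reduce` are illustrative names). Counts of pruned words are lost, so the surviving totals become approximate:

```python
from collections import defaultdict

def scan_vocab_with_pruning(sentences, prune_at=10000000, min_reduce=1):
    """Count word frequencies, pruning rare words whenever the dict grows too large.

    Sketch only: `sentences` is any iterable of token lists; pruned counts are
    lost, so surviving counts can undercount the true frequencies.
    """
    vocab = defaultdict(int)
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        if len(vocab) > prune_at:
            # Drop all words seen at most `min_reduce` times, then raise the
            # threshold so each successive prune removes slightly more.
            for word in list(vocab):
                if vocab[word] <= min_reduce:
                    del vocab[word]
            min_reduce += 1
    return vocab
```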

@maxberggren

I have the same problem and would love for this to be implemented.

@gojomo
Collaborator

gojomo commented Jun 23, 2015

I see for @shkarupa-alex the failing 50GB corpus has 58 million unique 'words'. Also, the corpus that's 1/4th the size has less than 1/8th the unique words – a bit odd, since words are usually spread equally through a corpus. (Why does 3x the data introduce 7x the unique words?)

Maybe the data has a lot of 'noise' that's not really words – timestamps, UUIDs, etc? Since something's got to go, you may be able to remove it earlier, using domain knowledge of what's infrequent/less-meaningful in the data.

There are a few places to potentially lessen the peak RAM used during the build_vocab steps – the Vocab objects could be smaller, and the step that copies items surviving the min_count cull could perhaps shrink the source dict as it progresses. But even if some tweaks there get you over that hump, you'll need (vocab_size * dimensions * 4 bytes/float) of memory for each of the 'syn0' vectors-in-training and the 'syn1' (and/or 'syn1neg') hidden layer. (5 million 400-d words, with the matching syn1 layer, will use 16 GB of RAM, not even counting any of the other necessary structures.)
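
As a back-of-the-envelope check of that formula (a sketch, not gensim code):

```python
def w2v_matrix_bytes(vocab_size, vector_size, n_layers=2):
    """Rough size of the float32 syn0 + syn1 matrices only; ignores the
    vocabulary dict, Vocab objects and all other overhead."""
    return vocab_size * vector_size * 4 * n_layers

# 5 million 400-dimensional words, plus the matching syn1 layer:
print(w2v_matrix_bytes(5000000, 400) / 1e9)  # -> 16.0 (GB)
```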

@shkarupa-alex
Author

> Why does 3x the data introduce 7x the unique words?

The small corpus is Russian Wikipedia.
The big corpus is a search-engine query log.

@gojomo
Collaborator

gojomo commented Jun 23, 2015

Aha, I see – I'd misinterpreted each of your progressively smaller examples as being a truncation of the others. The dispersed and fragmentary nature of queries is a bit different from the longer sentences (with more context) that most published word2vec results seem to have used. But it's an interesting idea, and on the gensim list @pommedeterresautee pointed out a paper from Yahoo that's very relevant (maybe that was an inspiration?):

http://labs.yahoo.com/publication/querycategorizr-a-large-scale-semi-supervised-system-for-categorization-of-web-search-queries/

Note, though, that they describe their setup as: "Vector representations were trained for 60 million most frequent queries found in the search logs. Training was done using a machine with 256GB of RAM memory and 24 cores. Dimensionality of the embedding space was set to D = 300, while context neighborhood size was set to 5. Finally, we used 10 negative samples in each vector update."

And Yahoo has another post – http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio – where they describe moving to a multi-machine system specifically to handle very large word2vec vocabularies (100 million words). So there are hints that substantive query-log analysis may quickly require more than 16 GB / one machine...

@shkarupa-alex
Author

Got an idea.

So, I ran out of memory on a very large search-query log. The main reason is the words themselves: the log contains a lot of misspellings. Model training fails at the first stage, building the vocabulary.

But what if we used a probabilistic algorithm to count words, e.g. HyperLogLog? That way the model would have a smaller memory footprint during the first stage.

Please give your feedback.

@piskvorky
Owner

Pruning of the word2vec vocabulary has already been implemented; see the max_vocab_size parameter in the constructor.
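
For example (a usage sketch only; `sentences` is any iterable of token lists, and parameter names other than max_vocab_size may differ between gensim versions):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    size=300,                 # vector dimensionality ("vector_size" in newer gensim)
    min_count=5,              # final frequency cutoff
    max_vocab_size=40000000,  # prune the raw-count dict whenever it exceeds this
    workers=8,
)
```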

@gojomo
Collaborator

gojomo commented Sep 21, 2015

I have some concern that the current pruning-while-scanning may be excessively imprecise: it may eliminate some more-frequent words that just barely miss the escalating threshold at each prune, while retaining less-frequent words whose occurrences are front-loaded in the corpus.

Using a max_vocab_size much larger than your expected final size (after min_count trimming), but still small enough to fit in memory, will lessen this problem, as it will trigger fewer prune actions during the scan.

Also, see some prior-but-unfinished work on a Count-Min Sketch approximate count: #270.
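
For illustration, a minimal Count-Min Sketch counter (a sketch of the general technique, not the code in #270; the width and depth values are arbitrary):

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter; estimates can only over-count."""

    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, word):
        # One hashed column index per row.
        for i in range(self.depth):
            digest = hashlib.md5(("%d:%s" % (i, word)).encode("utf8")).hexdigest()
            yield i, int(digest, 16) % self.width

    def add(self, word, count=1):
        for row, col in self._cells(word):
            self.tables[row][col] += count

    def estimate(self, word):
        # Taking the minimum over rows bounds the error from hash collisions.
        return min(self.tables[row][col] for row, col in self._cells(word))
```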

@piskvorky
Owner

Yes, as the docs say, use a generous max_vocab_size: each ~10 million word types is only about 1 GB of RAM. If you're planning on actually using such large vocabularies in word2vec, it's a drop in the RAM bucket.

@iarroyof

I was wondering if there is an efficient method to load only the metadata (pointers, probably) of the word-vectors file. Even for smaller corpora, loading all vectors into RAM is very time-consuming. Currently I have the same problem reported in this issue.

@gojomo
Collaborator

gojomo commented May 18, 2017

@iarroyof - Is your issue hitting memory limits while scanning a corpus to discover its vocabulary (this issue), or loading a prebuilt vector set? If the latter, note there's an optional limit parameter to load_word2vec_format() for just loading some early subset of a Google word2vec.c-format file. (That's usually the most-frequent words, so those that would survive any discarding of less-frequent words.) Also, for gensim native .save() formats, if uncompressed, they can be re-.load()ed with the mmap='r' parameter, to memory-map the on-disk file into addressable space – so ranges of the word-vectors are only paged-in when accessed. Still, many applications of word-vectors require rapid random-access across all words, so any attempt to seek-on-disk per-word-access will be fatal to performance, and it's better to either (1) work with a smaller subset; or (2) buy/rent more RAM.
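
A short usage sketch of both options (file names and the limit value are illustrative):

```python
from gensim.models import KeyedVectors, Word2Vec

# 1) Load only the first 500k (most-frequent) vectors from a word2vec.c-format file.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True, limit=500000)

# 2) Re-load an uncompressed native gensim save with memory-mapping, so
#    vector ranges are paged in from disk only when accessed.
model = Word2Vec.load("my_model.gensim", mmap="r")
```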
