
Phrases multiprocessing #1141

Closed
AaronWang30 opened this issue Feb 11, 2017 · 14 comments

Comments

@AaronWang30

Is multiprocessing supported for Phrases? When I run word2vec with phrases, the speed is quite slow and I get log output like

2017-02-10 22:26:42,404: INFO: PROGRESS: at 0.23% examples, 156666 words/s, in_qsize 0, out_qsize 0

where in_qsize and out_qsize are always 0 or 1. Is that normal?

@piskvorky
Owner

piskvorky commented Mar 10, 2017

Phrases are quite slow now, because they're written in pure Python. As such, word2vec training consumes its input faster than phrases can supply it (likely reason for the empty queues), so the word2vec training is "starved for data" and slower than it could be.

Rewriting Phrases in Cython/C for performance is pretty easy and high on our TODO list; it's a much-needed feature.

If you want to help out with optimizing Phrases, @tmylk can direct you to the ticket / assist with PR.

@patcollis34

patcollis34 commented Nov 17, 2017

Hey, any update on this? I took a quick glance over the code and I see a couple of easy wins: replacing the defaultdict with a bounter, using cytoolz instead of functools and itertools, and cythonizing the scoring function. I imagine one of you has already thought about this and is working on something more C or "Cythony". @tmylk @piskvorky @AaronWang30
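For context, the scoring function in question implements the original Mikolov et al. phrase score; here's a minimal stdlib sketch of it (the function name and the counts are illustrative, not gensim's actual internals), showing the kind of tight numeric code that cythonizing would speed up:

```python
def original_score(count_a, count_b, count_ab, min_count, vocab_size):
    # score = (bigram_count - min_count) / count(a) / count(b) * vocab_size
    return (count_ab - min_count) / count_a / count_b * vocab_size

# e.g. "new york": 20 co-occurrences, unigram counts 50 and 30, 1000-word vocab
score = original_score(50, 30, 20, min_count=5, vocab_size=1000)
print(score)  # 10.0
```

A bigram is kept as a phrase when this score clears the configured threshold.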

@menshikh-iv
Contributor

menshikh-iv commented Nov 17, 2017

@patcollis34 unfortunately, there's no progress at the moment (our plans didn't pan out). We also have a more global issue (not only about phrases): #1654 (bounter integration in all "counting" places).
I've never heard of cytoolz, @piskvorky please have a look.
It would be very nice if you made a PR to fix this issue, @patcollis34, what do you think?

@piskvorky
Owner

piskvorky commented Nov 17, 2017

Cythonizing + bounter ought to be enough, and seems easy to do too (just a few lines of code).

I think this would be a great simple intro PR.

Only one question: do we keep support for "pure Python" = when a person doesn't have a compiler, can they still install and use gensim? Or do we require a compiler (and provide binary wheels where possible)?
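One way to keep both options open is the try/except import fallback gensim already uses elsewhere (e.g. for word2vec): try the compiled path, and fall back to pure Python so users without a compiler can still install and run. A sketch, where the `phrases_inner` module name and function are hypothetical:

```python
try:
    # hypothetical compiled hot loop, only present if a compiler was available
    from gensim.models.phrases_inner import analyze_sentence
except ImportError:
    def analyze_sentence(sentence, phrase_set, delimiter="_"):
        # slow pure-Python fallback with the same signature:
        # join adjacent tokens that form a known phrase
        out, i = [], 0
        while i < len(sentence):
            if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrase_set:
                out.append(sentence[i] + delimiter + sentence[i + 1])
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        return out

print(analyze_sentence(["new", "york", "is", "big"], {("new", "york")}))
# ['new_york', 'is', 'big']
```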

@menshikh-iv
Contributor

menshikh-iv commented Nov 17, 2017

@piskvorky in my view, we should support "pure Python" only if it works well (stable, fast enough); Phrases performance is OK (not very fast, but it's possible to run it on a big corpus).

Ideally, we'd have universal wheels for all platforms; then we could safely remove all the pure-Python fallbacks.

@mleue

mleue commented May 21, 2018

Hey. I'd also be interested in speeding up the phraser.

  1. I don't have much experience with Cython, but I took a quick look at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py and agree with the others: cythonizing the counter loop at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L413 (for building the phrases object) and the parser loop at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L535 (for actually using the phraser) should speed things up quite a bit, both when building the phrases and subsequently when using the phraser.
  2. Would additional threading make sense? Off the top of my head, I'm imagining the same kind of process as word2vec's implementation in gensim: we'd feed the sentences into a job queue, and threads would pop sentences off it and do their work. As for the results, I'm not sure whether working on a global, locked counter makes sense, or whether each thread should build up its own counter and merge them at the end.
  3. If I understand bounter correctly, it only helps with memory, not speed. Would that still be a desired change in the same PR?

I've checked the contribution guidelines, but any additional "handholding", in terms of moving this issue forward, would be appreciated.
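A minimal stdlib sketch of the per-thread-counter variant from point 2 above (names and shard sizes are illustrative; note that under CPython's GIL, plain threads won't speed up a pure-Python counting loop much, which is part of why parallelizing this is hard):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_shard(sentences, delimiter="_"):
    # each worker builds a private Counter over its shard of sentences,
    # so there is no shared lock on the hot path
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
        counts.update(delimiter.join(pair) for pair in zip(sentence, sentence[1:]))
    return counts

sentences = [["new", "york"], ["new", "york", "city"], ["big", "city"]]
shards = [sentences[:2], sentences[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_shard, shards))

total = sum(partials, Counter())  # merge the per-thread counters at the end
print(total["new_york"], total["city"])  # 2 2
```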

@piskvorky
Owner

piskvorky commented May 21, 2018

@mleue I think Cython is a great step. I wouldn't worry about parallelization (which is hard to do in Python), but see also #2048 .

Regarding Bounter: it helps with speed too (a little faster than dict). But yes, lower memory footprint is its main advantage.

But simply rewriting Phrases (the slow parts) in Cython sounds like an awesome first step. Easy wins first :)
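For reference, bounter is close to a drop-in replacement for the vocab dict. A hedged sketch that falls back to the stdlib Counter when bounter isn't installed (the `size_mb` value is illustrative):

```python
try:
    from bounter import bounter
    counts = bounter(size_mb=128)   # hard memory cap on the counts
except ImportError:
    from collections import Counter
    counts = Counter()              # same update/lookup interface, unbounded memory

counts.update(["new", "york", "new_york", "new", "york", "new_york"])
print(counts["new_york"])  # 2
```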

@patcollis34

patcollis34 commented May 21, 2018

The reason I gave up on this is that it's waiting on #1477, refactoring the base corpus structure, which would enable better/more interesting sampling methods across cores and solve the other bottleneck, .any2utf8. Do you agree with this, @piskvorky?

But the bounter/Cython stuff should be helpful too. @mleue

@mpickard-niu

Curious about the status of this. I have an application that would greatly benefit from a multithreaded Phrases.

@piskvorky
Owner

piskvorky commented Aug 27, 2021

Still open, and a welcome addition!

I cleaned up the Phrases module in Gensim 4.0.0, but that wasn't related to performance optimizations. The phrases module is still largely unoptimized, but at least it's clean pure Python code now.

Seeing @patcollis34's comment above: I wouldn't worry about or wait on #1477. Implementing Cython / bounter in Phrases would be beneficial on its own, assuming the resulting speed-up warrants the added complexity.

@patcollis34

I can't find docs on how to get permission to push to a branch, which would make this suggestion/diff more readable. But I'm thinking you could delete all the .get()s in the _learn_vocab for loop, have the function return only the joined words, and then rely on the stdlib Counter to do the counting and joining for you. Sorry for the noob screenshots. Thoughts? @piskvorky
[screenshots of the proposed diff]
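Since the screenshots don't come through here, a rough stdlib sketch of the idea as I read it (names are illustrative, not gensim's actual internals): the inner function only yields tokens and joined bigrams, and Counter, implemented in C, does all the counting, replacing the per-key dict .get() calls.

```python
from collections import Counter

def iter_tokens(sentences, delimiter="_"):
    for sentence in sentences:
        yield from sentence                      # unigrams
        for a, b in zip(sentence, sentence[1:]):
            yield a + delimiter + b              # joined bigrams

sentences = [["machine", "learning", "rocks"], ["machine", "learning"]]
vocab = Counter(iter_tokens(sentences))          # counting happens in C
print(vocab["machine_learning"])  # 2
```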

@piskvorky
Owner

Why, what would that give us?

@patcollis34

patcollis34 commented Aug 29, 2021

This makes it more map/"reducey" and pushes more of the work to the stdlib/C. Removing the two .get()s and +'s here makes this code a good bit faster: I was seeing roughly 4x speedups on the smaller corpus (~30k sentences) I was using. I forgot the single-token append (line 426) in the first version. IMO this also paves the way for someone to multiprocess it in an afternoon or so, and it might be possible to swap the appends for yields and make this a generator.
[screenshot of the revised loop]

@piskvorky
Owner

piskvorky commented Aug 29, 2021

Thanks, a 4x speed up sounds promising. Can you open a PR, to explore your ideas?

I looked up the PR where I cleaned up the code in Phrases for Gensim 4.0. That PR included some timings too: #2976 (comment). So let's try your PR on the same corpus (text8), for easy comparison.
