
Phrases multiprocessing #1141

Closed
AaronWang30 opened this issue Feb 11, 2017 · 14 comments

Comments

@AaronWang30

Is multiprocessing supported for Phrases? When I run word2vec with phrases, the speed is quite slow and I get log output like

2017-02-10 22:26:42,404: INFO: PROGRESS: at 0.23% examples, 156666 words/s, in_qsize 0, out_qsize 0

where in_qsize and out_qsize are always 0 or 1. Is that normal?

@piskvorky
Owner

piskvorky commented Mar 10, 2017

Phrases are quite slow now, because they're written in pure Python. As such, word2vec training consumes its input faster than phrases can supply it (likely reason for the empty queues), so the word2vec training is "starved for data" and slower than it could be.

Rewriting Phrases in Cython/C for performance is pretty easy and high on our TODO list; it's a much-needed feature.

If you want to help out with optimizing Phrases, @tmylk can direct you to the ticket / assist with PR.

@patcollis34

patcollis34 commented Nov 17, 2017

Hey, any update on this? I took a quick glance over the code and I see a couple of easy wins: replacing the defaultdict with a bounter, using cytoolz instead of functools and itertools, and cythonizing the scoring function. I imagine one of you has already thought about this and is working on something more C or "Cythony". @tmylk @piskvorky @AaronWang30
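For context, the scoring function in question implements the original Mikolov et al. phrase score; here's a minimal stdlib sketch of it (the function name and the counts are illustrative, not gensim's actual internals), showing the kind of tight numeric code that cythonizing would speed up:

```python
def original_score(count_a, count_b, count_ab, min_count, vocab_size):
    # score = (bigram_count - min_count) / count(a) / count(b) * vocab_size
    return (count_ab - min_count) / count_a / count_b * vocab_size

# e.g. "new york": 20 co-occurrences, unigram counts 50 and 30, 1000-word vocab
score = original_score(50, 30, 20, min_count=5, vocab_size=1000)
print(score)  # 10.0
```

A bigram is kept as a phrase when this score clears the configured threshold.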

@menshikh-iv
Contributor

menshikh-iv commented Nov 17, 2017

@patcollis34 unfortunately, there's no progress at the moment (our plans didn't pan out). We also have a more global issue (not only about phrases): #1654 (bounter integration in all "counting" places).
I've never heard of cytoolz, @piskvorky please have a look.
It would be very nice if you made a PR to fix this issue, @patcollis34, what do you think?

@piskvorky
Owner

piskvorky commented Nov 17, 2017

Cythonizing + bounter ought to be enough, and seems easy to do too (just a few lines of code).

I think this would be a great simple intro PR.

Only one question: do we keep support for "pure Python" = when a person doesn't have a compiler, can they still install and use gensim? Or do we require a compiler (and provide binary wheels where possible)?
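One way to keep both options open is the try/except import fallback gensim already uses elsewhere (e.g. for word2vec): try the compiled path, and fall back to pure Python so users without a compiler can still install and run. A sketch, where the `phrases_inner` module name and function are hypothetical:

```python
try:
    # hypothetical compiled hot loop, only present if a compiler was available
    from gensim.models.phrases_inner import analyze_sentence
except ImportError:
    def analyze_sentence(sentence, phrase_set, delimiter="_"):
        # slow pure-Python fallback with the same signature:
        # join adjacent tokens that form a known phrase
        out, i = [], 0
        while i < len(sentence):
            if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrase_set:
                out.append(sentence[i] + delimiter + sentence[i + 1])
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        return out

print(analyze_sentence(["new", "york", "is", "big"], {("new", "york")}))
# ['new_york', 'is', 'big']
```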

@menshikh-iv
Contributor

menshikh-iv commented Nov 17, 2017

@piskvorky in my view, we should support "pure Python" only if it works well (stable, fast enough); Phrases performance is OK (not very fast, but it's possible to run it on a big corpus).

Ideally, we'd have universal wheels for all platforms; then we could safely remove all the pure-Python fallbacks.

@mleue

mleue commented May 21, 2018

Hey. I'd also be interested in speeding up the phraser.

  1. I don't have much experience with Cython, but I took a quick look at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py and agree with the others: cythonizing the counter loop at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L413 (for building the phrases object) and the parser loop at https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/phrases.py#L535 (for actually using the phraser) should speed things up quite a bit, both when building the phrases and subsequently when using the phraser.
  2. Would additional threading make sense? Off the top of my head, I'm imagining the same kind of process as word2vec's implementation in gensim: we'd feed the sentences into a job queue, and threads would pop sentences off it and do their work. As for the results, I'm not sure whether working on a global, locked counter makes sense, or whether each thread should build up its own counter and merge them at the end.
  3. If I understand bounter correctly, it only helps with memory, not speed. Would that still be a desired change in the same PR?

I've checked the contribution guidelines, but any additional "handholding", in terms of moving this issue forward, would be appreciated.
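A minimal stdlib sketch of the per-thread-counter variant from point 2 above (names and shard sizes are illustrative; note that under CPython's GIL, plain threads won't speed up a pure-Python counting loop much, which is part of why parallelizing this is hard):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_shard(sentences, delimiter="_"):
    # each worker builds a private Counter over its shard of sentences,
    # so there is no shared lock on the hot path
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
        counts.update(delimiter.join(pair) for pair in zip(sentence, sentence[1:]))
    return counts

sentences = [["new", "york"], ["new", "york", "city"], ["big", "city"]]
shards = [sentences[:2], sentences[2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_shard, shards))

total = sum(partials, Counter())  # merge the per-thread counters at the end
print(total["new_york"], total["city"])  # 2 2
```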

@piskvorky
Owner

piskvorky commented May 21, 2018

@mleue I think Cython is a great step. I wouldn't worry about parallelization (which is hard to do in Python), but see also #2048 .

Regarding Bounter: it helps with speed too (a little faster than dict). But yes, lower memory footprint is its main advantage.

But simply rewriting Phrases (the slow parts) in Cython sounds like an awesome first step. Easy wins first :)
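For reference, bounter is close to a drop-in replacement for the vocab dict. A hedged sketch that falls back to the stdlib Counter when bounter isn't installed (the `size_mb` value is illustrative):

```python
try:
    from bounter import bounter
    counts = bounter(size_mb=128)   # hard memory cap on the counts
except ImportError:
    from collections import Counter
    counts = Counter()              # same update/lookup interface, unbounded memory

counts.update(["new", "york", "new_york", "new", "york", "new_york"])
print(counts["new_york"])  # 2
```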

@patcollis34

patcollis34 commented May 21, 2018

The reason I gave up on this is that it's waiting on #1477, refactoring the base corpus structure, which would enable better/more interesting sampling methods across cores and solve the other bottleneck, .any2utf8. Do you agree with this, @piskvorky?

But the bounter/Cython stuff should be helpful too. @mleue

@mpickard-niu

Curious about the status of this. I have an application that would greatly benefit from a multithreaded Phrases.

@piskvorky
Owner

piskvorky commented Aug 27, 2021

Still open, and a welcome addition!

I cleaned up the Phrases module in Gensim 4.0.0, but that wasn't related to performance optimizations. The phrases module is still largely unoptimized, but at least it's clean pure Python code now.

Seeing @patcollis34's comment above: I wouldn't worry about or wait on #1477. Implementing Cython / bounter in Phrases would be beneficial on its own, assuming the resulting speed-up warrants the added complexity.

@patcollis34

I can't find docs on how to get permission to push to a branch, which would make this suggestion/diff more readable. But I'm thinking you could delete all the .get()s in the _learn_vocab for loop, have the function return only the joined words, and then rely on the stdlib Counter to do the counting and joining for you. Sorry for the noob screenshots. Thoughts? @piskvorky
[screenshots of the proposed diff]
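Since the screenshots don't come through here, a rough stdlib sketch of the idea as I read it (names are illustrative, not gensim's actual internals): the inner function only yields tokens and joined bigrams, and Counter, implemented in C, does all the counting, replacing the per-key dict .get() calls.

```python
from collections import Counter

def iter_tokens(sentences, delimiter="_"):
    for sentence in sentences:
        yield from sentence                      # unigrams
        for a, b in zip(sentence, sentence[1:]):
            yield a + delimiter + b              # joined bigrams

sentences = [["machine", "learning", "rocks"], ["machine", "learning"]]
vocab = Counter(iter_tokens(sentences))          # counting happens in C
print(vocab["machine_learning"])  # 2
```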

@piskvorky
Owner

Why, what would that give us?

@patcollis34

patcollis34 commented Aug 29, 2021

This makes it more map/"reducey" and pushes more of the work to the stdlib/C. Removing the two .get()s and +'s here makes this code a good bit faster: I was seeing roughly 4x speedups on the smaller corpus (~30k sentences) I was using. I forgot the single-token append (line 426) in the first version. IMO this also paves the way for someone to multiprocess it in an afternoon or so, and it might be possible to swap the appends for yields and make this a generator.
[screenshot of the revised loop]

@piskvorky
Owner

piskvorky commented Aug 29, 2021

Thanks, a 4x speed up sounds promising. Can you open a PR, to explore your ideas?

I looked up the PR where I cleaned up the code in Phrases for Gensim 4.0. That PR included some timings too: #2976 (comment). So let's try your PR on the same corpus (text8), for easy comparison.
