[WIP] phrases multicore using joblib threading #1433

Closed
wants to merge 5 commits into from

Conversation

prakhar2b
Contributor

No description provided.

@piskvorky
Owner

piskvorky commented Jun 20, 2017

@prakhar2b did you talk to @menshikh-iv and @jayantj?

Unfortunately this is not what we want.

@menshikh-iv
Contributor

@piskvorky It's part of the GSoC proposal, label 1.4.

@piskvorky
Owner

piskvorky commented Jun 23, 2017

Yes, we want multicore, but joblib is not the right tool.

Joblib uses multiprocessing, and as I explained earlier, that is a bad choice of granularity when the operation to be done is as simple as incrementing a counter. The queueing/pickling/inter-process communication overhead will be enormous.

@jayantj
Contributor

jayantj commented Jun 23, 2017

I completely agree that multiprocessing is not a good solution because of the overhead and copying involved. We discussed trying a multi-threading approach instead (joblib seems to allow this, although the GIL will have to be dealt with). One idea was to use libcuckoo, since it seems to allow concurrent reads and writes.

@gojomo
Collaborator

gojomo commented Jun 23, 2017

I suspect multiprocessing might be a competitive approach in the particular case where each process can open its own reader into a disjoint range of the corpus – and thus the only IPC is tiny summary counts, not bulk ranges of text.

So it might only be a viable strategy where the corpus is large and the user is sophisticated enough to have already structured their corpus as an uncompressed file or a set of many smaller files.

@piskvorky
Owner

piskvorky commented Jun 24, 2017

Yes, that's the case where we create several counters independently and merge them at the end. I think that's the correct level of granularity for something as simple as incrementing a counter (but it requires the user to have multiple input streams, rather than one, to parallelize well).
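
A minimal sketch of that "independent counters, merged at the end" idea, for illustration only. It is not code from this PR or from gensim: the function names (`count_bigrams`, `merge_counts`, `count_corpus`) are hypothetical, the corpus is assumed to be already split into several whitespace-tokenized text files, and bigrams are stored as plain tuples rather than in whatever form `Phrases` uses internally. Each worker receives only a file path and sends back only its counts, so no bulk text crosses process boundaries.

```python
from collections import Counter
from multiprocessing import Pool


def count_bigrams(path):
    """Count words and adjacent word pairs in one text file (hypothetical helper)."""
    counts = Counter()
    # Each worker opens its own reader, so only the path is sent to it.
    with open(path, encoding="utf8") as reader:
        for line in reader:
            tokens = line.split()
            counts.update(tokens)                   # unigram counts
            counts.update(zip(tokens, tokens[1:]))  # bigram counts, keyed by tuple
    return counts


def merge_counts(partial_counts):
    """Sum an iterable of per-stream Counters into one Counter."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def count_corpus(paths, processes=None):
    """Count each file in its own worker process, then merge in the parent."""
    with Pool(processes) as pool:
        return merge_counts(pool.imap_unordered(count_bigrams, paths))
```

On platforms that spawn rather than fork worker processes, `count_corpus` should be called from under an `if __name__ == "__main__":` guard. With a single input stream there is nothing to hand to separate workers, which is the limitation noted above.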

@prakhar2b prakhar2b changed the title [WIP] phrases multiprocessing using joblib [WIP] phrases multicore using joblib threading Jun 26, 2017
@prakhar2b
Contributor Author

Closing this PR: parallelizing with joblib threading doesn't improve the performance of pure Python code, and the Phrases module has little to cythonize beyond static typing, which doesn't yield a worthwhile speedup.

Also, see this comment and this comment above.

For a fast counter, there is another PR in gensim, #1446; hopefully parallelization will be better suited there.
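
The point about threading and pure Python code can be checked with a rough sketch like the one below. It is an illustration under assumptions, not the code from this PR: it uses the standard library's `ThreadPoolExecutor` in place of joblib's threading backend so that it has no dependencies, and all function names are hypothetical. Both versions do the same counting work in pure Python; the `timed` helper lets you compare wall-clock time on your own data.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_tokens(sentences):
    """Pure-Python counting loop over a chunk of tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence:
            counts[token] += 1
    return counts


def count_serial(chunks):
    """Count every chunk one after another in the calling thread."""
    total = Counter()
    for chunk in chunks:
        total.update(count_tokens(chunk))
    return total


def count_threaded(chunks, workers=4):
    """Count chunks in worker threads; each thread fills its own Counter."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_tokens, chunks):
            total.update(partial)
    return total


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

On a standard CPython build with the GIL, only one thread executes Python bytecode at a time, so `timed(count_threaded, chunks)` would not be expected to beat `timed(count_serial, chunks)` by much, if at all.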

@prakhar2b prakhar2b closed this Jun 29, 2017
5 participants