[MRG] Documentation fixes #3307
Conversation
Codecov Report

@@             Coverage Diff              @@
##           develop    #3307      +/-   ##
===========================================
- Coverage    81.43%   79.53%     -1.91%
===========================================
  Files          122       68       -54
  Lines        21052    11781     -9271
===========================================
- Hits         17144     9370     -7774
+ Misses        3908     2411     -1497
===========================================

Continue to review full report at Codecov.
@gojomo the doc2vec runs finished. Unlike in the original notebook, the DBOW results no longer seem superior to DM. Both modes show reasonable results, so I changed the conclusions and wording in the notebook accordingly. Perhaps the difference was due to some bug in the earlier versions of Gensim (the notebook is quite old). The next step is to run the notebook front-to-back, then commit & merge. Please review and let me know if you'd like anything reworded / changed / fixed before the final run.
Looks good overall. My main suggestion would be to add a parallel demonstration of a third mode: pure DBOW (leaving the default `dbow_words=0`).

IIRC, the very idea of mixing DBOW with interleaved (& largely-analogous) skip-gram word-training traces back to this paper, in its innocent aside: "We also show a simple yet effective trick to improve Paragraph Vector. In particular, we observe that by jointly training word embeddings, as in the skip gram model, the quality of the paragraph vectors is improved."

But plain DBOW is also very fast, & very competitive on many tasks, if you just want full-doc vectors (without word-vectors). And the fact that the word-vectors still exist in that plain mode, but remain untrained randomly-initialized junk, is an occasional gotcha that it'd be useful to note. (My old IMDB-based Doc2Vec demo showed the nonsense results, when doing word-vector most-similars on a plain-DBOW model, to highlight this point.)

In particular, the reason your existing DBOW+W vs DM comparison shows a significant speed advantage for DM is entirely the extra word-training: with a corpus of M words, in one epoch, plain DBOW does M innermost-loop micro-predictions, each from the doc-vector alone, while the interleaved skip-gram word-training multiplies that work by roughly the window size.

It's likely that DBOW alone would train fastest, and do reasonably well on all the tasks that work only with articles. Even if not adding a "pure DBOW" trial, it'd be good to be clear that what you're trying is the special DBOW-plus-skip-gram introduced by the 'Document Embeddings' paper, not plain DBOW.

I notice you're using a non-default vector size. From my eyeballing of the paper's declared vector-sizes, it looks like they tried 100, 300, 1000, & 10000 – & reported the 10000-dimensional (!) doc-vectors as best-performing. 10k dimensions are probably overkill & impractical on your dev laptop... but their 10000 wasn't that much better than their 1000. You could probably manage that – especially if discarding more of the smaller articles (so there aren't 10M+ total doc-vectors to train).
You could also aim for 300 dimensions – very common in word-vector work, and one of the sizes they tested. If you go for 200, not matching anything the paper tried, I'd make a comment in the notebook noting that you're doing that for memory efficiency, & that it's still sufficient to demo the benefits, though projects with adequate resources may find it useful to try larger dimensionalities when copious training data is available.

Assuming your M1 CPU reports 8 cores, it's possible that a `workers` value around that count would give the best throughput.

On a corpus this large, a non-default (more aggressive) `sample` value could speed training & may even improve result quality.

Note also that the default `window` may serve just as well here as a larger one.

Finally, the paper seems to declare that they used HS mode, which you could match by specifying `hs=1, negative=0`.
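To make the suggestions above concrete, here is a hedged sketch of the three training modes under discussion, expressed as keyword-argument sets for `gensim.models.doc2vec.Doc2Vec`. The specific values (`vector_size=100`, `window=5`, `epochs=10`, `sample=1e-4`) are illustrative assumptions, not the notebook's final settings:

```python
# Sketch: the three Doc2Vec modes discussed above, as Doc2Vec(**kwargs) sets.
# All concrete values below are assumptions for illustration only.

COMMON = dict(
    vector_size=100,   # the paper tried 100/300/1000/10000; 100 is cheap to demo
    window=5,          # gensim's default; ignored by plain DBOW
    workers=8,         # roughly match physical core count, per the advice above
    epochs=10,
    sample=1e-4,       # more aggressive downsampling for a very large corpus
    hs=1, negative=0,  # hierarchical softmax, to match the paper's setup
)

MODES = {
    # Plain DBOW: fastest; its word-vectors stay untrained random junk,
    # so word-level most_similar() queries give nonsense (the gotcha above).
    "dbow_plain": dict(dm=0, dbow_words=0, **COMMON),
    # DBOW + interleaved skip-gram word-training: the paper's "trick";
    # roughly window-times more micro-predictions per epoch than plain DBOW.
    "dbow_words": dict(dm=0, dbow_words=1, **COMMON),
    # Distributed Memory: trains word-vectors as an inherent part of its method.
    "dm": dict(dm=1, **COMMON),
}

# Usage (requires gensim and a prepared corpus file):
# from gensim.models.doc2vec import Doc2Vec
# model = Doc2Vec(corpus_file="wiki.txt", **MODES["dbow_plain"])
```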
I didn't touch the pre-existing model settings. I assumed, given the notebook's explicit goal of "replicate the paper", that these were already set appropriately. So in addition to the poor language and poor code, the setup was also incorrect? That's sad. But at least it's a chance to clean up properly now, snowballing from the initial user report…
OK (ditto).
Ditto. Your suggestion makes good sense; I'll set the worker threads to 8 or 9 – I use my 10-core laptop for other things during the training anyway. Regarding all the following:
May I tempt you into committing a fix, @gojomo? Clearly you're more knowledgeable here, so you set the "correct" params and then I run the cells. Let me know, because otherwise I'll adapt what I can & merge – I don't want to stall the release, and this notebook isn't that vital. Although after the clean-up, maybe we could promote it to a tutorial, @mpenkov – give the notebook more visibility, link it from the gallery? Are the long running times a problem? BTW, I see one of the CI tests failed again; could you have a look please?
@gojomo I changed the settings as suggested (300 dims, more subsampling, default window etc.) & re-ran the notebook.

Good news: the training ran through without problems.

Bad news: the results seem significantly worse. They no longer match the paper, nor the accompanying text. I'm starting to see why the original notebook author chose the settings they did. It's probably the result of some laborious experimentation, left undocumented. Which, to be fair, matches the original paper too, because that was done in a similarly under-documented manner. So, a meta-match.

EDIT: And results for dim=100, window=8: https://github.com/RaRe-Technologies/gensim/blob/010a7ac745896e7b4607f1ff2b4507b835c3ddf9/docs/notebooks/doc2vec-wikipedia.ipynb

Quite different again, although more meaningful than the dim=300, window=5 results above, IMO.
It surprises me that a settings change like that would hurt the results this much. I can take a look & test/apply some other tweaks this week – but given the long cycles involved in full runs, even if it's just a couple hours of poking around, it'll likely be 4+ days of wall-clock time.
Thinking a bit more about it: that the 100d model "looks better" suggests to me the larger 300d model might've needed more training to make full use of its extra capacity.

So my vague hunch is that, to the extent a smaller model reaches good-looking vectors sooner, the bigger ones may simply be under-trained at the point where we stop.

To the extent the 100d model is going to be faster-to-train & smaller, and seems sufficient to demonstrate the vivid results, we might as well focus on that, with side commentary noting that when time/resources permit, even larger models can perform better on rigorous evaluations, as the paper's {300, 1000, 10000} figures imply.

(More generally, another reason for some drift in results, or other entries surging past the paper's, could easily be the emergence of newer acts in the later Wikipedia dumps we're using. So if matching-as-closely-as-possible were a top priority, using a dump of similar vintage could be considered. But I think using the latest dump makes for the better demonstration.)
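For intuition about the "more subsampling" knob discussed in this thread: the word2vec-style downsampling that Gensim's `sample` parameter controls keeps each occurrence of a word with a probability that shrinks as the word's corpus frequency grows. A small self-contained sketch of that retention formula (as published in the word2vec C implementation; the function name here is my own, not a Gensim API):

```python
import math

def keep_probability(word_freq_fraction: float, sample: float = 1e-3) -> float:
    """Probability that word2vec-style downsampling retains one occurrence
    of a word, per the word2vec.c formula:

        p = (sqrt(f / t) + 1) * (t / f)

    where f is the word's frequency as a fraction of the corpus and t is
    the `sample` threshold. Results above 1.0 mean "always kept"."""
    f, t = word_freq_fraction, sample
    p = (math.sqrt(f / t) + 1.0) * (t / f)
    return min(p, 1.0)

# A very frequent word (5% of the corpus) is mostly discarded under an
# aggressive threshold, while a rare word is always kept:
print(keep_probability(0.05, sample=1e-5))   # small fraction retained
print(keep_probability(1e-6, sample=1e-5))   # 1.0 (always kept)
```

Smaller `sample` values thus discard more of the most-frequent words, which is why they both speed up training and can shift result quality.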
Yes, I added a mention of "7 years later" into the notebook as one possible reason for the divergence of results. I'm re-training the models now.
Several DM runs have stalled (no progress, no CPU used). @gojomo is this a known bug, have you seen it before? Restarting usually helps; the issue appears in maybe 20% of runs. Only in DM-mode training, BTW – never in DBOW.
Never seen that. Suggestive of a queueing issue where either expected items are lost, or the cross-thread signaling has gone awry. I'd look in the console where the notebook-server/kernel is running to see if some sort of odd/fatal error happened in a worker thread preventing normal progress. |
I don't see anything out of the ordinary in the server console. The cell "stopped" between 1:08 and 1:30 AM; I woke up and restarted the kernel around 8 AM. This is weird. The issue now reproduces pretty much 100% of the time, so I haven't been able to finish the notebook even once in the last couple of days. I wonder if it's some HW / OS issue. I'll try on a different machine. I'll skip DM for now.
My first run of a version of your "100d" notebook also hit the DM-mode hang, at the end of the very first epoch, waiting forever for 4 of the threads to report they were finished. My environment is macOS 12.3 / Python 3.9 on a pre-M1 MacBook Pro – so it's not unique to the M1-compiled versions you're using. That this is now so quick to reproduce makes me suspect it's a regression. I still suspect worker threads are silently dying.
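One hedged way to act on that suspicion, independent of Gensim's internals: install a `threading.excepthook` (Python 3.8+) before starting training, so any exception that escapes a worker thread leaves a visible trace instead of a silent hang. The names below are illustrative, and the failing worker is a stand-in demo, not Gensim code:

```python
import threading

# Record any exception escaping a worker thread; by default such errors
# only print a traceback to stderr, which is easy to miss in a notebook.
thread_errors = []

def _record(args):
    # args carries .exc_type, .exc_value, .exc_traceback, and .thread
    thread_errors.append((args.thread.name, args.exc_type.__name__))
    print(f"worker {args.thread.name} died: "
          f"{args.exc_type.__name__}: {args.exc_value}")

threading.excepthook = _record

# Demo: a worker that dies mid-task is now recorded, not silent.
t = threading.Thread(target=lambda: 1 // 0, name="worker-0")
t.start()
t.join()
print(thread_errors)  # [('worker-0', 'ZeroDivisionError')]
```

If training then hangs with `thread_errors` still empty, the workers are alive but blocked (pointing at a queueing/signaling problem rather than a crash), which would narrow the search.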
I retrained on a Linux machine over Easter, several times. Everything worked fine there, no hang-ups. So I suspect it's a Mac issue. Because the problem appears both pre-M1 (you) and on M1 (me), it's likely related to the OS – maybe the macOS BLAS; I remember it has some quirks. Either way, I'm done here. @gojomo please review https://github.com/RaRe-Technologies/gensim/blob/d872c02849af37812991ed72d69c8ed5725d1563/docs/notebooks/doc2vec-wikipedia.ipynb and I'll merge, and we'll release the new Gensim.
Various pre-release fixes for https://github.com/RaRe-Technologies/gensim/milestone/5 (__getitem__ etc.) #3291