File-based fast training for Any2Vec models #2127
Changes from 96 commits
@@ -0,0 +1,235 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word2Vec Multistream API Tutorial\n",

> **Review:** Eliminate "multistream" here too. — **Author:** done

"\n",
"This tutorial introduces the new **`corpus_file`** argument of the **`gensim.models.word2vec.Word2Vec`** model and shows how to use it.\n",
"\n",
"## Motivation\n",
"\n",
"Standard Word2Vec training with the `sentences` argument doesn't scale well when the number of workers is large, so the special `corpus_file` argument was added to tackle this problem. Training with `corpus_file` yields a **significant performance boost**: with 32 workers, training is 370% faster than with the `sentences` argument. It also beats the original Word2Vec C tool in words/sec processing speed.\n",
"\n",
"In exchange for these performance benefits, the `corpus_file` argument requires the path to a corpus file in `gensim.models.word2vec.LineSentence` format (one sentence per line, words separated by whitespace).\n",
"\n",
"\n",
"**Note**: you have to build `gensim` with Cython optimizations (`gensim.models.word2vec.MULTISTREAM_VERSION >= 0`) in order to use the `corpus_file` argument."

> **Review:** fix constant — **Author:** done

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In this tutorial\n",
"\n",
"* We will show how to use the new API.\n",
"* We will compare the performance of the `corpus_file` and `sentences` arguments on English Wikipedia.\n",
"* We will show that accuracies on `questions-words.txt` are almost the same for both modes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage is really simple\n",
"\n",
"You only need to:\n",
"\n",
"1. Save your corpus in LineSentence format (you may use `gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)` to do so).\n",
"2. Change the `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in the `Word2Vec.__init__`, `Word2Vec.build_vocab`, and `Word2Vec.train` calls.\n",

> **Review:** Add a little code example (with a small corpus), like:
> `corpus = api.load("text8")`
> `save_as_line_sentence(corpus, "my_corpus.txt")`
> `model = Word2Vec(corpus_file="my_corpus.txt", ...)`
> **Author:** For what reason? I'm already showing the example of usage, no?
> **Review:** This is a good tone: as a user, I open the notebook and immediately see how to use it (small & simple code); if I need something else, I'll scroll down.
> **Author:** 👍 agree

"\n",
"\n",
"Short example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gensim.downloader as api\n",
"from gensim.utils import save_as_line_sentence\n",
"from gensim.models.word2vec import Word2Vec\n",
"\n",
"corpus = api.load(\"text8\")\n",
"\n",
"save_as_line_sentence(corpus, \"my_corpus.txt\")\n",
"\n",
"model = Word2Vec(corpus_file=\"my_corpus.txt\", iter=5, size=300, workers=14)"
]
},
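An aside on the file format: the LineSentence layout that `save_as_line_sentence` writes (and `corpus_file` reads) is plain text, one sentence per line, tokens separated by spaces. A minimal pure-Python sketch of that layout (for illustration only; this is not gensim's actual implementation, and `my_corpus.txt` is just a scratch file):

```python
# Write a toy corpus in LineSentence format: one sentence per line,
# tokens separated by single spaces -- the layout `corpus_file` expects.
corpus = [["hello", "world"], ["another", "short", "sentence"]]

with open("my_corpus.txt", "w", encoding="utf-8") as fout:
    for sentence in corpus:
        fout.write(" ".join(sentence) + "\n")

# Read the file back to show the on-disk layout.
with open("my_corpus.txt", encoding="utf-8") as fin:
    lines = fin.read().splitlines()

print(lines)  # ['hello world', 'another short sentence']
```

Any iterable of token lists can be serialized this way before handing the resulting path to `corpus_file`.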
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's prepare the Wikipedia dataset\n",
"\n",
"We load the Wikipedia dump from `gensim-data`, preprocess it with gensim functions, and save the processed corpus in LineSentence format."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"CORPUS_FILE = 'wiki-en-20171001.txt'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"from gensim.parsing.preprocessing import preprocess_string\n",
"\n",
"def processed_corpus():\n",
"    raw_corpus = api.load('wiki-english-20171001')\n",
"    for article in raw_corpus:\n",
"        # Interleave section titles with section texts into one document.\n",
"        doc = '\\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))\n",
"        yield preprocess_string(doc)\n",
"\n",
"save_as_line_sentence(processed_corpus(), CORPUS_FILE)"
]
},
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Train Word2Vec\n", | ||
"\n", | ||
"We train two models:\n", | ||
"* With `sentences` argument\n", | ||
"* With `corpus_file` argument\n", | ||
"\n", | ||
"\n", | ||
"Then, we compare timings and accuracy on `question-words.txt`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from gensim.models.word2vec import LineSentence\n", | ||
"import time\n", | ||
"\n", | ||
"st_time = time.time()\n", | ||
"model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)\n", | ||
"model_sent_training_time = time.time() - st_time\n", | ||
"\n", | ||
"st_time = time.time()\n", | ||
"model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)\n", | ||
"model_corp_file_training_time = time.time() - st_time" | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training model with `sentences` took 8711.613 seconds\n",
"Training model with `corpus_file` took 2367.976 seconds\n"
]
}
],
"source": [
"print(\"Training model with `sentences` took {:.3f} seconds\".format(model_sent_training_time))\n",
"print(\"Training model with `corpus_file` took {:.3f} seconds\".format(model_corp_file_training_time))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Training with `corpus_file` took 3.7x less time!\n",
"\n",
"#### Now, let's compare the accuracies."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from gensim.test.utils import datapath"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
"  if np.issubdtype(vec.dtype, np.int):\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Word analogy accuracy with `sentences`: 0.754\n",
"Word analogy accuracy with `corpus_file`: 0.744\n"
]
}
],
"source": [
"model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `sentences`: {:.3f}\".format(model_sent_accuracy))\n",
"\n",
"model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `corpus_file`: {:.3f}\".format(model_corp_file_accuracy))"
]
},
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Accuracies are approximately the same." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
> **Review:** Don't forget to add doc2vec and fasttext here too (to both parts).