File-based fast training for Any2Vec models #2127

Merged
merged 133 commits, Sep 14, 2018
Changes shown below are from the first 96 commits.

Commits (133)
39a2c11
CythonLineSentence
Jul 9, 2018
20c22f7
fix
Jul 9, 2018
dd0e9ca
fix setup.py
Jul 9, 2018
6203c77
fixes
Jul 9, 2018
03bf799
some refactoring
Jul 9, 2018
660493f
remove printf
Jul 10, 2018
1aedfe8
compiled
Jul 10, 2018
9ff0bb1
second branch for pystreams
Jul 10, 2018
9e498b7
fix
Jul 10, 2018
1d4a2a8
learning rate decay in Cython + _do_train_epoch + _train_epoch_multis…
Jul 11, 2018
97bac7e
add train_epoch_sg function
Jul 11, 2018
4de3a84
call _train_epoch_multistream from train()
Jul 11, 2018
36d1412
add word2vec_inner.cpp
Jul 11, 2018
625025b
remove pragma from .cpp
Jul 11, 2018
8173da8
Merge branch 'develop' into feature/multistream-training
Jul 12, 2018
bd0a0e0
fix doc
Jul 12, 2018
63663fa
fix pip
Jul 12, 2018
2ee2405
add __reduce__ to CythonLineSentence for proper pickling
Jul 14, 2018
8f8e817
remove printf
Jul 14, 2018
ac28bbb
add 1 test for CythonLineSentence
Jul 14, 2018
942a12f
no vocab copying
Jul 18, 2018
2a44fbc
fixed
Jul 18, 2018
e4a8ba0
Revert "fixed"
Jul 19, 2018
394a417
Revert "no vocab copying"
Jul 19, 2018
9ab6b1b
remove input_streams, add corpus_file
Jul 24, 2018
5d2e2cf
fix
Jul 24, 2018
0489561
fix replacing input_streams -> corpus_file in Word2Vec class
Jul 24, 2018
901cad4
upd .cpp
Jul 26, 2018
c09035c
add C++11 compiler flags
Jul 26, 2018
1e3c314
pep8
Jul 26, 2018
d6755be
add link args too
Jul 26, 2018
cc4680c
upd FastLineSentence
Jul 26, 2018
9978f6b
fix signatures in doc2vec/fasttext + removed tests on multistream
Jul 26, 2018
35333dd
fix flake
Jul 26, 2018
86b91ac
clean up base_any2vec.py
Jul 26, 2018
fca6f50
fix
Jul 26, 2018
45ca084
fix CythonLineSentence ctor
Jul 26, 2018
16bb386
fix py3 type error
Jul 26, 2018
c83b96f
fix again
Jul 26, 2018
1a21b0b
try again
Jul 26, 2018
dd83a3e
new error
Jul 26, 2018
c72f0b6
fix test
Jul 27, 2018
74e51b3
add unordered_map wrapper
Jul 30, 2018
58fc112
upd
Jul 30, 2018
5e70184
fix cython compiling errors
Jul 30, 2018
9727782
upd word2vec_inner.cpp
Jul 30, 2018
d97ac0c
add some tests
Jul 31, 2018
b6d7bb3
more tests for corpus_file
Jul 31, 2018
0c1fc5f
fix docstrings
Jul 31, 2018
fd66e34
addressing comments
Aug 1, 2018
da9f3da
fix tests skipIf
Aug 1, 2018
81329d6
add persistence test
Aug 1, 2018
f2ba633
online learning tests
Aug 1, 2018
51cec43
fix save_as_line_sentence
Aug 1, 2018
a72ddf1
fix again
Aug 1, 2018
aba7682
address new comments
Aug 2, 2018
03d44b2
fix test
Aug 2, 2018
e4e8cb2
move multistream functions from word2vec_inner to word2vec_multistream
Aug 2, 2018
3e989de
fix tests
Aug 2, 2018
d8c5cdc
add .c file
Aug 3, 2018
2a42b85
fix test
Aug 3, 2018
002a60c
fix tests skipIf and setup.py
Aug 3, 2018
3850f49
fix mac os compatibility
Aug 3, 2018
c1e8a9b
add tutorial on w2v multistream
Aug 9, 2018
7b7195b
300% -> 200% in notebook
Aug 10, 2018
3a8a915
add MULTISTREAM_VERSION global constant
Aug 10, 2018
6beb96a
first move towards multistream FastText
Aug 10, 2018
a2eb5fc
move MULTISTREAM_VERSION
Aug 10, 2018
57f7b66
fix error
Aug 10, 2018
83ce7c2
fix CythonVocab
Aug 10, 2018
a3ede08
regenerated .c & .cpp files
Aug 10, 2018
d38463e
resolve ambiguate fast_sentence_* declarations
Aug 11, 2018
ec4c677
add test_training_multistream for fasttext
Aug 11, 2018
a5311d2
add skipif
Aug 11, 2018
f499d5b
add more tests
Aug 11, 2018
645499c
fix flake8
Aug 11, 2018
dc1b98d
add short example
Aug 12, 2018
b9564e9
upd jupyter notebook
Aug 13, 2018
eefdd65
fix docstrings in doc2vec
Aug 14, 2018
f669979
add d2v_train_epoch_dbow for from-file training
Aug 14, 2018
e80189f
add missing parts of from-file doc2vec
Aug 15, 2018
cf6b032
refactored a bit
Aug 15, 2018
87d8ea7
add total_corpus_count calculation in doc2vec
Aug 15, 2018
e2851b4
Merge branch 'develop' into feature/multistream-training
persiyanov Aug 15, 2018
1fdaa43
add tests for doc2vec file-based + rename MULTISTREAM -> CORPUSFILE e…
Aug 15, 2018
c2fa0d8
regenerated .c + .cpp files
Aug 15, 2018
5427416
add Word2VecConfig in order to remove repeating parts of code
Aug 15, 2018
7f7760b
make shared initialization
Aug 15, 2018
926fd5e
use init_config from word2vec_corpusfile
Aug 15, 2018
df47983
add FastTextConfig
Aug 15, 2018
0df7f6f
init_config -> init_w2v_config, init_ft_config
Aug 15, 2018
5fd1c99
regenerated .c & .cpp files
Aug 15, 2018
d9257be
using FastTextConfig in fasttext_corpusfile.pyx
Aug 15, 2018
67c572c
fix
Aug 15, 2018
8e82b9f
fix
Aug 15, 2018
db2a77f
fix next_random in w2v
Aug 15, 2018
a96bc6d
introduce Doc2VecConfig
Aug 16, 2018
3b4da64
fix init_d2v_config
Aug 16, 2018
53b967c
use Doc2VecConfig in doc2vec_corpusfile.pyx
Aug 16, 2018
f57d1cb
removed unused vars
Aug 16, 2018
b652afe
fix docstrings
Aug 16, 2018
260cfb5
fix more docstrings
Aug 16, 2018
a433018
test old model for doc2vec & fasttext
Aug 16, 2018
20ec49b
fix loading old models
Aug 16, 2018
1ced17d
fix fasttext model checking
Aug 16, 2018
0731449
merge fast_line_sentence.cpp and fast_line_sentence.h
Aug 16, 2018
35f0ab4
fix word2vec test
Aug 16, 2018
49905f0
fix syntax error
Aug 16, 2018
95c6ec9
remove redundanta seekg call
Aug 16, 2018
aed2b6b
fix example notebook
Aug 16, 2018
c1af621
add initial doc_tags computation
Aug 16, 2018
33bf97a
fix test
Aug 16, 2018
e592b6a
fix test for windows
Aug 17, 2018
d08e4c1
add one more test on offsets
Aug 17, 2018
468a000
get rid of subword_arrays in fasttext
Aug 17, 2018
f71e1f8
make hanging indents everywhere
Aug 17, 2018
811388b
open file in byte mode
Aug 18, 2018
ddd5901
fix pep
Aug 18, 2018
a3490c7
fix tests
Aug 18, 2018
a28ff0d
fix again
Aug 18, 2018
b2996f0
final fix?
Aug 18, 2018
64bb617
regenerated .c & .cpp files
Aug 18, 2018
816f63f
fix test_persistence_fromfile for FastText
Aug 18, 2018
abad1b8
add fasttext & doc2vec to notebook
Aug 20, 2018
0b03839
add short examples
Aug 20, 2018
6217c73
update file-based tutorial notebook
piskvorky Aug 23, 2018
f70d159
work credit + minor nb fixes
piskvorky Aug 25, 2018
9593d5f
remove FIXMEs from file-based *2vec notebook
piskvorky Sep 9, 2018
7b714b2
remove warnings in corpus_file mode
persiyanov Sep 9, 2018
b833f0f
fix deprecation warning
menshikh-iv Sep 12, 2018
bcc0fb9
regenerate .ipynb
persiyanov Sep 14, 2018
384e0b1
upd plot
persiyanov Sep 14, 2018
527266f
upd plot
persiyanov Sep 14, 2018
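The commit log above (e.g. "add one more test on offsets") hints at the core idea of the file-based mode: each worker seeks to its own byte offset in `corpus_file` and reads sentences independently, instead of pulling jobs from a single Python producer thread. Below is a minimal Python sketch of computing such per-worker offsets, assuming boundaries are snapped forward to line starts; the actual implementation in this PR is in Cython/C++ and differs in detail.

```python
import os

def worker_offsets(path, workers):
    # Split a LineSentence-format file into roughly equal byte ranges,
    # snapping each boundary forward to the next newline so that every
    # worker starts reading at the beginning of a sentence.
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, workers):
            f.seek(size * i // workers)  # approximate boundary
            f.readline()                 # skip to the next line start
            offsets.append(min(f.tell(), size))
    return offsets
```

Each worker would then open the file, seek to `offsets[i]`, and read until `offsets[i + 1]`, which removes the single-producer bottleneck that limits the `sentences` code path.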
235 changes: 235 additions & 0 deletions docs/notebooks/Word2Vec_Multistream.ipynb
@@ -0,0 +1,235 @@
{
Contributor:
Don't forget to add doc2vec and fasttext here too (to both parts)

  • short example
  • wiki bench

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word2Vec Multistream API Tutorial\n",
Contributor:
Eliminate multistream here too

Author:
done

"\n",
"This tutorial introduces the new **`corpus_file`** argument for the **`gensim.models.word2vec.Word2Vec`** model and shows how to use it.\n",
"\n",
"## Motivation\n",
"\n",
"Standard Word2Vec training with the `sentences` argument doesn't scale well when the number of workers is large, so the special `corpus_file` argument was added to tackle this problem. Training with `corpus_file` yields a **significant performance boost**: with 32 workers, training is about 3.7x faster than with the `sentences` argument. It also outperforms the original word2vec C tool in words/sec processing speed.\n",
"\n",
"The `corpus_file` argument accepts a path to your corpus file, which must be in the `gensim.models.word2vec.LineSentence` format (one sentence per line, with words separated by whitespace).\n",
"\n",
"\n",
"**Note**: you need `gensim` built with Cython optimizations (`gensim.models.word2vec.MULTISTREAM_VERSION >= 0`) in order to use the `corpus_file` argument."
Contributor:
fix constant

Author:
done

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In this tutorial\n",
"\n",
"* We show how to use the new API.\n",
"* We compare the performance of `corpus_file` vs. `sentences` arguments on English Wikipedia.\n",
"* We show that accuracies on `question-words.txt` are almost the same for both modes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage is really simple\n",
"\n",
"You only need:\n",
"\n",
"1. Save your corpus in LineSentence format (you may use `gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)` to save your corpus).\n",
"2. Change `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in `Word2Vec.__init__`, `Word2Vec.build_vocab`, `Word2Vec.train` calls.\n",
Contributor:
A little code example (with a small corpus, like text8) would be a good idea here, something like:

corpus = api.load("text8")

save_as_line_sentence(corpus, "my_corpus.txt")

model = Word2Vec(corpus_file="my_corpus.txt", ...)

Author:
For what reason? I'm already showing a usage example, no?

Contributor:
It's good form: as a user, I open the notebook and immediately see how to use it (small, simple code); if I need something else, I'll scroll down.

Author:
👍 agree

"\n",
"\n",
"Short example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gensim.downloader as api\n",
"from gensim.utils import save_as_line_sentence\n",
"from gensim.models.word2vec import Word2Vec\n",
"\n",
"corpus = api.load(\"text8\")\n",
"\n",
"save_as_line_sentence(corpus, \"my_corpus.txt\")\n",
"\n",
"model = Word2Vec(corpus_file=\"my_corpus.txt\", iter=5, size=300, workers=14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's prepare the Wikipedia dataset\n",
"\n",
"We load the Wikipedia dump from `gensim-data`, preprocess it with gensim functions, and save the processed corpus in LineSentence format."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"CORPUS_FILE = 'wiki-en-20171001.txt'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"from gensim.parsing.preprocessing import preprocess_string\n",
"\n",
"def processed_corpus():\n",
" raw_corpus = api.load('wiki-english-20171001')\n",
" for article in raw_corpus:\n",
" doc = '\\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))\n",
" yield preprocess_string(doc) \n",
"\n",
"save_as_line_sentence(processed_corpus(), CORPUS_FILE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train Word2Vec\n",
"\n",
"We train two models:\n",
"* With `sentences` argument\n",
"* With `corpus_file` argument\n",
"\n",
"\n",
"Then, we compare timings and accuracy on `question-words.txt`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models.word2vec import LineSentence\n",
"import time\n",
"\n",
"st_time = time.time()\n",
"model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)\n",
"model_sent_training_time = time.time() - st_time\n",
"\n",
"st_time = time.time()\n",
"model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)\n",
"model_corp_file_training_time = time.time() - st_time"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training model with `sentences` took 8711.613 seconds\n",
"Training model with `corpus_file` took 2367.976 seconds\n"
]
}
],
"source": [
"print(\"Training model with `sentences` took {:.3f} seconds\".format(model_sent_training_time))\n",
"print(\"Training model with `corpus_file` took {:.3f} seconds\".format(model_corp_file_training_time))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Training with `corpus_file` took 3.7x less time!\n",
"\n",
"#### Now, let's compare the accuracies."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from gensim.test.utils import datapath"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
" if np.issubdtype(vec.dtype, np.int):\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Word analogy accuracy with `sentences`: 0.754\n",
"Word analogy accuracy with `corpus_file`: 0.744\n"
]
}
],
"source": [
"model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `sentences`: {:.3f}\".format(model_sent_accuracy))\n",
"\n",
"model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `corpus_file`: {:.3f}\".format(model_corp_file_accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Accuracies are approximately the same."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
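The notebook's usage steps hinge on the LineSentence file layout (one sentence per line, tokens separated by whitespace). As a plain-Python illustration of that layout — not gensim's actual `save_as_line_sentence` implementation, which handles encoding and edge cases — writing and reading such a file can be sketched like this:

```python
import os
import tempfile

def save_line_sentence(corpus, path):
    # One sentence per line; tokens joined by single spaces.
    with open(path, "w", encoding="utf-8") as fout:
        for sentence in corpus:
            fout.write(" ".join(sentence) + "\n")

def read_line_sentence(path):
    # Inverse operation: stream sentences back as token lists.
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            yield line.split()

corpus = [["hello", "world"], ["file", "based", "training"]]
path = os.path.join(tempfile.gettempdir(), "my_corpus.txt")
save_line_sentence(corpus, path)
assert list(read_line_sentence(path)) == corpus
```

A file produced this way is exactly what the `corpus_file` argument expects to receive as its path.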