File-based fast training for Any2Vec models #2127

Merged: 133 commits (branch feature/multistream-training into develop), Sep 14, 2018.
The diff below shows the changes from 20 of the 133 commits, together with the inline review comments.

Commits:
39a2c11  CythonLineSentence (Jul 9, 2018)
20c22f7  fix (Jul 9, 2018)
dd0e9ca  fix setup.py (Jul 9, 2018)
6203c77  fixes (Jul 9, 2018)
03bf799  some refactoring (Jul 9, 2018)
660493f  remove printf (Jul 10, 2018)
1aedfe8  compiled (Jul 10, 2018)
9ff0bb1  second branch for pystreams (Jul 10, 2018)
9e498b7  fix (Jul 10, 2018)
1d4a2a8  learning rate decay in Cython + _do_train_epoch + _train_epoch_multis… (Jul 11, 2018)
97bac7e  add train_epoch_sg function (Jul 11, 2018)
4de3a84  call _train_epoch_multistream from train() (Jul 11, 2018)
36d1412  add word2vec_inner.cpp (Jul 11, 2018)
625025b  remove pragma from .cpp (Jul 11, 2018)
8173da8  Merge branch 'develop' into feature/multistream-training (Jul 12, 2018)
bd0a0e0  fix doc (Jul 12, 2018)
63663fa  fix pip (Jul 12, 2018)
2ee2405  add __reduce__ to CythonLineSentence for proper pickling (Jul 14, 2018)
8f8e817  remove printf (Jul 14, 2018)
ac28bbb  add 1 test for CythonLineSentence (Jul 14, 2018)
942a12f  no vocab copying (Jul 18, 2018)
2a44fbc  fixed (Jul 18, 2018)
e4a8ba0  Revert "fixed" (Jul 19, 2018)
394a417  Revert "no vocab copying" (Jul 19, 2018)
9ab6b1b  remove input_streams, add corpus_file (Jul 24, 2018)
5d2e2cf  fix (Jul 24, 2018)
0489561  fix replacing input_streams -> corpus_file in Word2Vec class (Jul 24, 2018)
901cad4  upd .cpp (Jul 26, 2018)
c09035c  add C++11 compiler flags (Jul 26, 2018)
1e3c314  pep8 (Jul 26, 2018)
d6755be  add link args too (Jul 26, 2018)
cc4680c  upd FastLineSentence (Jul 26, 2018)
9978f6b  fix signatures in doc2vec/fasttext + removed tests on multistream (Jul 26, 2018)
35333dd  fix flake (Jul 26, 2018)
86b91ac  clean up base_any2vec.py (Jul 26, 2018)
fca6f50  fix (Jul 26, 2018)
45ca084  fix CythonLineSentence ctor (Jul 26, 2018)
16bb386  fix py3 type error (Jul 26, 2018)
c83b96f  fix again (Jul 26, 2018)
1a21b0b  try again (Jul 26, 2018)
dd83a3e  new error (Jul 26, 2018)
c72f0b6  fix test (Jul 27, 2018)
74e51b3  add unordered_map wrapper (Jul 30, 2018)
58fc112  upd (Jul 30, 2018)
5e70184  fix cython compiling errors (Jul 30, 2018)
9727782  upd word2vec_inner.cpp (Jul 30, 2018)
d97ac0c  add some tests (Jul 31, 2018)
b6d7bb3  more tests for corpus_file (Jul 31, 2018)
0c1fc5f  fix docstrings (Jul 31, 2018)
fd66e34  addressing comments (Aug 1, 2018)
da9f3da  fix tests skipIf (Aug 1, 2018)
81329d6  add persistence test (Aug 1, 2018)
f2ba633  online learning tests (Aug 1, 2018)
51cec43  fix save_as_line_sentence (Aug 1, 2018)
a72ddf1  fix again (Aug 1, 2018)
aba7682  address new comments (Aug 2, 2018)
03d44b2  fix test (Aug 2, 2018)
e4e8cb2  move multistream functions from word2vec_inner to word2vec_multistream (Aug 2, 2018)
3e989de  fix tests (Aug 2, 2018)
d8c5cdc  add .c file (Aug 3, 2018)
2a42b85  fix test (Aug 3, 2018)
002a60c  fix tests skipIf and setup.py (Aug 3, 2018)
3850f49  fix mac os compatibility (Aug 3, 2018)
c1e8a9b  add tutorial on w2v multistream (Aug 9, 2018)
7b7195b  300% -> 200% in notebook (Aug 10, 2018)
3a8a915  add MULTISTREAM_VERSION global constant (Aug 10, 2018)
6beb96a  first move towards multistream FastText (Aug 10, 2018)
a2eb5fc  move MULTISTREAM_VERSION (Aug 10, 2018)
57f7b66  fix error (Aug 10, 2018)
83ce7c2  fix CythonVocab (Aug 10, 2018)
a3ede08  regenerated .c & .cpp files (Aug 10, 2018)
d38463e  resolve ambiguate fast_sentence_* declarations (Aug 11, 2018)
ec4c677  add test_training_multistream for fasttext (Aug 11, 2018)
a5311d2  add skipif (Aug 11, 2018)
f499d5b  add more tests (Aug 11, 2018)
645499c  fix flake8 (Aug 11, 2018)
dc1b98d  add short example (Aug 12, 2018)
b9564e9  upd jupyter notebook (Aug 13, 2018)
eefdd65  fix docstrings in doc2vec (Aug 14, 2018)
f669979  add d2v_train_epoch_dbow for from-file training (Aug 14, 2018)
e80189f  add missing parts of from-file doc2vec (Aug 15, 2018)
cf6b032  refactored a bit (Aug 15, 2018)
87d8ea7  add total_corpus_count calculation in doc2vec (Aug 15, 2018)
e2851b4  Merge branch 'develop' into feature/multistream-training (persiyanov, Aug 15, 2018)
1fdaa43  add tests for doc2vec file-based + rename MULTISTREAM -> CORPUSFILE e… (Aug 15, 2018)
c2fa0d8  regenerated .c + .cpp files (Aug 15, 2018)
5427416  add Word2VecConfig in order to remove repeating parts of code (Aug 15, 2018)
7f7760b  make shared initialization (Aug 15, 2018)
926fd5e  use init_config from word2vec_corpusfile (Aug 15, 2018)
df47983  add FastTextConfig (Aug 15, 2018)
0df7f6f  init_config -> init_w2v_config, init_ft_config (Aug 15, 2018)
5fd1c99  regenerated .c & .cpp files (Aug 15, 2018)
d9257be  using FastTextConfig in fasttext_corpusfile.pyx (Aug 15, 2018)
67c572c  fix (Aug 15, 2018)
8e82b9f  fix (Aug 15, 2018)
db2a77f  fix next_random in w2v (Aug 15, 2018)
a96bc6d  introduce Doc2VecConfig (Aug 16, 2018)
3b4da64  fix init_d2v_config (Aug 16, 2018)
53b967c  use Doc2VecConfig in doc2vec_corpusfile.pyx (Aug 16, 2018)
f57d1cb  removed unused vars (Aug 16, 2018)
b652afe  fix docstrings (Aug 16, 2018)
260cfb5  fix more docstrings (Aug 16, 2018)
a433018  test old model for doc2vec & fasttext (Aug 16, 2018)
20ec49b  fix loading old models (Aug 16, 2018)
1ced17d  fix fasttext model checking (Aug 16, 2018)
0731449  merge fast_line_sentence.cpp and fast_line_sentence.h (Aug 16, 2018)
35f0ab4  fix word2vec test (Aug 16, 2018)
49905f0  fix syntax error (Aug 16, 2018)
95c6ec9  remove redundanta seekg call (Aug 16, 2018)
aed2b6b  fix example notebook (Aug 16, 2018)
c1af621  add initial doc_tags computation (Aug 16, 2018)
33bf97a  fix test (Aug 16, 2018)
e592b6a  fix test for windows (Aug 17, 2018)
d08e4c1  add one more test on offsets (Aug 17, 2018)
468a000  get rid of subword_arrays in fasttext (Aug 17, 2018)
f71e1f8  make hanging indents everywhere (Aug 17, 2018)
811388b  open file in byte mode (Aug 18, 2018)
ddd5901  fix pep (Aug 18, 2018)
a3490c7  fix tests (Aug 18, 2018)
a28ff0d  fix again (Aug 18, 2018)
b2996f0  final fix? (Aug 18, 2018)
64bb617  regenerated .c & .cpp files (Aug 18, 2018)
816f63f  fix test_persistence_fromfile for FastText (Aug 18, 2018)
abad1b8  add fasttext & doc2vec to notebook (Aug 20, 2018)
0b03839  add short examples (Aug 20, 2018)
6217c73  update file-based tutorial notebook (piskvorky, Aug 23, 2018)
f70d159  work credit + minor nb fixes (piskvorky, Aug 25, 2018)
9593d5f  remove FIXMEs from file-based *2vec notebook (piskvorky, Sep 9, 2018)
7b714b2  remove warnings in corpus_file mode (persiyanov, Sep 9, 2018)
b833f0f  fix deprecation warning (menshikh-iv, Sep 12, 2018)
bcc0fb9  regenerate .ipynb (persiyanov, Sep 14, 2018)
384e0b1  upd plot (persiyanov, Sep 14, 2018)
527266f  upd plot (persiyanov, Sep 14, 2018)
gensim/models/base_any2vec.py: 72 changes (57 additions, 15 deletions)

@@ -43,7 +43,6 @@
 from types import GeneratorType
 from gensim.utils import deprecated
 import warnings
-import itertools

 try:
     from queue import Queue

@@ -123,6 +122,9 @@ def _clear_post_train(self):
"""Resets certain properties of the model post training. eg. `keyedvectors.vectors_norm`."""
raise NotImplementedError()

def _do_train_epoch(self, input_stream, thread_private_mem, cur_epoch, total_examples=None, total_words=None):
raise NotImplementedError()

def _do_train_job(self, data_iterable, job_parameters, thread_private_mem):
"""Train a single batch. Return 2-tuple `(effective word count, total word count)`."""
raise NotImplementedError()
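(The new `_do_train_epoch` hook is the stream-based counterpart of `_do_train_job`: rather than consuming queued batches, each concrete model overrides it to train a full epoch on one input stream; the Word2Vec override in the last file of this diff dispatches to the Cython `train_epoch_sg`/`train_epoch_cbow` routines.)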
@@ -136,6 +138,16 @@ def _check_input_data_sanity(self, data_iterable=None, data_iterables=None):
         if not ((data_iterable is not None) ^ (data_iterables is not None)):
             raise ValueError("You must provide only one of singlestream or multistream arguments.")
+    def _worker_loop_multistream(self, input_stream, progress_queue, cur_epoch=0,
+                                 total_examples=None, total_words=None):
+        thread_private_mem = self._get_thread_working_mem()
+
+        examples, tally, raw_tally = self._do_train_epoch(input_stream, thread_private_mem, cur_epoch,
+                                                          total_examples=total_examples, total_words=total_words)

> Owner: Hanging indent please (here and everywhere else).

+
+        progress_queue.put((examples, tally, raw_tally))
+        progress_queue.put(None)

     def _worker_loop(self, job_queue, progress_queue):
         """Train the model, lifting batches of data from the queue.

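Each multistream worker trains one full epoch on its own stream, reports a single (examples, effective_words, raw_words) tuple, and then posts a None sentinel so the progress logger knows that worker has finished. A standalone sketch of that pattern (all names here are illustrative, with dummy counting in place of the Cython training routines):

import threading
from queue import Queue

def do_train_epoch(stream):
    # stand-in for the Cython train_epoch_sg / train_epoch_cbow routines:
    # count sentences and words instead of actually training
    sentences = list(stream)
    words = sum(len(s) for s in sentences)
    return len(sentences), words, words  # examples, effective words, raw words

def worker_loop(stream, progress_queue):
    examples, tally, raw_tally = do_train_epoch(stream)
    progress_queue.put((examples, tally, raw_tally))
    progress_queue.put(None)  # sentinel: this worker has finished its epoch

streams = [
    [["first", "sentence"]],
    [["second", "sentence"], ["third", "one"]],
]
progress = Queue()
workers = [threading.Thread(target=worker_loop, args=(s, progress)) for s in streams]
for t in workers:
    t.daemon = True
    t.start()

# consume reports until every worker has sent its sentinel,
# mirroring what _log_epoch_progress does without a job_queue
unfinished = len(workers)
while unfinished > 0:
    report = progress.get()
    if report is None:
        unfinished -= 1
    else:
        print("examples=%d, effective=%d, raw=%d" % report)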
@@ -258,8 +270,8 @@ def _log_epoch_end(self, cur_epoch, example_count, total_examples, raw_word_coun
     def _log_train_end(self, raw_word_count, trained_word_count, total_elapsed, job_tally):
         raise NotImplementedError()

-    def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_examples=None, total_words=None,
-                            report_delay=1.0):
+    def _log_epoch_progress(self, progress_queue=None, job_queue=None, cur_epoch=0, total_examples=None,
+                            total_words=None, report_delay=1.0):

> Contributor: why?
> persiyanov (Author): because in multistream mode there is no job_queue, so I made these arguments optional.

         """Get the progress report for a single training epoch.

         Parameters

@@ -328,8 +340,32 @@ def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_exam
         self.total_train_time += elapsed
         return trained_word_count, raw_word_count, job_tally
-    def _train_epoch(self, data_iterable=None, data_iterables=None, cur_epoch=0, total_examples=None,
-                     total_words=None, queue_factor=2, report_delay=1.0):
+    def _train_epoch_multistream(self, data_iterables, cur_epoch=0, total_examples=None, total_words=None):
+        assert len(data_iterables) == self.workers, "You have to pass the same amount of input streams as workers, " \
+            "because each worker gets its own independent input stream."

> piskvorky (Owner), Jul 14, 2018: assert is for checking programmer errors (code invariants), not user input. Exception better.

+
+        progress_queue = Queue()
+
+        workers = [
+            threading.Thread(
+                target=self._worker_loop_multistream,
+                args=(input_stream, progress_queue,),
+                kwargs={'cur_epoch': cur_epoch, 'total_examples': total_examples, 'total_words': total_words}
+            ) for input_stream in data_iterables
+        ]
+
+        for thread in workers:
+            thread.daemon = True
+            thread.start()
+
+        trained_word_count, raw_word_count, job_tally = self._log_epoch_progress(
+            progress_queue=progress_queue, job_queue=None, cur_epoch=cur_epoch, total_examples=total_examples,
+            total_words=total_words)
+
+        return trained_word_count, raw_word_count, job_tally
+
+    def _train_epoch(self, data_iterable, cur_epoch=0, total_examples=None, total_words=None,
+                     queue_factor=2, report_delay=1.0):
         """Train the model for a single epoch.

         Parameters
@@ -361,7 +397,6 @@ def _train_epoch(self, data_iterable=None, data_iterables=None, cur_epoch=0, tot
             * Total word count used in training.

         """
-        self._check_input_data_sanity(data_iterable, data_iterables)
         job_queue = Queue(maxsize=queue_factor * self.workers)
         progress_queue = Queue(maxsize=(queue_factor + 1) * self.workers)

@@ -372,9 +407,6 @@ def _train_epoch(self, data_iterable=None, data_iterables=None, cur_epoch=0, tot
             for _ in xrange(self.workers)
         ]

-        # Chain all input streams into one, because multistream training is not supported yet.
-        if data_iterables is not None:
-            data_iterable = itertools.chain(*data_iterables)
         workers.append(threading.Thread(
             target=self._job_producer,
             args=(data_iterable, job_queue),

@@ -444,10 +476,14 @@ def train(self, data_iterable=None, data_iterables=None, epochs=None, total_exam
         for callback in self.callbacks:
             callback.on_epoch_begin(self)

-        trained_word_count_epoch, raw_word_count_epoch, job_tally_epoch = self._train_epoch(
-            data_iterable=data_iterable, data_iterables=data_iterables, cur_epoch=cur_epoch,
-            total_examples=total_examples, total_words=total_words, queue_factor=queue_factor,
-            report_delay=report_delay)
+        if data_iterable is not None:
+            trained_word_count_epoch, raw_word_count_epoch, job_tally_epoch = self._train_epoch(
+                data_iterable, cur_epoch=cur_epoch, total_examples=total_examples,
+                total_words=total_words, queue_factor=queue_factor, report_delay=report_delay)
+        else:
+            trained_word_count_epoch, raw_word_count_epoch, job_tally_epoch = self._train_epoch_multistream(
+                data_iterables, cur_epoch=cur_epoch, total_examples=total_examples, total_words=total_words)
+
         trained_word_count += trained_word_count_epoch
         raw_word_count += raw_word_count_epoch
         job_tally += job_tally_epoch
@@ -550,6 +586,9 @@ def __init__(self, sentences=None, input_streams=None, workers=3, vector_size=10
             consider an iterable that streams the sentences directly from disk/network.
             See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
             or :class:`~gensim.models.word2vec.LineSentence` for such examples.
+        input_streams : list or tuple of iterable of iterables
+            The tuple or list of `sentences`-like arguments. Use it if you have multiple input streams. It is possible
+            to process streams in parallel, using `workers` parameter.
         workers : int, optional
             Number of working threads, used for multiprocessing.
         vector_size : int, optional
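At this stage of the PR, multistream training is driven by the input_streams parameter (later commits replace it with corpus_file, see commit 9ab6b1b). A minimal usage sketch, assuming the corpus has been pre-split into one LineSentence-format file per worker (the file names are hypothetical):

from gensim.models.word2vec import Word2Vec, LineSentence

# one stream per worker; _train_epoch_multistream above enforces len(input_streams) == workers
streams = [LineSentence('corpus_part1.txt'), LineSentence('corpus_part2.txt'), LineSentence('corpus_part3.txt')]
model = Word2Vec(input_streams=streams, workers=3, size=100, min_count=5)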
@@ -928,6 +967,9 @@ def train(self, sentences=None, input_streams=None, total_examples=None, total_w
             consider an iterable that streams the sentences directly from disk/network.
             See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
             or :class:`~gensim.models.word2vec.LineSentence` module for such examples.
+        input_streams : list or tuple of iterable of iterables
+            The tuple or list of `sentences`-like arguments. Use it if you have multiple input streams. It is possible
+            to process streams in parallel, using `workers` parameter.
         total_examples : int, optional
             Count of sentences.
         total_words : int, optional

@@ -1181,14 +1223,14 @@ def _log_progress(self, job_queue, progress_queue, cur_epoch, example_count, tot
             logger.info(
                 "EPOCH %i - PROGRESS: at %.2f%% examples, %.0f words/s, in_qsize %i, out_qsize %i",
                 cur_epoch + 1, 100.0 * example_count / total_examples, trained_word_count / elapsed,
-                utils.qsize(job_queue), utils.qsize(progress_queue)
+                None if job_queue is None else utils.qsize(job_queue), utils.qsize(progress_queue)
             )
         else:
             # words-based progress %
             logger.info(
                 "EPOCH %i - PROGRESS: at %.2f%% words, %.0f words/s, in_qsize %i, out_qsize %i",
                 cur_epoch + 1, 100.0 * raw_word_count / total_words, trained_word_count / elapsed,
-                utils.qsize(job_queue), utils.qsize(progress_queue)
+                None if job_queue is None else utils.qsize(job_queue), utils.qsize(progress_queue)
             )

     def _log_epoch_end(self, cur_epoch, example_count, total_examples, raw_word_count, total_words,
gensim/models/fast_line_sentence.cpp: 25 changes (25 additions, 0 deletions)

@@ -0,0 +1,25 @@
#include <stdexcept>

> Contributor: why is this needed?
> persiyanov (Author): removed

#include "fast_line_sentence.h"

> Contributor: what about gluing the .cpp and .h into one file?
> persiyanov (Author): done


FastLineSentence::FastLineSentence() : is_eof_(false) { }
FastLineSentence::FastLineSentence(const std::string& filename) : filename_(filename), fs_(filename), is_eof_(false) { }

std::vector<std::string> FastLineSentence::ReadSentence() {
    if (is_eof_) {
        return {};
    }
    std::string line, word;
    std::getline(fs_, line);
    std::vector<std::string> res;

    std::istringstream iss(line);
    while (iss >> word) {
        res.push_back(word);
    }

    if (fs_.eof()) {
        is_eof_ = true;
    }
    return res;
}
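For readers following along in Python terms, the C++ reader above is roughly equivalent to this sketch (a hypothetical class for illustration, not part of the PR): one whitespace-tokenized sentence per line, with an EOF flag so the caller knows when to stop or reset.

class PyLineSentence:
    """Hypothetical Python mirror of the C++ FastLineSentence, for illustration only."""

    def __init__(self, filename):
        self.filename = filename
        self.fs = open(filename, 'rb')
        self.is_eof = False

    def read_sentence(self):
        if self.is_eof:
            return []
        line = self.fs.readline()
        if not line:  # nothing left to read; the C++ version detects this via fs_.eof()
            self.is_eof = True
        return line.split()  # whitespace tokenization, like `iss >> word`

    def reset(self):
        # mirrors Reset(): reopen the file and clear the EOF flag
        self.fs = open(self.filename, 'rb')
        self.is_eof = False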
gensim/models/fast_line_sentence.h: 21 changes (21 additions, 0 deletions)

@@ -0,0 +1,21 @@
#pragma once

#include <fstream>
#include <sstream>
#include <vector>


class FastLineSentence {
public:
    explicit FastLineSentence();
    explicit FastLineSentence(const std::string& filename);

    std::vector<std::string> ReadSentence();
    inline bool IsEof() const { return is_eof_; }
    inline void Reset() { fs_ = std::ifstream(filename_); is_eof_ = false; }

private:
    std::string filename_;
    std::ifstream fs_;
    bool is_eof_;
};
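A side note on Reset() above: it move-assigns a fresh std::ifstream into fs_, which relies on C++11 stream move semantics. That is presumably why the PR adds C++11 compiler and linker flags (commits c09035c and d6755be).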
gensim/models/word2vec.py: 13 changes (13 additions, 0 deletions)

@@ -142,6 +142,7 @@

 try:
     from gensim.models.word2vec_inner import train_batch_sg, train_batch_cbow
+    from gensim.models.word2vec_inner import train_epoch_sg, train_epoch_cbow
     from gensim.models.word2vec_inner import score_sentence_sg, score_sentence_cbow
     from gensim.models.word2vec_inner import FAST_VERSION, MAX_WORDS_IN_BATCH

@@ -752,6 +753,18 @@ def __init__(self, sentences=None, input_streams=None, size=100, alpha=0.025, wi
             seed=seed, hs=hs, negative=negative, cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
             fast_version=FAST_VERSION)

+    def _do_train_epoch(self, input_stream, thread_private_mem, cur_epoch, total_examples=None, total_words=None):
+        work, neu1 = thread_private_mem
+
+        if self.sg:
+            examples, tally, raw_tally = train_epoch_sg(self, input_stream, cur_epoch, total_examples, total_words,
+                                                        work, neu1, self.compute_loss)
+        else:
+            examples, tally, raw_tally = train_epoch_cbow(self, input_stream, cur_epoch, total_examples, total_words,
+                                                          work, neu1, self.compute_loss)
+
+        return examples, tally, raw_tally
+
     def _do_train_job(self, sentences, alpha, inits):
         """Train the model on a single batch of sentences.

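After the input_streams API was replaced by corpus_file (commit 9ab6b1b), the merged feature is used by passing a path to a corpus saved in LineSentence format. A short sketch against the final API, using gensim's bundled toy corpus and the save_as_line_sentence helper added in this PR:

from gensim.models.word2vec import Word2Vec
from gensim.test.utils import common_texts
from gensim.utils import save_as_line_sentence

# write the corpus as one whitespace-separated sentence per line
save_as_line_sentence(common_texts, 'corpus.txt')

# corpus_file mode: workers read the file in parallel instead of sharing a job queue
model = Word2Vec(corpus_file='corpus.txt', workers=4, size=100, min_count=1)
print(model.wv.most_similar('computer'))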