File-based fast training for Any2Vec models #2127
Changes from 96 commits
@@ -0,0 +1,235 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word2Vec Multistream API Tutorial\n",

> **Review:** Eliminate "multistream" here too. — **Author:** done

"\n",
"This tutorial introduces the new **`corpus_file`** argument of the **`gensim.models.word2vec.Word2Vec`** model and shows how to use it.\n",
"\n",
"## Motivation\n",
"\n",
"Standard Word2Vec training with the `sentences` argument doesn't scale well when the number of workers is large, so the special `corpus_file` argument was added to tackle this problem. Training with `corpus_file` yields a **significant performance boost**: with 32 workers, training is 370% faster than with the `sentences` argument. It also beats the original Word2Vec C tool in words/sec processing speed.\n",
"\n",
"In exchange for these performance benefits, the `corpus_file` argument requires the path to a corpus file in `gensim.models.word2vec.LineSentence` format (one sentence per line, words separated by whitespace).\n",
"\n",
"\n",
"**Note**: you have to build `gensim` with Cython optimizations (`gensim.models.word2vec.MULTISTREAM_VERSION >= 0`) in order to use the `corpus_file` argument."

> **Review:** fix constant — **Author:** done

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## In this tutorial\n",
"\n",
"* We will show how to use the new API.\n",
"* We will compare the performance of the `corpus_file` and `sentences` arguments on English Wikipedia.\n",
"* We will show that accuracies on `questions-words.txt` are almost the same for both modes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage is really simple\n",
"\n",
"You only need to:\n",
"\n",
"1. Save your corpus in LineSentence format (you may use `gensim.utils.save_as_line_sentence(your_corpus, your_corpus_file)` to do so).\n",
"2. Change the `sentences=your_corpus` argument to `corpus_file=your_corpus_file` in the `Word2Vec.__init__`, `Word2Vec.build_vocab`, and `Word2Vec.train` calls.\n",

> **Review:** Add a little code example (with a small corpus), like:
> `corpus = api.load("text8")`
> `save_as_line_sentence(corpus, "my_corpus.txt")`
> `model = Word2Vec(corpus_file="my_corpus.txt", ...)`
> **Author:** For what reason? I'm already showing the example of usage, no?
> **Review:** This is a good tone: as a user, I open the notebook and immediately see how to use it (small & simple code); if I need something else, I'll scroll down.
> **Author:** 👍 agree

"\n",
"\n",
"Short example:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import gensim.downloader as api\n",
"from gensim.utils import save_as_line_sentence\n",
"from gensim.models.word2vec import Word2Vec\n",
"\n",
"corpus = api.load(\"text8\")\n",
"\n",
"save_as_line_sentence(corpus, \"my_corpus.txt\")\n",
"\n",
"model = Word2Vec(corpus_file=\"my_corpus.txt\", iter=5, size=300, workers=14)"
]
},
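An aside on the file format: the LineSentence layout that `save_as_line_sentence` writes (and `corpus_file` reads) is plain text, one sentence per line, tokens separated by spaces. A minimal pure-Python sketch of that layout (for illustration only; this is not gensim's actual implementation, and `my_corpus.txt` is just a scratch file):

```python
# Write a toy corpus in LineSentence format: one sentence per line,
# tokens separated by single spaces -- the layout `corpus_file` expects.
corpus = [["hello", "world"], ["another", "short", "sentence"]]

with open("my_corpus.txt", "w", encoding="utf-8") as fout:
    for sentence in corpus:
        fout.write(" ".join(sentence) + "\n")

# Read the file back to show the on-disk layout.
with open("my_corpus.txt", encoding="utf-8") as fin:
    lines = fin.read().splitlines()

print(lines)  # ['hello world', 'another short sentence']
```

Any iterable of token lists can be serialized this way before handing the resulting path to `corpus_file`.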
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's prepare the Wikipedia dataset\n",
"\n",
"We load the Wikipedia dump from `gensim-data`, preprocess it with gensim functions, and save the processed corpus in LineSentence format."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"CORPUS_FILE = 'wiki-en-20171001.txt'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import itertools\n",
"from gensim.parsing.preprocessing import preprocess_string\n",
"\n",
"def processed_corpus():\n",
"    raw_corpus = api.load('wiki-english-20171001')\n",
"    for article in raw_corpus:\n",
"        # Interleave section titles with section texts into one document.\n",
"        doc = '\\n'.join(itertools.chain.from_iterable(zip(article['section_titles'], article['section_texts'])))\n",
"        yield preprocess_string(doc)\n",
"\n",
"save_as_line_sentence(processed_corpus(), CORPUS_FILE)"
]
},
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Train Word2Vec\n", | ||
"\n", | ||
"We train two models:\n", | ||
"* With `sentences` argument\n", | ||
"* With `corpus_file` argument\n", | ||
"\n", | ||
"\n", | ||
"Then, we compare timings and accuracy on `question-words.txt`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from gensim.models.word2vec import LineSentence\n", | ||
"import time\n", | ||
"\n", | ||
"st_time = time.time()\n", | ||
"model_sent = Word2Vec(sentences=LineSentence(CORPUS_FILE), iter=5, size=300, workers=32)\n", | ||
"model_sent_training_time = time.time() - st_time\n", | ||
"\n", | ||
"st_time = time.time()\n", | ||
"model_corp_file = Word2Vec(corpus_file=CORPUS_FILE, iter=5, size=300, workers=32)\n", | ||
"model_corp_file_training_time = time.time() - st_time" | ||
] | ||
}, | ||
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training model with `sentences` took 8711.613 seconds\n",
"Training model with `corpus_file` took 2367.976 seconds\n"
]
}
],
"source": [
"print(\"Training model with `sentences` took {:.3f} seconds\".format(model_sent_training_time))\n",
"print(\"Training model with `corpus_file` took {:.3f} seconds\".format(model_corp_file_training_time))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Training with `corpus_file` took 3.7x less time!\n",
"\n",
"#### Now, let's compare the accuracies."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from gensim.test.utils import datapath"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/persiyanov/gensim/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
"  if np.issubdtype(vec.dtype, np.int):\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Word analogy accuracy with `sentences`: 0.754\n",
"Word analogy accuracy with `corpus_file`: 0.744\n"
]
}
],
"source": [
"model_sent_accuracy = model_sent.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `sentences`: {:.3f}\".format(model_sent_accuracy))\n",
"\n",
"model_corp_file_accuracy = model_corp_file.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]\n",
"print(\"Word analogy accuracy with `corpus_file`: {:.3f}\".format(model_corp_file_accuracy))"
]
},
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Accuracies are approximately the same." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.5" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
> **Review:** Don't forget to add doc2vec and fasttext here too (to both parts).