Refactor documentation for `*2Vec` models #1944

steremma · 2018-03-01T11:52:39Z

In the end this PR will fix the documentation for BaseWordEmbeddingsModel, BaseAny2Vec, Word2Vec and possibly Doc2Vec. At the moment of opening only the first is done; however, it's useful to keep it online to get feedback as early as possible.

menshikh-iv · 2018-03-01T13:06:59Z

gensim/models/base_any2vec.py

@@ -288,18 +281,67 @@ class BaseWordEmbeddingsModel(BaseAny2VecModel):

    """

-    def _clear_post_train(self):


Please don't change the code in current PR, only docstrings.


menshikh-iv · 2018-03-05T05:42:30Z

Great start @steremma, please continue!

somnathrakshit · 2018-03-05T16:40:38Z

Hey @steremma can I help you with the documentation? I have been working on the same.

steremma · 2018-03-05T17:16:11Z

Hello @somnathrakshit

I believe the PR is almost finished; however, I am unsure about certain arguments in some of the helper methods, specifically those that are conditionally defined in word2vec and doc2vec when the cython implementation is missing.

It would be nice if you could take a look at the type and description I gave for these arguments and let me know if you disagree or if you can improve any of them.

For more specific info feel free to ping me in gitter so that we don't spam this discussion.

Another way to cooperate might be for you to submit PRs to my fork directly instead of telling me what to change. I think that this way, after merging, you will get credit for your contribution as well.
@menshikh-iv is that correct?

menshikh-iv · 2018-03-06T17:12:02Z

@CLearERR @anotherbugmaster @yurkai - guys, please discuss with @steremma how to make the docstring guidelines better (we need a documentation guide for contributors)

menshikh-iv · 2018-03-07T04:41:21Z

gensim/models/base_any2vec.py

-    """
-    Base class containing common methods for training, using & evaluating word embeddings learning models.
-    For example - `Word2Vec`, `FastText`, etc.
+    """Base class containing common methods for training, using & evaluating word embeddings learning models.


Don't forget about docstrings for BaseAny2VecModel.

Also, don't forget about the _ methods; this is really important for people who will add *2vec implementations in the future.

menshikh-iv · 2018-03-07T04:43:30Z

gensim/models/base_any2vec.py

+            Can be simply a list of lists of tokens, but for larger corpora,
+            consider an iterable that streams the sentences directly from disk/network.
+            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
+            or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.


No need to mention :mod: here (because you already mentioned concrete class)

menshikh-iv · 2018-03-07T04:43:50Z

gensim/models/base_any2vec.py

+            consider an iterable that streams the sentences directly from disk/network.
+            See :class:`~gensim.models.word2vec.BrownCorpus`, :class:`~gensim.models.word2vec.Text8Corpus`
+            or :class:`~gensim.models.word2vec.LineSentence` in :mod:`~gensim.models.word2vec` module for such examples.
+        workers : int


Almost all arguments are optional, no?

menshikh-iv · 2018-03-07T04:44:05Z

gensim/models/base_any2vec.py

+
+        Parameters
+        ----------
+        sentences : iterable of iterable of str


"iterable of list of str" is better (here and everywhere)

menshikh-iv · 2018-03-07T04:45:58Z

gensim/models/base_any2vec.py

+            List of callbacks that need to be executed/run at specific stages during training.
+        batch_words : int
+            Number of words to be processed by a single job.
+        trim_rule : function, optional


The function signature needs to be described in the description of the parameter. I see (word, count, min_count), but I need types here, and it should be shown more explicitly (maybe in a new sentence).
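One way to make the signature explicit is to spell out the types of all three arguments in a numpy-style docstring. This is only a sketch, assuming the `(word, count, min_count)` form mentioned above. The `RULE_*` constants are local stand-ins so the snippet runs without gensim (gensim exposes constants with the same names in `gensim.utils`), and `example_trim_rule` is a hypothetical name:

```python
# Local stand-ins so the sketch runs without gensim installed (assumption:
# the real constants are gensim.utils.RULE_DEFAULT / RULE_DISCARD / RULE_KEEP).
RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2


def example_trim_rule(word, count, min_count):
    """Decide whether a word stays in the vocabulary.

    Parameters
    ----------
    word : str
        The token being considered.
    count : int
        Frequency of `word` in the corpus.
    min_count : int
        Minimum frequency threshold of the model.

    Returns
    -------
    int
        One of `RULE_KEEP`, `RULE_DISCARD` or `RULE_DEFAULT`.

    """
    if word.startswith("__"):
        return RULE_KEEP  # hypothetical convention: always keep special tokens
    if count < min_count:
        return RULE_DISCARD
    return RULE_DEFAULT
```

With the types written out like this, the `trim_rule` parameter description can simply say "function with signature `(word: str, count: int, min_count: int) -> int`".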

menshikh-iv · 2018-03-07T04:47:33Z

gensim/models/base_any2vec.py

-        keep_raw_vocab : bool
-            If not true, delete the raw vocabulary after the scaling is done and free up RAM.
-        corpus_count : int
+        word_freq : dict of (unicode str, int)


dict of (str, int)

menshikh-iv · 2018-03-07T04:48:01Z

gensim/models/base_any2vec.py

+
+        Returns
+        -------
+        dict of (str, int), optional


How can Returns be optional?

menshikh-iv · 2018-03-07T04:48:48Z

gensim/models/word2vec.py

+Examples
+--------
+
+#. Initialize a model with e.g.::


Don't forget to fix the example (it should work and demonstrate more methods)

menshikh-iv · 2018-03-07T04:49:33Z

gensim/models/word2vec.py

-        Obtain likelihood score for a single sentence in a fitted skip-gram representaion.
-        The sentence is a list of Vocab objects (or None, when the corresponding
-        word is not in the vocabulary). Called internally from `Word2Vec.score()`.
+        Obtain likelihood score for a single sentence in a fitted skip-gram representation.


should be

"""Obtain ...

here and everywhere
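The requested layout puts the summary on the same line as the opening quotes. A small sketch with a hypothetical function (`score_sentence_sketch` is not part of gensim and only illustrates the layout):

```python
def score_sentence_sketch(sentence):
    """Obtain a toy score for a single sentence.

    Parameters
    ----------
    sentence : list of str
        Tokens of the sentence.

    Returns
    -------
    int
        Number of tokens, used here as a placeholder score.

    """
    return len(sentence)


# The summary is the very first line of the docstring, with no leading newline.
first_line = score_sentence_sketch.__doc__.splitlines()[0]
```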

menshikh-iv · 2018-03-07T04:51:40Z

gensim/models/doc2vec.py

@@ -512,32 +648,49 @@ def train(self, documents, total_examples=None, total_words=None,
            queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)

    def _raw_word_count(self, job):
-        """Return the number of words in a given job."""
+        """Return the number of words in a given job.


Try to always use Get instead of Return in docstrings (here and everywhere)

menshikh-iv

In general - looks awesome! Good work 👍

Missing things (at the moment)

Examples for d2v/w2v
Fasttext
Poincare
Docstrings for .pyx files (this will be done in last order, we'll discuss it later)

menshikh-iv · 2018-03-14T05:17:27Z

gensim/models/base_any2vec.py

@@ -28,18 +28,40 @@

 class BaseAny2VecModel(utils.SaveLoad):
    """Base class for training, using and evaluating any2vec model.
-    Contains implementation for multi-threaded training.
+


Fix description in the head of current file please (make it more detailed, links to several related classes, etc).

menshikh-iv · 2018-03-14T05:17:56Z

gensim/models/base_any2vec.py

        A subclass should initialize the following attributes:
        - self.kv (instance of concrete implementation of `BaseKeyedVectors` interface)
        - self.vocabulary (instance of concrete implementation of `BaseVocabBuilder` abstract class)
        - self.trainables (instance of concrete implementation of `BaseTrainables` abstract class)

+        Parameters
+        ----------
+        workers : int


optional parameters (here and everywhere)

menshikh-iv · 2018-03-14T05:18:11Z

gensim/models/base_any2vec.py

+        Parameters
+        ----------
+        workers : int
+            Number of working threads, used for multiprocessing.


multithreading :)

menshikh-iv · 2018-03-14T05:19:11Z

gensim/models/base_any2vec.py

+        ----------
+        job_queue : Queue of (list of object, dict)
+            A queue of jobs still to be processed. The worker will take up jobs from this queue.
+            Each job is represented by a tuple where the first element is the corpus chunk to be processed and


Better to add a "toy" example showing what an element of the queue looks like
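A toy illustration of what one queue element might look like, assuming the `(list of object, dict)` shape from the docstring above. The `alpha` key and the `None` stop sentinel are illustrative assumptions, not taken from this PR:

```python
from queue import Queue

job_queue = Queue(maxsize=2)

# One job: (corpus chunk, job parameters).
data_chunk = [["human", "interface", "computer"], ["graph", "trees"]]
job_params = {"alpha": 0.025}  # hypothetical parameter name
job_queue.put((data_chunk, job_params))
job_queue.put(None)  # hypothetical sentinel telling the worker to stop

# A worker would take jobs from the queue and unpack each one like this.
processed = []
while True:
    job = job_queue.get()
    if job is None:
        break
    chunk, params = job
    processed.append((len(chunk), params["alpha"]))
```

Here `processed` records, for each job, how many sentences the chunk held and which learning rate came with it.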

menshikh-iv · 2018-03-14T05:20:48Z

gensim/models/base_any2vec.py

+        Parameters
+        ----------
+        data_iterator : iterable of list of object
+            The input corpus. This will be split in chunks and these chunks will be pushed to the queue.


not always corpus

menshikh-iv · 2018-03-14T05:21:01Z

gensim/models/base_any2vec.py

+        ----------
+        data_iterator : iterable of list of object
+            The input corpus. This will be split in chunks and these chunks will be pushed to the queue.
+        job_queue : Queue of (list of object, dict)


dict of ? here and everywhere

menshikh-iv · 2018-03-14T05:22:32Z

gensim/models/base_any2vec.py

+            Multiplier for size of queue -> size = number of workers * queue_factor.
+        report_delay : float, optional
+            Number of seconds between two consecutive progress report messages in the logger.
+        callbacks : list of :class: `~gensim.models.callbacks.CallbackAny2Vec`, optional


No need to add a space between :class: and the link to the model. Please check all of them in the rendered documentation (how they look); this is important.
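A quick way to catch this before rendering is to scan docstrings for a role followed by whitespace. The sketch below is an assumption about how such a check could look; `train_sketch` and `ROLE_WITH_SPACE` are hypothetical names, and only a handful of common Sphinx roles are covered:

```python
import re

# Matches a Sphinx role that is separated from its target by whitespace.
ROLE_WITH_SPACE = re.compile(r":(?:class|meth|mod|func):\s+`")


def train_sketch(callbacks=()):
    """Run a toy training loop.

    Parameters
    ----------
    callbacks : list of :class:`~gensim.models.callbacks.CallbackAny2Vec`, optional
        Callbacks to run at specific stages during training.

    """
    return len(callbacks)


# Empty list: the role above is attached directly to its target.
broken_roles = ROLE_WITH_SPACE.findall(train_sketch.__doc__)
```

This does not replace looking at the rendered pages, but it flags the spacing mistake cheaply.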

menshikh-iv · 2018-03-14T05:24:34Z

gensim/models/base_any2vec.py

@@ -526,7 +806,7 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No
            len(raw_vocab), sum(itervalues(raw_vocab))
        )

-        # Since no sentences are provided, this is to control the corpus_count
+        # Since no sentences are provided, this is to control the `corpus_count`


Backticks make no sense in code comments (those written as # ...)

menshikh-iv · 2018-03-14T05:25:24Z

gensim/models/base_any2vec.py

+        Returns
+        -------
+        (np.ndarray, np.ndarray)
+            Each worker threads private work memory.


don't forget about empty line at the end of each docstring (except one-line docstrings), here and everywhere

menshikh-iv · 2018-03-14T05:26:50Z

gensim/models/word2vec.py

+        ----------
+        model : :class:`~gensim.models.word2Vec.Word2Vec`
+            The Word2Vec model instance to train.
+        sentences : iterable of iterable of str


"iterable of list of str" is better (here and everywhere)

menshikh-iv · 2018-03-14T05:31:27Z

@steremma Once you finish everything mentioned above, here are some hints about docstrings for .pyx files:

Add a special flag to the head of the file, like here: https://github.com/RaRe-Technologies/gensim/blob/985d552a1ca37b96629cbc1037100fa0a21f5ba5/gensim/_matutils.pyx#L3
Write docstrings for all methods in the .pyx files
Run cython path/to/your/file (please use cython>=0.27 for this purpose)
Add the needed *.rst file for this file
Run tox -e docs (you also need to add the new file to apiref.rst, of course)

menshikh-iv

Awesome work @steremma, you are one of the best persons who worked on documentation 🔥

So, here are the additional things that should be done before a merge:

More examples (wider for module + examples in concrete methods)
Cover .pyx files (see the instructions in my earlier comment on this PR)

menshikh-iv · 2018-03-29T14:47:09Z

gensim/models/base_any2vec.py

+-----
+Even though this is the usual case, not all embeddings transform text.
+For example :class:`~gensim.models.poincare.PoincareModel` operates on graph representations.
+


Add a See Also section here and mention several concrete implementations such as w2v, fasttext, etc.

menshikh-iv · 2018-03-29T14:47:38Z

gensim/models/base_any2vec.py


    """

    def __init__(self, workers=3, vector_size=100, epochs=5, callbacks=(), batch_words=10000):
        """Initialize model parameters.

+        Notes


A better place for these notes is the class docstring (instead of the __init__ docstring)

menshikh-iv · 2018-03-29T14:49:10Z

gensim/models/base_any2vec.py

+        ------
+        IOError
+            When methods are called on instance (should be called from class).
+        """


Missing empty line at the end of the docstring (here and everywhere)

menshikh-iv · 2018-03-29T14:51:32Z

gensim/models/base_any2vec.py

+        Parameters
+        ----------
+        job_params : dict of (str, obj)
+            Unused. TODO: Delete this.


You can write something like UNUSED. (without TODO)

menshikh-iv · 2018-03-29T14:52:13Z

gensim/models/doc2vec.py


+    >>> from gensim.test.utils import common_texts


No need to add \t before >>> (here and everywhere)

menshikh-iv · 2018-03-29T14:59:22Z

gensim/models/doc2vec.py

        return 60 * len(self.docvecs.offset2doctag) + 140 * len(self.docvecs.doctags)

    def infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5):
-        """
-        Infer a vector for given post-bulk training document.
+        """Infer a vector for given post-bulk training document.


Add a warning along the lines of: "This infers a new vector each time, so it is fine to retrieve different vectors for the same document; increase steps for more stable representations". Also add a similar but short tip to the steps parameter.

menshikh-iv · 2018-03-29T15:00:52Z

gensim/models/fasttext.py

@@ -45,6 +81,7 @@
 logger = logging.getLogger(__name__)

 try:
+    raise ImportError


I wanted to make sure that doctests work even without numpy installed, so I did this hack to force my computer to use the python implementation. And then I forgot to remove it!

menshikh-iv · 2018-03-29T15:01:43Z

gensim/models/fasttext.py

@@ -519,20 +601,31 @@ def train(self, sentences, total_examples=None, total_words=None,
        self.trainables.get_vocab_word_vecs(self.wv)

    def init_sims(self, replace=False):
-        """
+        """Deletes the keyed vector syn1 structure.


menshikh-iv · 2018-03-29T15:02:57Z

gensim/models/poincare.py

@@ -73,22 +74,32 @@ class PoincareModel(utils.SaveLoad):
    and :meth:`~gensim.models.poincare.PoincareModel.load` methods, or stored/loaded in the word2vec format
    via `model.kv.save_word2vec_format` and :meth:`~gensim.models.poincare.PoincareKeyedVectors.load_word2vec_format`.

-    Note that training cannot be resumed from a model loaded via `load_word2vec_format`, if you wish to train further,
+    Notes


What about examples here?

We have example for each method in the PoincareKeyedVectors class.

menshikh-iv · 2018-03-29T15:03:33Z

gensim/models/poincare.py

-        """Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`."""
+        """Load model from disk, inherited from :class:`~gensim.utils.SaveLoad`.
+
+        See also :meth:`~gensim.models.poincare.PoincareModel.save`


Replace every inline "See also" with a proper section:

See also ------------ ...

here and everywhere
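As a sketch of the requested layout, assuming numpydoc's `See Also` section header: the class below is a hypothetical stand-in (not the real `PoincareModel`) used only to show where the section goes and how the underline lines up.

```python
class PoincareModelSketch:
    """Hypothetical stand-in used only to illustrate the docstring layout."""

    def save(self, path):
        """Save the model to disk.

        See Also
        --------
        :meth:`~gensim.models.poincare.PoincareModel.load`
            Load a previously saved model.

        """
        return path


# The section header is followed by an underline of the same length.
doc_lines = [line.strip() for line in PoincareModelSketch.save.__doc__.splitlines()]
header_index = doc_lines.index("See Also")
```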

piskvorky · 2018-06-14T22:04:59Z

PR continued in #2087.

* Remove useless methods
* started working on docstrings
* more work done
* Finished documentation for the `BaseWordEmbeddingsModel
* PEP-8
* Revert "Remove useless methods" This reverts commit feb3c32.
* added documentation for the class and all its helper methods
* remove duplicated type info
* Added documentation for `Doc2vec` model and all its helper methods
* Fixed paper references and added documentation for `Doc2VecVocab`
* Fixed paper references
* minor referencing fixes
* sphinx identation
* Added docstrings for the private methods in `BaseAny2Vec`
* Applied all code review corrections, example fix still pending
* Added missing docstrings
* Fixed `int {1, 0}` -> `{1, 0}`
* Fixed examples and code review corrections
* Fixed examples and applied code review corrections (optional arguments, correct types, blank lines at end of docstrings
* Applied code review corrections and added top level usage examples
* Added high level explanation of the class hierarchy, fixed code review corrections
* Final identation fixes
* Documentation fixes
* Fixed all examples
* delete redundant reference to module
* Added explanation for all important class attributes. These include some intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory.
* documented public cython functions
* documented public cython functions in doc2vec
* Applied code review corrections
* added documentation for public cython methods in `fasttext`
* added documentation for C functions in the word2vec
* fix build issues
* add missing rst
* fix base_any2vec
* fix doc2vec[1]
* fix doc2vec[2]
* fix doc2vec[3ъ
* fix doc2vec[4]
* fix doc2vec_inner + remove unused imports
* fix fasttext[1]
* reformat example sections
* word2vec doc fixes
* more doc fixes
* merging in changes from #1944
* review docs for doc2vec, base_any2vec
* review fasttext docs
* review poincare docs
* minor typo fixes
* simplify word2vec.train() docs
* update alpha & epoch docs for *2vec models
* add *_inner docs
* fixing KeyedVectors docs
* disable sphinx latex and errors (temporary, revert later)
* hyperlink fixes
* Fix build warnings
* fix flake8
* enable strict doc building
* embedsignature for w2v & ft
* yes/no -> ✅/❌
* cleanup base_any2vec
* clenup cython files
* cleanup doc2vec
* improve d2v example
* cleanup fasttext
* clenup utils_any2vec
* clenup poincare
* clenup keyedvectors
* cleanup word2vec
* add newline around module docstrings + re-generate *.c files (for correct doc building)

steremma and others added 5 commits February 28, 2018 16:11

Remove useless methods

feb3c32

started working on docstrings

52eb1b3

more work done

cb7b71a

Finished documentation for the `BaseWordEmbeddingsModel

347cdb0

PEP-8

327afc5

menshikh-iv reviewed Mar 1, 2018

Revert "Remove useless methods"

bb8e3a3

This reverts commit feb3c32.

added documentation for the class and all its helper methods

7e89ca9

menshikh-iv mentioned this pull request Mar 5, 2018

Update word2vec model docstring to numpy-style #1923

Closed

steremma added 2 commits March 5, 2018 16:32

remove duplicated type info

e0fe665

Added documentation for Doc2vec model and all its helper methods

8aa85bc

steremma added 4 commits March 6, 2018 11:20

Fixed paper references and added documentation for Doc2VecVocab

7c74a4c

Fixed paper references

e92b9b4

minor referencing fixes

9093eab

sphinx identation

c07afa4

menshikh-iv suggested changes Mar 7, 2018

steremma and others added 5 commits March 8, 2018 10:29

Added docstrings for the private methods in BaseAny2Vec

4a14a3e

Applied all code review corrections, example fix still pending

a7f3f0e

Added missing docstrings

69d524d

Fixed int {1, 0} -> {1, 0}

4707c37

Fixed examples and code review corrections

3a85ac5

menshikh-iv changed the title ~~[DNM] Document any2vec~~ Refactor documentation for *2Vec models. Mar 14, 2018

menshikh-iv changed the title ~~Refactor documentation for *2Vec models.~~ Refactor documentation for *2Vec models Mar 14, 2018

menshikh-iv added the incubator project label Mar 14, 2018

menshikh-iv suggested changes Mar 14, 2018

steremma and others added 4 commits March 20, 2018 11:44

delete redundant reference to module

7cb408c

Added explanation for all important class attributes. These include some intuitive information taken from the papers but also references to usage examples for users that do not wish to understand the underlying theory.

5b6d815

documented public cython functions

f58e9a2

documented public cython functions in doc2vec

6570cef

menshikh-iv suggested changes Mar 29, 2018

steremma added 3 commits March 30, 2018 16:39

Applied code review corrections

0e8d299

added documentation for public cython methods in fasttext

86a6d23

added documentation for C functions in the word2vec

dc2f93e

menshikh-iv added the RFM label Apr 3, 2018

menshikh-iv added 7 commits April 11, 2018 17:45

fix build issues

f78348f

add missing rst

cec8c44

fix base_any2vec

585f81f

fix doc2vec[1]

b5d84ff

fix doc2vec[2]

6f32e78

fix doc2vec[3ъ

2e3a0b7

Merge branch 'develop' into document-any2vec

297b48e

menshikh-iv mentioned this pull request Apr 16, 2018

Add Sent2Vec model. Fix #1376 #1619

Closed

menshikh-iv added 4 commits April 18, 2018 17:33

fix doc2vec[4]

2d9616c

fix doc2vec_inner + remove unused imports

2fcd2f1

fix fasttext[1]

7cbbac9

reformat example sections

0e9e6c5

menshikh-iv mentioned this pull request Apr 24, 2018

Documentation on jointly learning feature representations with a higher task #2036

Closed

piskvorky mentioned this pull request Apr 30, 2018

Documentation fixes #2037

Open

piskvorky added a commit that referenced this pull request Jun 10, 2018

merging in changes from #1944

4d99889

piskvorky mentioned this pull request Jun 10, 2018

Fix documentation for *2vec models #2087

Merged

piskvorky closed this Jun 14, 2018
