
Fix documentation for gensim.corpora. Partial fix #1671 #1729

Merged · 54 commits · Jan 22, 2018

Commits
b260d4b
Fix typo
anotherbugmaster Sep 30, 2017
36d98d1
Make `save_corpus` private
anotherbugmaster Oct 2, 2017
981ebbb
Annotate `bleicorpus.py`
anotherbugmaster Oct 2, 2017
3428113
Make __save_corpus weakly private
anotherbugmaster Oct 2, 2017
69fc7e0
Fix _save_corpus in tests
anotherbugmaster Oct 2, 2017
b65a69a
Fix _save_corpus[2]
anotherbugmaster Oct 3, 2017
6fa92f3
Merge remote-tracking branch 'upstream/develop' into develop
anotherbugmaster Oct 15, 2017
78e207d
Document bleicorpus in Numpy style
anotherbugmaster Oct 24, 2017
7519382
Document indexedcorpus
anotherbugmaster Oct 24, 2017
ae69867
Annotate csvcorpus
anotherbugmaster Nov 3, 2017
c2765ed
Add "Yields" section
anotherbugmaster Nov 3, 2017
40add21
Make `_save_corpus` public
anotherbugmaster Nov 3, 2017
e044c3a
Annotate bleicorpus
anotherbugmaster Nov 3, 2017
123327d
Fix indentation in bleicorpus
anotherbugmaster Nov 3, 2017
2382d01
`_save_corpus` -> `save_corpus`
anotherbugmaster Nov 21, 2017
42409bf
Annotate bleicorpus
anotherbugmaster Nov 21, 2017
7cb5bbf
Convert dictionary docs to numpy style
anotherbugmaster Nov 21, 2017
56f19e6
Convert hashdictionary docs to numpy style
anotherbugmaster Nov 21, 2017
9162a7e
Convert indexedcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
5eaaac4
Convert lowcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
3b6b076
Convert malletcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
d7f3fc8
Convert mmcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
c46bff4
Convert sharded_corpus docs to numpy style
anotherbugmaster Nov 21, 2017
7823546
Convert svmlightcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
9878133
Convert textcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
dba4429
Convert ucicorpus docs to numpy style
anotherbugmaster Nov 21, 2017
6a95c94
Convert wikicorpus docs to numpy style
anotherbugmaster Nov 21, 2017
6dcfb07
Add sphinx tweaks
anotherbugmaster Nov 21, 2017
2f61fc3
Merge remote-tracking branch 'upstream/develop' into develop
anotherbugmaster Nov 21, 2017
ac01abb
Merge branch 'develop' into fix_1605
anotherbugmaster Nov 21, 2017
833ec64
Remove trailing whitespaces
anotherbugmaster Nov 21, 2017
e656609
Merge branch 'develop' into fix_1605
anotherbugmaster Nov 23, 2017
3e597fe
Annotate wikicorpus
anotherbugmaster Nov 28, 2017
da1d5c2
SVMLight Corpus annotated
anotherbugmaster Dec 5, 2017
89f6098
Fix TODO
anotherbugmaster Dec 5, 2017
9eeea21
Fix grammar mistake
anotherbugmaster Dec 6, 2017
2b6aeaf
Undo changes to dictionary
anotherbugmaster Dec 7, 2017
9b17057
Undo changes to hashdictionary
anotherbugmaster Dec 7, 2017
de3ea0f
Document indexedcorpus
anotherbugmaster Dec 9, 2017
dafc373
Document indexedcorpus[2]
anotherbugmaster Dec 10, 2017
ff980bc
Merge upstream
anotherbugmaster Jan 9, 2018
0189d8d
Remove redundant files
anotherbugmaster Jan 11, 2018
943406c
Merge upstream
anotherbugmaster Jan 16, 2018
57cb5a3
Add more dots. :)
anotherbugmaster Jan 16, 2018
08ca492
Fix monospace
anotherbugmaster Jan 16, 2018
381fb97
remove useless method
menshikh-iv Jan 18, 2018
5b5701a
fix bleicorpus
menshikh-iv Jan 18, 2018
0e5c0cf
fix csvcorpus
menshikh-iv Jan 18, 2018
627c0e5
fix indexedcorpus
menshikh-iv Jan 18, 2018
b771bb5
fix svmlightcorpus
menshikh-iv Jan 18, 2018
d76af8d
fix wikicorpus[1]
menshikh-iv Jan 18, 2018
7fe753f
fix wikicorpus[2]
menshikh-iv Jan 18, 2018
a9eb1a3
fix wikicorpus[3]
menshikh-iv Jan 18, 2018
e3a8ebf
fix review comments
menshikh-iv Jan 22, 2018
96 changes: 73 additions & 23 deletions gensim/corpora/bleicorpus.py
@@ -5,9 +5,7 @@
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


"""
Blei's LDA-C format.
"""
"""Corpus in Blei's LDA-C format."""

from __future__ import with_statement

@@ -19,30 +17,39 @@
from six.moves import xrange


logger = logging.getLogger('gensim.corpora.bleicorpus')
logger = logging.getLogger(__name__)


class BleiCorpus(IndexedCorpus):
"""
Corpus in Blei's LDA-C format.
"""Corpus in Blei's LDA-C format.

The corpus is represented as two files: one describing the documents, and another
describing the mapping between words and their ids.

Each document is one line::

N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN


The vocabulary is a file with words, one word per line; word at line K has an implicit `id=K`.

The vocabulary is a file with words, one word per line; word at line K has an
implicit ``id=K``.
"""

def __init__(self, fname, fname_vocab=None):
"""
Initialize the corpus from a file.

`fname_vocab` is the file with vocabulary; if not specified, it defaults to
`fname.vocab`.
Parameters
----------
fname : str
File path to Serialized corpus.
Review comment (anotherbugmaster): Path to corpus here and in other corpora maybe?
fname_vocab : str, optional
Vocabulary file. If `fname_vocab` is None, searching for the vocab.txt or `fname_vocab`.vocab file.
Review thread (anotherbugmaster): Are you sure it's `fname_vocab`.vocab? `fname_vocab` is None, isn't it?
Reply (contributor): Yep
Reply (contributor): Not quite, I added the correct description.
Reply (anotherbugmaster, Jan 25, 2018): Still don't get it. It should be `fname`.vocab; `fname_vocab`.vocab is undefined!
Reply (contributor): Not quite :) I went through the code with ipdb for this case; it is significantly "wider" than what we discuss here (I already fixed it).
Reply (anotherbugmaster, Jan 21, 2018): Suggested wording: "Vocabulary file. If fname_vocab is None, searching for the vocab.txt or fname.vocab file."

Raises
------
IOError
If vocabulary file doesn't exist.

"""
IndexedCorpus.__init__(self, fname)
logger.info("loading corpus from %s", fname)
@@ -67,8 +74,13 @@ def __init__(self, fname, fname_vocab=None):
self.id2word = dict(enumerate(words))

def __iter__(self):
"""
Iterate over the corpus, returning one sparse vector at a time.
"""Iterate over the corpus, returning one sparse (BoW) vector at a time.

Yields
------
list of (int, float)
Document's BoW representation.

"""
lineno = -1
with utils.smart_open(self.fname) as fin:
@@ -77,6 +89,19 @@ def __iter__(self):
self.length = lineno + 1

def line2doc(self, line):
"""Convert line in Blei LDA-C format to document (BoW representation).

Parameters
----------
line : str
Line in Blei's LDA-C format.

Returns
-------
list of (int, float)
Document's BoW representation.

"""
parts = utils.to_unicode(line).split()
if int(parts[0]) != len(parts) - 1:
raise ValueError("invalid format in %s: %s" % (self.fname, repr(line)))
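As a standalone illustration of the conversion `line2doc` documents above, here is a minimal sketch of parsing one LDA-C line into a BoW list (assumed helper name, not gensim's actual implementation):

```python
def parse_ldac_line(line):
    """Parse one LDA-C line, "N id1:val1 ... idN:valN", into a BoW list."""
    parts = line.split()
    # The leading count N must match the number of id:value fields.
    if int(parts[0]) != len(parts) - 1:
        raise ValueError("invalid LDA-C line: %r" % line)
    return [(int(fid), float(fval))
            for fid, fval in (part.split(":") for part in parts[1:])]

print(parse_ldac_line("2 0:1.0 3:2.5"))  # [(0, 1.0), (3, 2.5)]
```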
@@ -86,14 +111,28 @@ def line2doc(self, line):

@staticmethod
def save_corpus(fname, corpus, id2word=None, metadata=False):
"""
Save a corpus in the LDA-C format.

There are actually two files saved: `fname` and `fname.vocab`, where
`fname.vocab` is the vocabulary file.
"""Save a corpus in the LDA-C format.

Notes
-----
There are actually two files saved: `fname` and `fname.vocab`, where `fname.vocab` is the vocabulary file.

Parameters
----------
fname : str
Path to output filename.
Review comment (anotherbugmaster): "To output file"
corpus : iterable of iterable of (int, float)
Input corpus
Review comment (anotherbugmaster): Obvious, no additional information provided. There's no need to have descriptions for all arguments. :)
Review comment (anotherbugmaster): Still think that it's not necessary. Also, there's a dot missing at the end of the line.
id2word : dict of (str, str), optional
Mapping id -> word for `corpus`.
metadata : bool, optional
THIS PARAMETER WILL BE IGNORED.

Returns
-------
list of int
Offsets for each line in file (in bytes).

This function is automatically called by `BleiCorpus.serialize`; don't
call it directly, call `serialize` instead.
"""
if id2word is None:
logger.info("no word id mapping provided; initializing from corpus")
@@ -121,8 +160,19 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
return offsets

def docbyoffset(self, offset):
"""
Return the document stored at file position `offset`.
"""Get document corresponding to `offset`,
Review comment (anotherbugmaster): First line of docstring should always end with a dot.
Review comment (anotherbugmaster): The first line should end with a dot.
offset can be given from :meth:`~gensim.corpora.bleicorpus.BleiCorpus.save_corpus`.

Parameters
----------
offset : int
Position of the document in the file (in bytes).

Returns
-------
list of (int, float)
Review comment (contributor): Missing parameter description (here and everywhere)
Document in BoW format.

"""
with utils.smart_open(self.fname) as f:
f.seek(offset)
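To make the offsets returned by `save_corpus` concrete, here is a minimal standalone sketch (assumed names, not gensim's code) that writes BoW documents as LDA-C lines and records the byte offset at which each document starts:

```python
import io

def save_ldac(stream, corpus):
    """Write each BoW document as one LDA-C line; return byte offsets."""
    offsets = []
    for doc in corpus:
        offsets.append(stream.tell())  # offset of this document, in bytes
        fields = " ".join("%d:%s" % (wid, val) for wid, val in doc)
        stream.write(("%d %s\n" % (len(doc), fields)).encode("utf-8"))
    return offsets

buf = io.BytesIO()
offsets = save_ldac(buf, [[(1, 0.5)], [(0, 1.0), (1, 2.0)]])
print(offsets)  # [0, 8] -- the first line, "1 1:0.5\n", is 8 bytes long
```

These offsets are exactly what an indexed corpus later feeds to `docbyoffset` for random access.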
30 changes: 18 additions & 12 deletions gensim/corpora/csvcorpus.py
@@ -4,10 +4,7 @@
# Copyright (C) 2013 Zygmunt Zając <zygmunt@fastml.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Corpus in CSV format.

"""
"""Corpus in CSV format."""


from __future__ import with_statement
@@ -22,18 +19,24 @@


class CsvCorpus(interfaces.CorpusABC):
"""
Corpus in CSV format. The CSV delimiter, headers etc. are guessed automatically
based on the file content.
"""Corpus in CSV format.

The CSV delimiter, headers etc. are guessed automatically based on the
file content.

All row values are expected to be ints/floats.

"""

def __init__(self, fname, labels):
"""
Initialize the corpus from a file.
`labels` = are class labels present in the input file? => skip the first column
"""Initialize the corpus from a file.

Parameters
----------
fname : str
Filename.
labels : bool
Whether to skip the first column.

"""
logger.info("loading corpus from %s", fname)
@@ -48,8 +51,11 @@ def __init__(self, fname, labels):
logger.info("sniffed CSV delimiter=%r, headers=%s", self.dialect.delimiter, self.headers)

def __iter__(self):
"""
Iterate over the corpus, returning one sparse vector at a time.
"""Iterate over the corpus, returning one sparse vector at a time.

Yields
------
list of (int, float)

"""
reader = csv.reader(utils.smart_open(self.fname), self.dialect)
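The automatic delimiter/header guessing that `CsvCorpus.__init__` relies on comes from the standard library's `csv.Sniffer`; a small sketch of the same idea (illustrative sample data, not gensim code):

```python
import csv
import io

sample = "id;f1;f2\n1;0.5;2.0\n2;1.5;0.0\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # guesses the ';' delimiter
has_header = sniffer.has_header(sample)  # first row is non-numeric labels

reader = csv.reader(io.StringIO(sample), dialect)
rows = list(reader)
if has_header:
    rows = rows[1:]  # skip the header row, as CsvCorpus does

print(dialect.delimiter, rows)
```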
106 changes: 72 additions & 34 deletions gensim/corpora/indexedcorpus.py
@@ -5,17 +5,7 @@
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


"""
Indexed corpus is a mechanism for random-accessing corpora.

While the standard corpus interface in gensim allows iterating over corpus with
`for doc in corpus: pass`, indexed corpus allows accessing the documents with
`corpus[docno]` (in O(1) look-up time).

This functionality is achieved by storing an extra file (by default named the same
as the corpus file plus '.index' suffix) that stores the byte offset of the beginning
of each document.
"""
"""Base Indexed Corpus class."""

import logging
import six
@@ -28,13 +18,32 @@


class IndexedCorpus(interfaces.CorpusABC):
"""Indexed corpus is a mechanism for random-accessing corpora.

While the standard corpus interface in gensim allows iterating over
corpus with `for doc in corpus: pass`, indexed corpus allows accessing
the documents with `corpus[docno]` (in O(1) look-up time).

Notes
-----
This functionality is achieved by storing an extra file (by default
named the same as the '{corpus name}.index') that stores the byte
offset of the beginning of each document.

"""

def __init__(self, fname, index_fname=None):
"""
Initialize this abstract base class, by loading a previously saved index
from `index_fname` (or `fname.index` if `index_fname` is not set).
This index will allow subclasses to support the `corpus[docno]` syntax
(random access to document #`docno` in O(1)).
"""Initialize the corpus.

Parameters
----------
fname : string
Filename.
index_fname : string or None
Index filename, or None for loading `fname`.index.

Examples
--------
>>> # save corpus in SvmLightCorpus format with an index
>>> corpus = [[(1, 0.5)], [(0, 1.0), (1, 2.0)]]
>>> gensim.corpora.SvmLightCorpus.serialize('testfile.svmlight', corpus)
@@ -58,22 +67,31 @@ def __init__(self, fname, index_fname=None):
@classmethod
def serialize(serializer, fname, corpus, id2word=None, index_fname=None,
progress_cnt=None, labels=None, metadata=False):
"""
Iterate through the document stream `corpus`, saving the documents to `fname`
and recording byte offset of each document. Save the resulting index
structure to file `index_fname` (or `fname`.index is not set).

This relies on the underlying corpus class `serializer` providing (in
addition to standard iteration):

* `save_corpus` method that returns a sequence of byte offsets, one for
each saved document,
* the `docbyoffset(offset)` method, which returns a document
positioned at `offset` bytes within the persistent storage (file).
* metadata if set to true will ensure that serialize will write out article titles to a pickle file.

Example:

"""Iterate through the document stream `corpus`.

Saving the documents to
`fname` and recording byte offset of each document.

Parameters
----------
fname : str
Filename.
corpus : iterable
Iterable of documents.
id2word : dict of (str, str), optional
Transforms id to word.
index_fname : str
Where to save resulting index. Saved to `fname`.index if None.
progress_cnt : int
Number of documents after which progress info is printed.
labels : bool
Whether to skip the first column (class labels).
metadata : bool
If True will ensure that serialize will write out
article titles to a pickle file. (Default value = False).

Examples
--------
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
@@ -108,8 +126,15 @@ def serialize(serializer, fname, corpus, id2word=None, index_fname=None,

def __len__(self):
"""
Return the index length if the corpus is indexed. Otherwise, make a pass
over self to calculate the corpus length and cache this number.
Return the index length.

If the corpus is not indexed, also count corpus length and cache this
value.

Returns
-------
int

"""
if self.index is not None:
return len(self.index)
@@ -119,11 +144,24 @@ def __len__(self):
return self.length

def __getitem__(self, docno):
"""Return certain document.

Parameters
----------
docno : int
Document number.

Returns
-------
`utils.SlicedCorpus`

"""
if self.index is None:
raise RuntimeError("Cannot call corpus[docid] without an index")
if isinstance(docno, (slice, list, numpy.ndarray)):
return utils.SlicedCorpus(self, docno)
elif isinstance(docno, six.integer_types + (numpy.integer,)):
return self.docbyoffset(self.index[docno])
# TODO: no `docbyoffset` method, should be defined in this class
else:
raise ValueError('Unrecognised value for docno, use either a single integer, a slice or a numpy.ndarray')
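The byte-offset indexing idea behind `IndexedCorpus` — the extra `.index` file mapping document number to file position — can be sketched with nothing but file seeks (a toy illustration, not the gensim implementation):

```python
import os
import tempfile

docs = ["first document\n", "second document\n", "third document\n"]

# Write documents sequentially, remembering where each one starts.
fd, path = tempfile.mkstemp()
offsets = []
with os.fdopen(fd, "wb") as f:
    for doc in docs:
        offsets.append(f.tell())
        f.write(doc.encode("utf-8"))

def doc_by_offset(offset):
    # corpus[docno] becomes a seek + readline: O(1) random access
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8")

result = doc_by_offset(offsets[2])
os.remove(path)
print(result)  # third document
```

In gensim the `offsets` list is what `save_corpus` returns and what gets persisted to the `.index` file.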