Fix documentation for gensim.corpora. Partial fix #1671
#1729
Changes from 47 commits
@@ -5,9 +5,7 @@
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

-"""
-Blei's LDA-C format.
-"""
+"""Corpus in Blei's LDA-C format."""

 from __future__ import with_statement

@@ -19,30 +17,39 @@
 from six.moves import xrange


-logger = logging.getLogger('gensim.corpora.bleicorpus')
+logger = logging.getLogger(__name__)


 class BleiCorpus(IndexedCorpus):
-    """
-    Corpus in Blei's LDA-C format.
+    """Corpus in Blei's LDA-C format.

     The corpus is represented as two files: one describing the documents, and another
     describing the mapping between words and their ids.

     Each document is one line::

         N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN

-    The vocabulary is a file with words, one word per line; word at line K has an implicit `id=K`.
+    The vocabulary is a file with words, one word per line; word at line K has an
+    implicit ``id=K``.
     """

     def __init__(self, fname, fname_vocab=None):
-        """
-        Initialize the corpus from a file.
-
-        `fname_vocab` is the file with vocabulary; if not specified, it defaults to
-        `fname.vocab`.
+        """
+        Parameters
+        ----------
+        fname : str
+            Path to serialized corpus.
+        fname_vocab : str, optional
+            Vocabulary file. If `fname_vocab` is None, searches for the vocab.txt or `fname_vocab`.vocab file.
Review discussion on the `fname_vocab` description:
Review comment: Are you sure it's
Review comment: Yep
Review comment: Not quite, I added a correct description
Review comment: Still don't get it. It should be `fname`.vocab, `fname_vocab`.vocab is undefined!
Review comment: Not quite :) I went through the code with ipdb for this case; this is significantly "wider" than what we discuss here (I already fixed it).
Review comment: Vocabulary file. If
+
+        Raises
+        ------
+        IOError
+            If vocabulary file doesn't exist.
+
+        """
         IndexedCorpus.__init__(self, fname)
         logger.info("loading corpus from %s", fname)

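As an aside (not part of the diff), the vocabulary convention documented above — the word at line K gets the implicit id K — boils down to one `enumerate` call; `vocab_lines` below is a hypothetical stand-in for the contents of a `.vocab` file:

```python
# Hypothetical .vocab file contents: one word per line.
vocab_lines = ["computer\n", "human\n", "interface\n"]

# Word at line K has implicit id=K, exactly as the class docstring says.
id2word = dict(enumerate(word.strip() for word in vocab_lines))
print(id2word)  # {0: 'computer', 1: 'human', 2: 'interface'}
```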
@@ -67,8 +74,13 @@ def __init__(self, fname, fname_vocab=None):
         self.id2word = dict(enumerate(words))

     def __iter__(self):
-        """
-        Iterate over the corpus, returning one sparse vector at a time.
+        """Iterate over the corpus, returning one sparse (BoW) vector at a time.
+
+        Yields
+        ------
+        list of (int, float)
+            Document's BoW representation.
+
         """
         lineno = -1
         with utils.smart_open(self.fname) as fin:

@@ -77,6 +89,19 @@ def __iter__(self):
         self.length = lineno + 1

     def line2doc(self, line):
+        """Convert a line in Blei's LDA-C format to a document (BoW representation).
+
+        Parameters
+        ----------
+        line : str
+            Line in Blei's LDA-C format.
+
+        Returns
+        -------
+        list of (int, float)
+            Document's BoW representation.
+
+        """
         parts = utils.to_unicode(line).split()
         if int(parts[0]) != len(parts) - 1:
             raise ValueError("invalid format in %s: %s" % (self.fname, repr(line)))

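The conversion this docstring describes can be sketched outside gensim in a few lines; `parse_lda_c_line` is a hypothetical standalone helper that mirrors the documented behaviour, not the library's actual method:

```python
def parse_lda_c_line(line):
    # "N id1:val1 ... idN:valN" -> [(id1, val1), ..., (idN, valN)]
    parts = line.split()
    if int(parts[0]) != len(parts) - 1:  # leading N must match the field count
        raise ValueError("invalid LDA-C line: %r" % line)
    return [(int(i), float(v)) for i, v in (p.split(':') for p in parts[1:])]

print(parse_lda_c_line("3 0:1 2:3 5:1"))  # [(0, 1.0), (2, 3.0), (5, 1.0)]
```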
@@ -86,14 +111,28 @@ def line2doc(self, line):

     @staticmethod
     def save_corpus(fname, corpus, id2word=None, metadata=False):
-        """
-        Save a corpus in the LDA-C format.
-
-        There are actually two files saved: `fname` and `fname.vocab`, where
-        `fname.vocab` is the vocabulary file.
+        """Save a corpus in the LDA-C format.
+
+        Notes
+        -----
+        There are actually two files saved: `fname` and `fname.vocab`, where `fname.vocab` is the vocabulary file.
+
+        Parameters
+        ----------
+        fname : str
+            Path to output filename.
Review comment: To output file
+        corpus : iterable of iterable of (int, float)
+            Input corpus
Review comment: Obvious, no additional information provided. There's no need to have descriptions for all arguments. :)
Review comment: Still think that it's not necessary. Also, there's a dot missing at the end of the line.
+        id2word : dict of (str, str), optional
+            Mapping id -> word for `corpus`.
+        metadata : bool, optional
+            THIS PARAMETER WILL BE IGNORED.
+
+        Returns
+        -------
+        list of int
+            Offsets for each line in file (in bytes).
+
-        This function is automatically called by `BleiCorpus.serialize`; don't
-        call it directly, call `serialize` instead.
         """
         if id2word is None:
             logger.info("no word id mapping provided; initializing from corpus")

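A minimal sketch of the save behaviour and return value documented above (one "N id:val ..." line per document, byte offsets returned), assuming a simplified hypothetical `save_lda_c` helper rather than gensim's actual `save_corpus` (which also writes the companion `.vocab` file):

```python
import os
import tempfile

def save_lda_c(fname, corpus):
    # Write each document as "N id1:val1 ... idN:valN\n" and record the
    # byte offset where each line starts, mirroring the documented return.
    offsets = []
    with open(fname, 'wb') as fout:
        for doc in corpus:
            offsets.append(fout.tell())
            fields = ' '.join("%d:%g" % (word_id, value) for word_id, value in doc)
            fout.write(("%d %s\n" % (len(doc), fields)).encode('utf-8'))
    return offsets

corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
path = os.path.join(tempfile.mkdtemp(), 'corpus.lda-c')
offsets = save_lda_c(path, corpus)
print(offsets)  # [0, 10] -- the first line, "2 0:1 2:3\n", is 10 bytes long
```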
@@ -121,8 +160,19 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
         return offsets

     def docbyoffset(self, offset):
-        """
-        Return the document stored at file position `offset`.
+        """Get document corresponding to `offset`,
Review comment: First line of docstring should always end with a dot.
Review comment: The first line should end with a dot.
+        offset can be given from :meth:`~gensim.corpora.bleicorpus.BleiCorpus.save_corpus`.
+
+        Parameters
+        ----------
+        offset : int
+            Position of the document in the file (in bytes).
+
+        Returns
+        -------
+        list of (int, float)
Review comment: Missing parameter description (here and everywhere)
+            Document in BoW format.
+
+        """
         with utils.smart_open(self.fname) as f:
             f.seek(offset)

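The offset mechanics this docstring documents can be illustrated with plain file I/O; `doc_by_offset` below is a hypothetical helper under those assumptions, not the real method:

```python
import os
import tempfile

def doc_by_offset(fname, offset):
    # Seek straight to the byte offset recorded at save time and parse
    # the single LDA-C line found there into a BoW list.
    with open(fname, 'rb') as f:
        f.seek(offset)
        parts = f.readline().decode('utf-8').split()
        return [(int(i), float(v)) for i, v in (p.split(':') for p in parts[1:])]

path = os.path.join(tempfile.mkdtemp(), 'corpus.lda-c')
with open(path, 'wb') as f:
    f.write(b"2 0:1 2:3\n1 1:2\n")  # two documents, starting at offsets 0 and 10

print(doc_by_offset(path, 10))  # [(1, 2.0)]
```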
Review comment: Path to corpus here and in other corpora maybe?