Refactor API reference gensim.corpora. Partial fix #1671 #1835

Merged
merged 25 commits into from
Feb 9, 2018
6 changes: 3 additions & 3 deletions docs/src/corpora/lowcorpus.rst
@@ -1,8 +1,8 @@
:mod:`corpora.lowcorpus` -- Corpus in List-of-Words format
===========================================================
:mod:`corpora.lowcorpus` -- Corpus in GibbsLda++ format
=======================================================

.. automodule:: gensim.corpora.lowcorpus
:synopsis: Corpus in List-of-Words format
:synopsis: Corpus in GibbsLda++ format
:members:
:inherited-members:
:undoc-members:
Expand Down
6 changes: 3 additions & 3 deletions docs/src/corpora/malletcorpus.rst
@@ -1,8 +1,8 @@
:mod:`corpora.malletcorpus` -- Corpus in Mallet format of List-Of-Words.
========================================================================
:mod:`corpora.malletcorpus` -- Corpus in Mallet format
======================================================

.. automodule:: gensim.corpora.malletcorpus
:synopsis: Corpus in Mallet format of List-Of-Words.
:synopsis: Corpus in Mallet format.
:members:
:inherited-members:
:undoc-members:
Expand Down
6 changes: 3 additions & 3 deletions docs/src/corpora/textcorpus.rst
@@ -1,8 +1,8 @@
:mod:`corpora.textcorpus` -- Building corpora with dictionaries
=================================================================
:mod:`corpora.textcorpus` -- Tools for building corpora with dictionaries
=========================================================================

.. automodule:: gensim.corpora.textcorpus
:synopsis: Building corpora with dictionaries
:synopsis: Tools for building corpora with dictionaries
:members:
:inherited-members:
:undoc-members:
Expand Down
6 changes: 3 additions & 3 deletions docs/src/corpora/ucicorpus.rst
@@ -1,8 +1,8 @@
:mod:`corpora.ucicorpus` -- Corpus in UCI bag-of-words format
==============================================================================================================
:mod:`corpora.ucicorpus` -- Corpus in UCI format
================================================

.. automodule:: gensim.corpora.ucicorpus
:synopsis: Corpus in University of California, Irvine (UCI) bag-of-words format
:synopsis: Corpus in UCI format
:members:
:inherited-members:
:undoc-members:
Expand Down
155 changes: 126 additions & 29 deletions gensim/corpora/lowcorpus.py
Expand Up @@ -5,9 +5,7 @@
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


"""
Corpus in GibbsLda++ format of List-Of-Words.
"""
"""Corpus in `GibbsLda++ format <http://gibbslda.sourceforge.net/>`_."""

from __future__ import with_statement

Expand All @@ -19,48 +17,79 @@
from six.moves import xrange, zip as izip


logger = logging.getLogger('gensim.corpora.lowcorpus')
logger = logging.getLogger(__name__)


def split_on_space(s):
"""Split line by spaces, used in :class:`gensim.corpora.lowcorpus.LowCorpus`.

Parameters
----------
s : str
Some line.

Returns
-------
list of str
List of tokens from `s`.

"""
return [word for word in utils.to_unicode(s).strip().split(' ') if word]
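
For reviewers who want to check the tokenizer's behaviour without installing gensim, here is a minimal stand-in. The name `split_on_space_sketch` is hypothetical, and the explicit `bytes` decoding is an assumption standing in for `utils.to_unicode`; it is a sketch of the one-liner above, not the library function itself.

```python
def split_on_space_sketch(line):
    """Hypothetical stand-in for split_on_space (not the gensim function)."""
    # Assumption: utils.to_unicode decodes UTF-8 bytes and passes str through.
    if isinstance(line, bytes):
        line = line.decode("utf8")
    # Strip surrounding whitespace, split on single spaces, drop empty tokens
    # produced by runs of consecutive spaces.
    return [word for word in line.strip().split(" ") if word]


tokens = split_on_space_sketch(b"  human  interface computer\n")
print(tokens)

# Only the space character separates tokens; a tab stays inside its token.
print(split_on_space_sketch("a\tb c"))
```

Because the split is on `' '` only, input that uses tabs or other whitespace as separators would need a custom `line2words` callable.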


class LowCorpus(IndexedCorpus):
"""
List_Of_Words corpus handles input in GibbsLda++ format.
"""Corpus that handles input in `GibbsLda++ format <http://gibbslda.sourceforge.net/>`_.

Quoting http://gibbslda.sourceforge.net/#3.2_Input_Data_Format::
**Format description**

Both data for training/estimating the model and new data (i.e., previously
unseen data) have the same format as follows:
Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format
as follows ::

[M]
[document1]
[document2]
...
[documentM]

in which the first line is the total number for documents [M]. Each line
after that is one document. [documenti] is the ith document of the dataset
that consists of a list of Ni words/terms.
in which the first line is the total number for documents [M]. Each line after that is one document.
[documenti] is the ith document of the dataset that consists of a list of Ni words/terms ::

[documenti] = [wordi1] [wordi2] ... [wordiNi]

in which all [wordij] (i=1..M, j=1..Ni) are text strings and they are separated
by the blank character.
in which all [wordij] (i=1..M, j=1..Ni) are text strings and they are separated by the blank character.

Examples
--------
>>> from gensim.test.utils import datapath, get_tmpfile, common_texts
>>> from gensim.corpora import LowCorpus
>>> from gensim.corpora import Dictionary
>>>
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>>
>>> # Write corpus in GibbsLda++ format to disk
>>> output_fname = get_tmpfile("corpus.low")
>>> LowCorpus.serialize(output_fname, corpus, dictionary)
>>>
>>> # Read corpus
>>> loaded_corpus = LowCorpus(output_fname)

"""
def __init__(self, fname, id2word=None, line2words=split_on_space):
"""
Initialize the corpus from a file.

`id2word` and `line2words` are optional parameters.
If provided, `id2word` is a dictionary mapping between word_ids (integers)
and words (strings). If not provided, the mapping is constructed from
the documents.
Parameters
----------
fname : str
Path to file in GibbsLda++ format.
id2word : {dict of (int, str), :class:`~gensim.corpora.dictionary.Dictionary`}, optional
Mapping between word_ids (integers) and words (strings).
If not provided, the mapping is constructed directly from `fname`.
line2words : callable, optional
Function which converts lines (str) into tokens (list of str),
using :func:`~gensim.corpora.lowcorpus.split_on_space` as default.

`line2words` is a function which converts lines into tokens. Defaults to
simple splitting on spaces.
"""
IndexedCorpus.__init__(self, fname)
logger.info("loading corpus from %s", fname)
Expand Down Expand Up @@ -91,6 +120,14 @@ def __init__(self, fname, id2word=None, line2words=split_on_space):
)

def _calculate_num_docs(self):
"""Get number of documents in file.

Returns
-------
int
Number of documents.

"""
# the first line in input data is the number of documents (integer). throws exception on bad input.
with utils.smart_open(self.fname) as fin:
try:
Expand All @@ -104,6 +141,19 @@ def __len__(self):
return self.num_docs

def line2doc(self, line):
"""Convert line into document in BoW format.

Parameters
----------
line : str
Line from input file.

Returns
-------
list of (int, int)
Document in BoW format.

"""
words = self.line2words(line)

if self.use_wordids:
Expand Down Expand Up @@ -132,8 +182,13 @@ def line2doc(self, line):
return doc
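
The body of `line2doc` is collapsed in this diff, so the following is only an illustrative sketch of what "document in BoW format" means for the documented return type `list of (int, int)`. The helper `line_to_bow` and the `word2id` mapping are hypothetical names, and dropping unknown words is an assumption, not a statement about the actual implementation.

```python
from collections import Counter


def line_to_bow(words, word2id):
    """Hypothetical helper: turn tokens into sorted (word_id, count) pairs."""
    # Assumption: words missing from the mapping are silently skipped.
    counts = Counter(word2id[w] for w in words if w in word2id)
    return sorted(counts.items())


word2id = {"computer": 0, "human": 1}
bow = line_to_bow(["human", "computer", "human", "unknown"], word2id)
print(bow)
```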

def __iter__(self):
"""
Iterate over the corpus, returning one bag-of-words vector at a time.
"""Iterate over the corpus.

Yields
------
list of (int, int)
Document in BoW format.

"""
with utils.smart_open(self.fname) as fin:
for lineno, line in enumerate(fin):
Expand All @@ -142,11 +197,31 @@ def __iter__(self):

@staticmethod
def save_corpus(fname, corpus, id2word=None, metadata=False):
"""
Save a corpus in the List-of-words format.
"""Save a corpus in the GibbsLda++ format.

Warnings
--------
This function is automatically called by :meth:`gensim.corpora.lowcorpus.LowCorpus.serialize`.
Don't call it directly; call :meth:`gensim.corpora.lowcorpus.LowCorpus.serialize` instead.

Parameters
----------
fname : str
Path to output file.
corpus : iterable of iterable of (int, int)
Corpus in BoW format.
id2word : {dict of (int, str), :class:`~gensim.corpora.dictionary.Dictionary`}, optional
Mapping between word_ids (integers) and words (strings).
If not provided, the mapping is constructed directly from `corpus`.
metadata : bool, optional
THIS PARAMETER WILL BE IGNORED.

Returns
-------
list of int
List of offsets in resulting file for each document (in bytes),
can be used for :meth:`~gensim.corpora.lowcorpus.LowCorpus.docbyoffset`.

This function is automatically called by `LowCorpus.serialize`; don't
call it directly, call `serialize` instead.
"""
if id2word is None:
logger.info("no word id mapping provided; initializing from corpus")
Expand Down Expand Up @@ -174,15 +249,37 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
return offsets

def docbyoffset(self, offset):
"""
Return the document stored at file position `offset`.
"""Get the document stored in the file at position `offset`.

Parameters
----------
offset : int
Offset (in bytes) of the beginning of the document.

Returns
-------
list of (int, int)
Document in BoW format.

Examples
--------
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import LowCorpus
>>>
>>> data = LowCorpus(datapath("testcorpus.low"))
>>> data.docbyoffset(1) # end of first line
[]
>>> data.docbyoffset(2) # start of second line
[(0, 1), (3, 1), (4, 1)]

"""
with utils.smart_open(self.fname) as f:
f.seek(offset)
return self.line2doc(f.readline())
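
To make the relationship between the offsets returned by `save_corpus` and the lookup in `docbyoffset` concrete, here is a standard-library sketch of the same seek-and-readline pattern. It writes a tiny file laid out as in the format description (document count on the first line, one document per line) and records a byte offset before each document. The helper `read_line_at`, the sample documents, and the assumption that offsets are captured with `tell()` just before each write are illustrative and not taken from the collapsed parts of the diff.

```python
import os
import tempfile

docs = [["human", "interface", "computer"], ["survey", "user", "computer"]]

path = os.path.join(tempfile.mkdtemp(), "corpus.low")
offsets = []
with open(path, "wb") as fout:
    # First line: total number of documents.
    fout.write(("%d\n" % len(docs)).encode("utf8"))
    for words in docs:
        # Assumption: remember where each document line starts, in bytes.
        offsets.append(fout.tell())
        fout.write((" ".join(words) + "\n").encode("utf8"))


def read_line_at(fname, offset):
    """Hypothetical helper: jump to a byte offset and tokenize one line."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").strip().split(" ")


print(offsets[0])  # the header "2\n" occupies two bytes
print(read_line_at(path, offsets[1]))
```

With a single-digit document count the first document starts at byte 2, which matches the `docbyoffset(2)` call in the docstring example above.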

@property
def id2word(self):
"""Get mapping between words and their ids."""
return self._id2word

@id2word.setter
Expand Down