Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cython version of MmReader #1825

Merged
merged 27 commits into from
Feb 21, 2018
Merged

Add Cython version of MmReader #1825

merged 27 commits into from
Feb 21, 2018

Conversation

arlenk
Copy link
Contributor

@arlenk arlenk commented Jan 4, 2018

Very naive version of mmreader in cython.

From some profiling, when simply looping through an uncompressed mm corpus, 60% of the time is spent splitting each line and converting the strings to ints (or floats). For a compressed corpus, about 45% of the time is spent in splitting and converting to ints/floats.

This version simply performs those two steps in cython using sscanf. The rest of the code is pretty much unchanged -- though still compiled in cython for simplicity. This change has a decent impact on performance, especially for uncompressed files. The table below shows some profiling results for looping through a few different corpuses in mm format (the tfidf corpus is used to make sure the code works with an mm file that has floats as well).

num_docs num_terms compressed? time (current) time (new) speed up
enron 40k 28k no 22.3 2.6 8.7
yes 37.3 14.4 2.6
nytimes 300k 100k no 419.3 49.2 8.5
yes 686.2 275.1 2.5
text8 2k 250k no 25.4 2.5 10.1
yes 41.9 17.0 2.5
text8 (tfidf) 2k 250k no 30.0 5.5 5.4
yes 47.8 21.1 2.3

In theory, the rest of the code could be converted into c code as well, though that would require writing c code to handle all the different file types that the mmreader accepts (eg, compressed files) and from profiling results, that portion of the code only accounts for 10% (uncompressed files) to 30% (compressed files) of the time.

The real world impact of this change is much more modest than the results above (unless for some reason you are just looping through a corpus...), but I have seen time savings of 15% in some real world uses of LDA with this change. Obviously the impact any user sees will depend on the size of their corpus and the type of model they are using.

@arlenk arlenk changed the title simple cython version of mmreader [WIP] simple cython version of mmreader Jan 4, 2018
@piskvorky
Copy link
Owner

piskvorky commented Jan 4, 2018

This is great! Speeding up the "basic blocks" (MmCorpus included) is a much needed direction.

@arlenk Are there any cons to this change? What are the disadvantages, trade-offs?

@menshikh-iv shall we keep the old "pure-Python" version, or do we drop the slow versions and keep only the compiled code path? I'm +1 for dropping, now that we offer a good system for binary (pre-compiled) wheels to Gensim users.

@arlenk
Copy link
Contributor Author

arlenk commented Jan 6, 2018

@arlenk Are there any cons to this change? What are the disadvantages, trade-offs?

Here are the cons I can think of:

  • keeping both the python & cython versions means a fair amount of code duplication. Anyone fixing bugs or adding features would have to remember to update both files
  • keeping only the cython version, on the other hand, makes the code less accessible (in my opinion) to end users. Not sure how often people will care about looking at (or changing) the internals of mmreader, but keeping this only in cython will make that harder.
  • dealing with strings in c/cython always makes me worry about encoding issues. Since mm files should only have numbers, though, I can't think of a reason why we would ever run into non ascii files, but maybe others can think potential encoding issues.

@menshikh-iv
Copy link
Contributor

@piskvorky I'm +1 for dropping too (because we have wheels right now), this has no sense to support 2 versions.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work 👍

setup.py Outdated
@@ -256,9 +256,12 @@ def finalize_options(self):
Extension('gensim.models.doc2vec_inner',
sources=['./gensim/models/doc2vec_inner.c'],
include_dirs=[model_dir]),
Extension('gensim.corpora._mmreader',
sources=['./gensim/corpora/_mmreader.c'],
include_dirs=[model_dir]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why model_dir here? This isn't related to gensim/models.


def __init__(self, input, transposed=True):
"""
Initialize the matrix reader.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings (here and everywhere).

logger = logging.getLogger(__name__)


cdef class MmReader(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost same as a pure-python version (sscanf different), maybe something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, nothing else really. It's a very trivial change. Just a change to use sscanf instead of split() and int() and a few type definitions (eg, docid, termid, and previd in iter).

No doubt converting the actual file access to c would further improve performance, but that will require dealing with all the different file types that smart_open accepts. Also, I don't think this is the main hotspot in iter.

For example, here's a line profiler output of the current (develop branch) code when reading a raw text file:

Line #  % Time  Line Contents
==============================
   907     0.0          with utils.file_or_filename(self.input) as lines:
   908     0.0              self.skip_headers(lines)

   911     8.6              for line in lines:
   912    40.4                  docid, termid, val = utils.to_unicode(line).split()  # needed for python3
   913     7.0                  if not self.transposed:
   914                              termid, docid = docid, termid
   915                          # -1 because matrix market indexes are 1-based => convert to 0-based
   916    22.5                  docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
   917     6.7                  assert previd <= docid, "matrix columns must come in ascending order"
   918     6.4                  if docid != previd:

   931
   932     8.4                  document.append((termid, val,))  # add another field to the current document
   933

The majority of time is spent splitting the line and converting the resulting strings to ints (lines 912 and 916). sscanf handles both of these lines. Another 20% or so is taken up by the testing self.transposed and comparing docid and previd (lines 913, 917, and 918). Statically typing these variables in cython helps quite a bt too.

For a compressed text file, line 911 (for line in lines) takes up 33% of the time, but most of the time is still taken by line.split() and int().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the line profile output when reading a compressed file

Line #  % Time  Line Contents
==============================
   907     0.0          with utils.file_or_filename(self.input) as lines:
   908     0.0              self.skip_headers(lines)

   911    31.5              for line in lines:
   912    30.6                  docid, termid, val = utils.to_unicode(line).split()  # needed for python3
   913     5.1                  if not self.transposed:
   914                              termid, docid = docid, termid
   915                          # -1 because matrix market indexes are 1-based => convert to 0-based
   916    17.0                  docid, termid, val = int(docid) - 1, int(termid) - 1, float(val)
   917     4.9                  assert previd <= docid, "matrix columns must come in ascending order"
   918     4.6                  if docid != previd:

   932     6.3                  document.append((termid, val,))  # add another field to the current document

@menshikh-iv
Copy link
Contributor

@arlenk please update develop branch in your fork and merge fresh develop here (this needed for correct doc building).

@arlenk
Copy link
Contributor Author

arlenk commented Jan 10, 2018

@menshikh-iv, i updated my develop branch

If you want to drop the python only version, let me know, and I can update the PR to only include the cython version (perhaps in a gensim.corpora.mmreader module).

@menshikh-iv
Copy link
Contributor

@arlenk great, yes, please remove pure python version, and this LGTM, wdyt @piskvorky?

@piskvorky
Copy link
Owner

piskvorky commented Jan 11, 2018

What are the compatibility implications of this? Any changes needed in existing user code? Does it support the same range of inputs (compressed etc)?

@arlenk
Copy link
Contributor Author

arlenk commented Jan 12, 2018

What are the compatibility implications of this?

I don't think there are any compatibility implications. It should be a drop in replacement for the current mmreader.

Any changes needed in existing user code? Does it support the same range of inputs (compressed etc)?

Yes, the actual opening of the file and looping through the rows is still handled by python/smart_open. The only part that is in c is "splitting" each line and converting the values to ints/floats.

I added a few more unittests for MmCorpus as well, including one with a compressed file.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, problem is

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "gensim/__init__.py", line 6, in <module>
    from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils  # noqa:F401
  File "gensim/corpora/__init__.py", line 8, in <module>
    from .mmcorpus import MmCorpus  # noqa:F401
  File "gensim/corpora/mmcorpus.py", line 17, in <module>
    from gensim.corpora.mmreader import MmReader
ImportError: No module named mmreader

I'm not sure, but maybe "incule_dir" solve this problem.

setup.py Outdated
@@ -256,9 +256,11 @@ def finalize_options(self):
Extension('gensim.models.doc2vec_inner',
sources=['./gensim/models/doc2vec_inner.c'],
include_dirs=[model_dir]),
Extension('gensim.corpora.mmreader',
sources=['./gensim/corpora/mmreader.c']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, "include_dirs" should be gensim/corpora

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @menshikh-iv . It looks like it is failing on the bold line below in the tox config:

python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)"
python setup.py build_ext --inplace
python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)"

This tries to import word2vec before any of the cython modules are compiled. word2vec imports gensim.corpor, which import mmreader. Since I took out the python version of mmreader, this line is failing as the cython version isn't built yet (this happens on the next line of the tox config).

For now, I removed the bold line to confirm the build works. If the plan is to only include the c/cython versions of code, I am not sure if the bold line above is still necessary. If it is still necessary, then I can add a try/except to gensim.corpora.init, though that would only be used for this line in the tox file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arlenk aha, this means that now we can't use FAST_VERSION freely, this isn't good.

Probably, we need to move this constant (because this needed for checking that all compiled successfully), to a zone where we didn't receive an exception if something goes wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added back the FAST_VERSION check to the tox config, so the config is the same as before.

To make the test run, I had to add a try/except block to gensim.corpora.init as that file tries to import MmCorpus. Since this check happens before the c-extension is built, MmCorpus fails because there is no pure python MmReader any more. The except block in the init file just creates a dummy MmCorpus class so the code "works" (ie, the imports don't fail, but obviously the dummy classes will not work if used).

This is not a very elegant solution, but I can't think of a better option if that check in the tox file is necessary. Maybe there is a different way to check if the compilation step failed? Also, is the compilation step still necessary with a wheel distribution?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is not a good solution.
Compilation step is a mandatory step for wheels, I agree, but we also have repo & developers who work with it.

As alternative, I propose to replace import in tox.ini to something like this from gensim.models.word2vec_inner import FAST_VERSION, i.e. direct import, this will be helpful for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As alternative, I propose to replace import in tox.ini to something like this from gensim.models.word2vec_inner import FAST_VERSION, i.e. direct import, this will be helpful for this case?

That's a good idea, but unfortunately that fails too. Importing anything under gensim.* implicitly runs gensim/init.py, which imports gensim.corpora, which imports mmreader (which doesn't exist yet at this point in the tox process).

Do you know what this line in the tox config is meant to test?

python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)"
python setup.py build_ext --inplace
python -c "from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)"
pytest {posargs:gensim/test}

It seems like it tests whether all the "pure python" code installed correctly, but I wonder if there is a different way to test that. Or is that test necessary since either the following line (the setup.py line) or the pytest line will fail if the install was not good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what this line in the tox config is meant to test?

I added it when I write the first version of tox.ini for sanity checking only

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to remove this (and also remove try/except block in python file)

Copy link
Contributor Author

@arlenk arlenk Jan 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know what this line in the tox config is meant to test?

I added it when I write the first version of tox.ini for sanity checking only

ah, ok, that makes sense. I will remove this line for now and drop the try/except.

thanks

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 10, 2018

Hello @arlenk, sorry for waiting, can you make last changes

  • Merge fresh changes from develop
  • Return python version of MmReader (that will be used for backward compatibility until major release, will be dropped before next major).
  • Return FAST_VERSION first check too (same reason).

and PR will be ready to merge.

@@ -15,7 +15,7 @@
from gensim import utils
from gensim.corpora import Dictionary
from gensim.corpora import IndexedCorpus
from gensim.matutils import MmReader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move MmReader to matutils (yes, this is wired, but needed for backward compatibility). We'll move it to corpora with major release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok done. I kept the cython version as "gensim.corpora._mmreader". Let me know if you want this renamed to gensim.corpora.mmreader or something else.

@@ -256,9 +256,11 @@ def finalize_options(self):
Extension('gensim.models.doc2vec_inner',
sources=['./gensim/models/doc2vec_inner.c'],
include_dirs=[model_dir]),
Extension('gensim.corpora._mmreader',
sources=['./gensim/corpora/_mmreader.c']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include_dirs no needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I don't think it's needed. The other pyx/c files include model_dir because they reference each other, but this module doesn't need to reference any of the other cython files.

# try to load fast, cythonized code if possible
from gensim.corpora._mmreader import MmReader
FAST_VERSION = 1
logger.info('Fast version of MmReader is being used')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to log anything here (and after, because this will be shown for each import of gensim at all (because we import all stuff in __init__.py)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will remove

try:
# try to load fast, cythonized code if possible
from gensim.corpora._mmreader import MmReader
FAST_VERSION = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need FAST_VERSION here at all (this is connected with BLAS stuff, that isn't used here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good will remove this too

@@ -19,21 +19,78 @@
logger = logging.getLogger('gensim.corpora.mmcorpus')


class MmCorpus(matutils.MmReader, IndexedCorpus):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this try/except section to matutils.py and restore all imports (this is simplest and clearest way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried moving the import to matutils, but it causes some issues with circular imports.

For example, if I put the following in matutils.py:

try:
    # try loading cython version of mmreader
    from gensim.corpora._mmreader import MmReader
except ImportError:
    # else fallback to python version
    class MmReader(object):
        pass

When the try block tries to import gensim.corpora._mmreader, that executes gensim.corpora.__init__, which tries to import gensim.corpora.mmcorpus.MmCorpus, which unfortunately tries to import matutils.MmReader, which isn't defined yet (as we are still in the try block).

You end up with a exception like:

----> 1 import gensim

~/Projects/gensim/gensim-mmreader/gensim/__init__.py in <module>()
      4 """
      5
----> 6 from gensim import parsing, matutils, interfaces, corpora, models, similarities, summarization, utils  # noqa:F401
      7 import logging
      8

~/Projects/gensim/gensim-mmreader/gensim/matutils.py in <module>()
     33 try:
     34     # try loading cython version of mmreader
---> 35     from gensim.corpora._mmreader import MmReader
   
~/Projects/gensim/gensim-mmreader/gensim/corpora/__init__.py in <module>()
      6 from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
      7
----> 8 from .mmcorpus import MmCorpus  # noqa:F401
      9 from .bleicorpus import BleiCorpus  # noqa:F401
   
~/Projects/gensim/gensim-mmreader/gensim/corpora/mmcorpus.py in <module>()
     21
---> 22 class MmCorpus(matutils.MmReader, IndexedCorpus):
     23     """
   
AttributeError: module 'gensim.matutils' has no attribute 'MmReader'

I couldn't think of an easy way around this, short of moving _mmreader out of the corpora package, but I'll keep playing around with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's really sad, I understand what happens :( To avoid this, we need to move _mmreader from corpora (but I have no idea what's better place for it).

pass

def test_load(self):
self.assertRaises(ValueError, lambda: [doc for doc in self.corpus])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't catch, what is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you asking specifically about TestMmCorpusCorrupt? That's a test to confirm the code will raise an exception if the input file is malformed. The new code checks that sscanf reads in 3 values, so I just test that and raise an exception if fewer values are read.

As for the other tests, I added tests for:

  • the two compressed formats (gzip and bz2)
  • to make sure you can access docs by index for a corpus with an index
  • to check the values read back match what we expect (mostly, I was worried about floats getting truncated to ints)

@@ -215,10 +215,10 @@ def _get_slice(corpus, slice_):
self.assertRaises(RuntimeError, _get_slice, corpus_, 1.0)


class TestMmCorpus(CorpusTestCase):
class TestMmCorpusWithIndex(CorpusTestCase):
Copy link
Contributor

@menshikh-iv menshikh-iv Feb 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's about backward compatibility tests? Or we have no problems here because we change the only reader?

Copy link
Contributor Author

@arlenk arlenk Feb 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that all the tests pass with the existing version of MmReader as well, so there shouldn't be any backward compatibility issues.

@menshikh-iv menshikh-iv changed the title [WIP] simple cython version of mmreader Add Cython version of Mmreader Feb 21, 2018
@menshikh-iv menshikh-iv changed the title Add Cython version of Mmreader Add Cython version of MmReader Feb 21, 2018
@menshikh-iv
Copy link
Contributor

@CLearERR one more task for you (documenting), add this to "docsim PR", you need to go through files & cleanup docstrings after I merge this PR:

  • gensim/corpora/_mmreader.pyx
  • gensim/corpora/mmcorpus.py
  • gensim/matutils.py (MmReader & MmWriter)

Be accurate:

  • Don't forget to fetch fresh upstream with current changes
  • Don't forget to check how all rendered (especially for cython part).

@menshikh-iv
Copy link
Contributor

@arlenk great work! I fixed import-error, will merge this after CI if all fine, let's return to LDA cythonization PR.

@menshikh-iv menshikh-iv merged commit c3f08c1 into piskvorky:develop Feb 21, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
* simple cython version of mmreader

* confirm MmCorpus read correct values from corpus file

* remove unnecessary include dirs for _mmreader.c

* add numpy-style docstrings

* update conversion specifier for doubles for sscanf

* drop plain python version of mmreader

* additional unittests for mmcorpus

* add gzipped mmcorpus unittest

* add bzip compressed test corpus

* remove tox test for non-compiled word2vec

* add back tox test for non-compiled word2vec

* add dummy classes for tox's non-compiled word2vec check

* fix ordering of imports for gensim.corpora

* remove tox test for non-compiled code

* update MANIFEST to include cython mmreader

* add back python only version of MmReader

* mend

* mend

* move python version of MmReader back to matutils

* fix import problems

* fix import order
@arlenk arlenk deleted the mmreader branch February 23, 2018 02:17
@menshikh-iv
Copy link
Contributor

@arlenk great work, thank you very much 🔥!

@piskvorky
Copy link
Owner

@arlenk how did you create the text8 .mm corpus? Can you please share it? (or the code for how you generated it from the raw text8)

@arlenk
Copy link
Contributor Author

arlenk commented Mar 2, 2018

@arlenk how did you create the text8 .mm corpus? Can you please share it? (or the code for how you generated it from the raw text8)

Sure, here are the files (I had to gzip the index file as well so that github would allow me to attach it).
text8.mm.gz
text8.mm.index.gz

Here's the code I used for creating the files from the original text8 corpus

import gensim.downloader as api
import gensim.corpora as gc

import numpy as np
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

np.seterr(all='raise')

# load data
_corpus_path = '~/corpora/'

def save_corpus(name='text8'):
    texts = api.load(name)

    print("creating dictionary")
    dictionary = gc.Dictionary([[ word.lower() for word in text ] for text in texts ])
    outfile = os.path.join(_corpus_path, "{}.dict".format(name))
    print("saving dictionary to: {}".format(outfile))
    dictionary.save(outfile)

    print("converting to corpus")
    corpus = [ dictionary.doc2bow([ word.lower() for word in text]) for text in texts ]
    outfile = os.path.join(_corpus_path, "{}.mm".format(name))
    print("saving corpus to {}".format(outfile))
    gc.MmCorpus.serialize(outfile, corpus)

@arlenk
Copy link
Contributor Author

arlenk commented Mar 2, 2018

and here's the code for the tfidf version. I created this version just to make sure there were no issues with floating point numbers in the input file.

import gensim.models as gm
import gensim.corpora as gc
corpus = gc.MmCorpus('~/corpora/text8.mm',)
tfidf = gm.TfidfModel(corpus)
corpus_tfidf = [ tfidf[doc] for doc in corpus ]
gc.MmCorpus.serialize('~/corpora/text8.tfidf.mm', corpus_tfidf)

@piskvorky
Copy link
Owner

Thanks a lot @arlenk !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants