Add smart information retrieval system for `TfidfModel`. Fix #1785 #1791

markroxor · 2017-12-15T10:43:57Z

For more information check issue #1785.

Tests fail locally because lambda functions cannot be serialized by pickle.
Monkey-patching regular functions in their place is not supported either.

TODO:

Write backward compatibility tests.
Write docstrings for each.


menshikh-iv

Good start! Please be more careful with backward compatibility.

menshikh-iv · 2017-12-15T12:22:11Z

gensim/models/tfidfmodel.py

-    def __init__(self, corpus=None, id2word=None, dictionary=None,
-                 wlocal=utils.identity, wglobal=df2idf, normalize=True):
+    def __init__(self, corpus=None, id2word=None, dictionary=None, smartirs="ntc",
+                 wlocal=None, wglobal=None, normalize=None):


It's better to keep backward compatibility. Why did you change the default values?

A good solution: keep the existing behavior by default with smartirs=None, but if the user sets smartirs, use it (and ignore wlocal, wglobal, ...).

It is backward compatible. Please check the smartirs value. :)

I can do that, but I held off because I removed the df2idf function entirely, since it is redundant. Are you sure I should add it back?
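A minimal sketch of the suggestion above, assuming the old defaults stay in the signature and smartirs is appended as a new keyword defaulting to None. The helper name `resolve_settings` and the stand-ins `identity` and `default_idf` are hypothetical and exist only for illustration; they are not the PR's code.

```python
import math


def identity(x):
    """Stand-in for the existing identity default of wlocal."""
    return x


def default_idf(docfreq, totaldocs):
    """Stand-in for the existing df2idf default (base-2 log of N / df)."""
    return math.log(float(totaldocs) / docfreq, 2)


def resolve_settings(wlocal=identity, wglobal=default_idf, normalize=True, smartirs=None):
    """Hypothetical helper: decide which weighting settings are in effect."""
    if smartirs is None:
        # Legacy path: callers that never pass smartirs keep the old behavior.
        return {"mode": "legacy", "wlocal": wlocal, "wglobal": wglobal, "normalize": normalize}
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("Expected a 3-letter smartirs string, got {!r}".format(smartirs))
    # smartirs wins: wlocal, wglobal and normalize are ignored on this path.
    w_tf, w_df, w_n = smartirs
    return {"mode": "smartirs", "tf": w_tf, "df": w_df, "norm": w_n}
```

With this shape, existing call sites that pass no smartirs argument are unaffected, and positional arguments keep their old order because the new keyword comes last.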

menshikh-iv · 2017-12-15T12:22:51Z

gensim/models/tfidfmodel.py

+    w_tf, w_df, w_n = smartirs
+
+    if w_tf not in 'nlabL':
+        raise ValueError('Expected term frequency weight to be one of nlabL, except got ' + w_tf)


Better to use 'nlabL' instead of nlabL (readability); same for got ..

menshikh-iv · 2017-12-15T12:23:49Z

gensim/models/tfidfmodel.py

+
+        if self.wlocal is None:
+            if n_tf == "n":
+                self.wlocal = lambda tf, mean=None, _max=None: tf


Better to use a plain def (instead of a lambda) to avoid pickle problems (here and everywhere).
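For context, pickle stores a function by its importable name, and a lambda has no such name to look up. The sketch below illustrates the difference; `wlocal_natural`, `wlocal_lambda` and `can_pickle` are hypothetical names used only here.

```python
import pickle


def wlocal_natural(tf):
    """Module-level replacement for `lambda tf: tf`; pickle can refer to it by name."""
    return tf


# Kept only to show the failure mode: its name is "<lambda>", which cannot be looked up.
wlocal_lambda = lambda tf: tf  # noqa: E731


def can_pickle(obj):
    """Return True if pickle.dumps accepts `obj`, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False
```

When this lives in an importable module, `can_pickle(wlocal_natural)` is expected to succeed while `can_pickle(wlocal_lambda)` fails, which is why a model holding lambdas as attributes cannot be saved.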

menshikh-iv · 2017-12-15T12:26:03Z

gensim/models/tfidfmodel.py

@@ -127,11 +162,6 @@ def initialize(self, corpus):

        # and finally compute the idf weights
        n_features = max(dfs) if dfs else 0
-        logger.info(


Why did you remove this?

This showed the progress of the pre-computation step. Now that the values are no longer pre-computed, I think we need to log this in a different way. How would you like it done?

menshikh-iv · 2017-12-15T12:27:45Z

gensim/test/test_sklearn_api.py

@@ -498,7 +498,6 @@ def testPersistence(self):
        original_matrix = self.model.transform(original_bow)
        passed = numpy.allclose(loaded_matrix, original_matrix, atol=1e-1)
        self.assertTrue(passed)
-
    def testModelNotFitted(self):


Need to add more tests (for new functionality)

That is on my checklist, but first I need to get the existing tests passing.

piskvorky · 2017-12-16T17:16:04Z

gensim/models/tfidfmodel.py

+                elif n_tf == "b":
+                    return 1 if tf > 0 else 0
+                elif n_tf == "L":
+                    return (1 + math.log(tf)) / (1 + math.log(mean))


What happens if the value is none of the enumerated options? Will the error make sense to the user?

resolve_weights takes care of that.

piskvorky · 2017-12-16T17:17:45Z

gensim/models/tfidfmodel.py

        vector = [
-            (termid, self.wlocal(tf) * self.idfs.get(termid))
-            for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0
+            (termid, self.wlocal(tf, mean=np.mean(np.array(bow), axis=1), _max=np.max(bow, axis=1)) * self.wglobal(self.dfs[termid], self.num_docs))


This looks wasteful (creating arrays, only to throw them away). What are the performance implications of these changes? Do you have a benchmark before/after?

Changed the approach.
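One way to address the concern above is to compute the per-document statistics once and reuse them for every term, while keeping the idf values in a precomputed lookup. This is only a sketch under assumptions: the helper names and the three-argument `wlocal` signature are hypothetical, not the approach the PR ended up with.

```python
import numpy as np


def doc_tf_stats(bow):
    """Return (mean_tf, max_tf) for one bag-of-words document, computed once."""
    if not bow:
        return 0.0, 0.0
    tfs = np.fromiter((tf for _, tf in bow), dtype=float, count=len(bow))
    return float(tfs.mean()), float(tfs.max())


def augmented_tf(tf, mean_tf, max_tf):
    """SMART 'a' weighting: 0.5 + 0.5 * tf / max_tf."""
    return 0.5 + 0.5 * tf / max_tf


def weight_document(bow, wlocal, idfs):
    """Multiply a local weight by a precomputed idf for each term in `bow`."""
    mean_tf, max_tf = doc_tf_stats(bow)  # one pass, shared by all terms
    return [
        (termid, wlocal(tf, mean_tf, max_tf) * idfs[termid])
        for termid, tf in bow
        if idfs.get(termid, 0.0) != 0.0  # skip terms without a usable idf
    ]
```

For example, `weight_document([(0, 2), (1, 4)], augmented_tf, {0: 1.0, 1: 2.0})` builds a single small array for the document instead of two arrays per term.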

markroxor · 2017-12-16T20:48:35Z

gensim/models/tfidfmodel.py

@@ -77,11 +95,69 @@ def __init__(self, corpus=None, id2word=None, dictionary=None,
        If `dictionary` is specified, it must be a `corpora.Dictionary` object
        and it will be used to directly construct the inverse document frequency
        mapping (then `corpus`, if specified, is ignored).
+
+        `smartirs` or SMART (System for the Mechanical Analysis and Retrieval of Text)


Taken from Wikipedia.

menshikh-iv · 2017-12-19T10:17:04Z

gensim/models/tfidfmodel.py

+    w_tf, w_df, w_n = smartirs
+
+    if w_tf not in 'nlabL':
+        raise ValueError('Expected term frequency weight to be one of \'nlabL\', except got ' + w_tf + '\'')


Nitpick: use raise ValueError("...'nlabL'...") to avoid the backslash escapes.

menshikh-iv · 2017-12-19T10:19:27Z

gensim/models/tfidfmodel.py

    """
-    return add + math.log(1.0 * totaldocs / docfreq, log_base)
+    return add + np.log(float(totaldocs) / docfreq) / np.log(2)


log_base isn't used here.

menshikh-iv · 2017-12-19T10:20:43Z

gensim/models/tfidfmodel.py

-    def __init__(self, corpus=None, id2word=None, dictionary=None,
-                 wlocal=utils.identity, wglobal=df2idf, normalize=True):
+    def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.identity,
+                 wglobal=df2idf, normalize=True, smartirs=None):
        """
        Compute tf-idf by multiplying a local component (term frequency) with a


Can you convert all docstrings in this file to numpy-style, as in my previous comment on #1780?

menshikh-iv · 2017-12-19T10:22:15Z

gensim/models/tfidfmodel.py

        self.num_docs, self.num_nnz, self.idfs = None, None, None
+        self.smartirs = smartirs
+
+        if self.normalize is True:


What happens if self.normalize isn't a bool?

In that case, if smartirs is not None, self.normalize is determined by the smartirs value. If smartirs is None and self.normalize is not a bool, then self.normalize must be a function, just like wlocal and wglobal.
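The rule described in this reply could be made explicit along the following lines. This is a sketch, not the PR's code: `pick_normalizer`, `unit_length` and `keep_as_is` are hypothetical names, and only the normalization letters 'n' and 'c' are handled because those are the ones visible in the diff.

```python
import math


def unit_length(vec):
    """Scale a sparse vector [(id, weight), ...] to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for _, w in vec))
    if norm == 0.0:
        return list(vec)
    return [(i, w / norm) for i, w in vec]


def keep_as_is(vec):
    """Return the sparse vector unchanged (no normalization)."""
    return list(vec)


def pick_normalizer(normalize=True, smartirs=None):
    """Hypothetical helper: choose the normalization function to apply."""
    if smartirs is not None:
        # The third letter of the SMART triple decides; `normalize` is ignored.
        letter = smartirs[2]
        if letter == "c":
            return unit_length
        if letter == "n":
            return keep_as_is
        raise ValueError("Unsupported normalization letter '{}'".format(letter))
    if normalize is True:
        return unit_length
    if normalize is False:
        return keep_as_is
    if callable(normalize):
        return normalize
    raise TypeError("normalize must be a bool or a callable, got {!r}".format(normalize))
```

Raising a TypeError for anything that is neither a bool nor a callable gives the user a clear message instead of a failure later, at transform time.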

menshikh-iv

some tips about docstrings

menshikh-iv · 2017-12-22T04:12:36Z

gensim/test/test_sklearn_api.py

@@ -973,13 +973,13 @@ def testTransform(self):

    def testSetGetParams(self):


Don't forget to add more tests (also check the cases where you pass both smartirs and wlocal, for example).

menshikh-iv · 2017-12-22T05:26:23Z

gensim/models/tfidfmodel.py


 logger = logging.getLogger(__name__)


+def resolve_weights(smartirs):


Docstrings are needed too (for everything here).

I think that "Checks for validity of smartirs parameter." is enough. Do you have anything else in mind?

@markroxor you need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)
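As an illustration of the three requested sections, a numpy-style docstring for this function might look like the sketch below. The body is an assumption: the 'nlabL' check comes from the diff, while the accepted document-frequency letters ("ntp") and normalization letters ("nc") are guesses based on standard SMART notation and the letters visible elsewhere in this thread.

```python
def resolve_weights(smartirs):
    """Check the validity of the `smartirs` parameter.

    Parameters
    ----------
    smartirs : str
        Three-letter SMART notation: term frequency, document frequency
        and normalization letter, in that order (for example "ntc").

    Returns
    -------
    tuple of (str, str, str)
        The term frequency, document frequency and normalization letters.

    Raises
    ------
    ValueError
        If `smartirs` is not a string of length 3, or if any letter is not
        one of the accepted options.

    """
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("Expected a string of length 3, got {!r}".format(smartirs))

    w_tf, w_df, w_n = smartirs

    if w_tf not in "nlabL":
        raise ValueError("Expected term frequency weight to be one of 'nlabL', got '{}'".format(w_tf))
    # Assumption: the accepted letters on the next two checks are illustrative.
    if w_df not in "ntp":
        raise ValueError("Expected inverse document frequency weight to be one of 'ntp', got '{}'".format(w_df))
    if w_n not in "nc":
        raise ValueError("Expected normalization weight to be one of 'nc', got '{}'".format(w_n))

    return w_tf, w_df, w_n
```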

menshikh-iv · 2017-12-22T05:26:56Z

gensim/models/tfidfmodel.py

    # not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
    # this method is here just to speed things up a little.
    return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}


+def wlocal_g(tf, n_tf):  # TODO rename it (to avoid confusion)


Don't forget about renaming

menshikh-iv · 2017-12-22T05:27:12Z

gensim/models/tfidfmodel.py

+        return x
+    elif n_n == "c":
+        return matutils.unitvec(x)
+    # TODO write byte-size normalisation


This needs to be fixed too.

menshikh-iv · 2017-12-22T05:30:53Z

gensim/models/tfidfmodel.py

+                    If `dictionary` is specified, it must be a `corpora.Dictionary` object
+                    and it will be used to directly construct the inverse document frequency
+                    mapping (then `corpus`, if specified, is ignored).
+        wlocals :   user specified function


Instead of

wlocals : user specified function
    Default for `wlocal` is identity (other options: math.sqrt, math.log1p, ...)

should be

wlocals : function, optional
    description of parameter, with links to different implementations (if function defined in gensim, link should be like :func:`~gensim.some.func`)

everywhere

menshikh-iv · 2017-12-22T05:31:43Z

gensim/models/tfidfmodel.py

-        and returns a sparse vector.
+        Parameters
+        ----------
+        corpus :    dictionary.doc2bow


type should be iterable of iterable of (int, int)

menshikh-iv · 2017-12-22T05:32:23Z

gensim/models/tfidfmodel.py

+        id2word :   dict
+                    id2word is an optional dictionary that maps the word_id to a token.
+                    In case id2word isn’t specified the mapping id2word[word_id] = str(word_id) will be used.
+        dictionary :corpora.Dictionary


type should be

:class:`~gensim.corpora.dictionary.Dictionary`

menshikh-iv · 2017-12-22T05:32:52Z

gensim/models/tfidfmodel.py

+                    `normalize=True` means set to unit length (default); `False` means don't
+                    normalize. You can also set `normalize` to your own function that accepts
+                    and returns a sparse vector.
+        smartirs : {'None' ,'str'}


str, optional

menshikh-iv · 2017-12-22T05:33:19Z

gensim/models/tfidfmodel.py

+
+                    for more information visit https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
+
+        Returns


A Returns section is not needed in the __init__ method.

menshikh-iv

Looks good @markroxor 👍

menshikh-iv · 2017-12-25T09:51:06Z

gensim/models/tfidfmodel.py


 logger = logging.getLogger(__name__)


+def resolve_weights(smartirs):


@markroxor you need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)

menshikh-iv · 2017-12-25T09:51:49Z

gensim/models/tfidfmodel.py

    """
-    return add + math.log(1.0 * totaldocs / docfreq, log_base)
+    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)


What's the reason for using np.log instead of math.log?

Consistency: I am using np everywhere. Anyway, I don't see any problem with that. Do you?

No problem, this is only a question :)
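For reference, the two spellings compute the same value for scalars. The practical difference is that the numpy form also accepts arrays of document frequencies. The function names below are hypothetical, and the default values for log_base and add are assumptions made for the sketch.

```python
import math

import numpy as np


def df2idf_math(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Scalar-only variant: math.log takes the base as its second argument."""
    return add + math.log(float(totaldocs) / docfreq, log_base)


def df2idf_numpy(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Variant using the change-of-base identity log_b(x) = ln(x) / ln(b)."""
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
```

Passing `np.array([1, 2, 4])` as docfreq works with `df2idf_numpy` but raises a TypeError with `df2idf_math`, since `math.log` only accepts scalars.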

menshikh-iv · 2017-12-25T09:52:29Z

gensim/models/tfidfmodel.py

+        return (1 + np.log(tf) / np.log(2)) / (1 + np.log(tf.mean(axis=0) / np.log(2)))
+
+
+def updated_wglobal(docfreq, totaldocs, n_df):  # TODO rename it (to avoid confusion)


please remove TODO (about renaming)

menshikh-iv · 2017-12-25T09:55:38Z