Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add smart information retrieval system for TfidfModel. Fix #1785 #1791

Merged
merged 34 commits into from
Jan 15, 2018

Conversation

markroxor
Copy link
Contributor

@markroxor markroxor commented Dec 15, 2017

For more information check issue #1785.

Tests have failed locally because lambda functions cannot be serialized by pickle.
Can't monkey patch regular functions either as it is not supported either.

TODO:

  • write backward compatibility tests.
  • Write docs-strings for each.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start! Please be careful with more backward compatibility

def __init__(self, corpus=None, id2word=None, dictionary=None,
wlocal=utils.identity, wglobal=df2idf, normalize=True):
def __init__(self, corpus=None, id2word=None, dictionary=None, smartirs="ntc",
wlocal=None, wglobal=None, normalize=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to support backward compatibility, why you change default values?

Good solution - by default support default behavior and smartirs=None, but if user set smartirs - use it (and ignore wlocal, wglobal, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is backward compatible. Please check the smartirs value. :)
  • This can be followed but I refrained because I removed df2idf function totally because it is redundant. Are you sure I should add it back?

w_tf, w_df, w_n = smartirs

if w_tf not in 'nlabL':
raise ValueError('Expected term frequency weight to be one of nlabL, except got ' + w_tf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better use 'nlabL' instead nlabL (readability), same for got ..


if self.wlocal is None:
if n_tf == "n":
self.wlocal = lambda tf, mean=None, _max=None: tf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use simple define (instead of lambda) for avoiding pickle problems (here and everywhere)

@@ -127,11 +162,6 @@ def initialize(self, corpus):

# and finally compute the idf weights
n_features = max(dfs) if dfs else 0
logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why you remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This showed the progress of the pre - computation step. Now that the values are no longer being pre-computed I think we need to log this in a different way. How do you want it to go?

@@ -498,7 +498,6 @@ def testPersistence(self):
original_matrix = self.model.transform(original_bow)
passed = numpy.allclose(loaded_matrix, original_matrix, atol=1e-1)
self.assertTrue(passed)

def testModelNotFitted(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add more tests (for new functionality)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have that in my checklist but before that I need to pass the already present tests.

elif n_tf == "b":
return 1 if tf > 0 else 0
elif n_tf == "L":
return (1 + math.log(tf)) / (1 + math.log(mean))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the value is none of the enumerated options? Will the error make sense to the user?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve_weights take care of that.

vector = [
(termid, self.wlocal(tf) * self.idfs.get(termid))
for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0
(termid, self.wlocal(tf, mean=np.mean(np.array(bow), axis=1), _max=np.max(bow, axis=1)) * self.wglobal(self.dfs[termid], self.num_docs))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wasteful (creating arrays, only to throw them away). What are the performance implications of these changes? Do you have a benchmark before/after?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the approach.

@@ -77,11 +95,69 @@ def __init__(self, corpus=None, id2word=None, dictionary=None,
If `dictionary` is specified, it must be a `corpora.Dictionary` object
and it will be used to directly construct the inverse document frequency
mapping (then `corpus`, if specified, is ignored).

`smartirs` or SMART (System for the Mechanical Analysis and Retrieval of Text)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken from wikipedia.

w_tf, w_df, w_n = smartirs

if w_tf not in 'nlabL':
raise ValueError('Expected term frequency weight to be one of \'nlabL\', except got ' + w_tf + '\'')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: you should use raise ValueError("...'nlabL'...") for avoiding \.

"""
return add + math.log(1.0 * totaldocs / docfreq, log_base)
return add + np.log(float(totaldocs) / docfreq) / np.log(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_base doesn't used here

def __init__(self, corpus=None, id2word=None, dictionary=None,
wlocal=utils.identity, wglobal=df2idf, normalize=True):
def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.identity,
wglobal=df2idf, normalize=True, smartirs=None):
"""
Compute tf-idf by multiplying a local component (term frequency) with a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you convert all docstrings in this file to numpy-style, according to my previous comment #1780 (comment)

self.num_docs, self.num_nnz, self.idfs = None, None, None
self.smartirs = smartirs

if self.normalize is True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happen if self.normalize isn't bool?

Copy link
Contributor Author

@markroxor markroxor Dec 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case if smartirs is not None, self.normalize will be defined by the values of smartirs. If smartirs is None and self.normalize is not a bool then self.normalize must be a function. Just like wlocal and wglobal.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some tips about docstrings

@@ -973,13 +973,13 @@ def testTransform(self):

def testSetGetParams(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add more tests (also, check situations, when you pass smartirs and wlocal for example)


logger = logging.getLogger(__name__)


def resolve_weights(smartirs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstrings needed too (for all stuff here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that Checks for validity of smartirs parameter. is enough. Do you have anything else in mind as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)

# not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
# this method is here just to speed things up a little.
return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}


def wlocal_g(tf, n_tf): # TODO rename it (to avoid confusion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about renaming

return x
elif n_n == "c":
return matutils.unitvec(x)
# TODO write byte-size normalisation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix it too

If `dictionary` is specified, it must be a `corpora.Dictionary` object
and it will be used to directly construct the inverse document frequency
mapping (then `corpus`, if specified, is ignored).
wlocals : user specified function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of

        wlocals :   user specified function
                         Default for `wlocal` is identity (other options: math.sqrt, math.log1p, ...)

should be

        wlocals : function, optional
            description of parameter, with links to different implementations (if function defined in gensim, link should be like :func:`~gensim.some.func`)

everywhere

and returns a sparse vector.
Parameters
----------
corpus : dictionary.doc2bow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type should be iterable of iterable of (int, int)

id2word : dict
id2word is an optional dictionary that maps the word_id to a token.
In case id2word isn’t specified the mapping id2word[word_id] = str(word_id) will be used.
dictionary :corpora.Dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type should be

:class:`~gensim.corpora.dictionary.Dictionary`

`normalize=True` means set to unit length (default); `False` means don't
normalize. You can also set `normalize` to your own function that accepts
and returns a sparse vector.
smartirs : {'None' ,'str'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str, optional


for more information visit https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System

Returns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns no needed in __init__ method

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @markroxor 👍


logger = logging.getLogger(__name__)


def resolve_weights(smartirs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)

"""
return add + math.log(1.0 * totaldocs / docfreq, log_base)
return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a reason to use np.log instead of math.log?

Copy link
Contributor Author

@markroxor markroxor Dec 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency, I am using np everywhere. Anyways I do'nt see any problem with that. Do you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem this is only a question :)

return (1 + np.log(tf) / np.log(2)) / (1 + np.log(tf.mean(axis=0) / np.log(2)))


def updated_wglobal(docfreq, totaldocs, n_df): # TODO rename it (to avoid confusion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove TODO (about renaming)

# nnn
model = tfidfmodel.TfidfModel(self.corpus, smartirs='nnn')
transformed_docs = [model[docs[0]], model[docs[1]]]
expected_docs = [[(3, 2), (4, 2), (5, 3), (6, 2), (7, 3), (8, 2)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use hanging indents instead of vertical (here and everywhere)

model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst'))
self.assertTrue(model3.idfs == model4.idfs)
tstvec = []
self.assertTrue(np.allclose(model3[tstvec], model4[tstvec])) # try projecting an empty vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check non-empty case too

model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2'))
self.assertTrue(model3.idfs == model4.idfs)
tstvec = []
self.assertTrue(np.allclose(model3[tstvec], model4[tstvec])) # try projecting an empty vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as in testPersistence


# Test persistence between old and new compressed model.
model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be better if you add gensim version in model name ("old" in comment isn't really informative)

@@ -58,6 +74,13 @@ def testPersistence(self):
tstvec = []
self.assertTrue(np.allclose(model[tstvec], model2[tstvec])) # try projecting an empty vector

# Test persistence between old and new model.
model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, need to test than models with smartirs save/load correctly

expected_docs = [model[docs[0]], model[docs[1]]]

self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
# endclass TestTfidfModel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this line # ...

@markroxor
Copy link
Contributor Author

@markroxor need to add "Parameters" (type, description), "Raises" (type, reason), "Returns" (type, description)

I don't think that it is required. As this is part of docstrings and this function is implicitly used and can not be invoked by the user.

Are you sure I should add docstrings?

@menshikh-iv
Copy link
Contributor

@markroxor I understand what you mean, very desirable add docstrings for all stuff (this isn't only for public-api users, this is for contributors too), this isn't critical as public API, but desirable.

@@ -272,7 +272,7 @@ def __getitem__(self, bow, eps=1e-12):

vector = [
(termid, tf * self.idfs.get(termid))
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) != 0.0
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) > eps
Copy link
Owner

@piskvorky piskvorky Dec 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing in absolute value safer (easier to read and comprehend intent; plus less room for error).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a discussion with @menshikh-iv stating otherwise. @menshikh-iv I am reverting this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky I did the same here, but you asked me to revert it.

@markroxor markroxor changed the title Add smart information retrieval system for TFIDF fix #1785 Add smart information retrieval system for TFIDF fix #1785 [MRG] Dec 28, 2017
(termid, self.wlocal(tf) * self.idfs.get(termid))
for termid, tf in bow if self.idfs.get(termid, 0.0) != 0.0
(termid, tf * self.idfs.get(termid))
for termid, tf in zip(termid_array, tf_array) if self.idfs.get(termid, 0.0) != 0.0
Copy link
Owner

@piskvorky piskvorky Jan 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is brittle; better compare floats for equality using eps instead. For our purposes, two floats are "equal" if their absolute difference is smaller than eps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, #1791 (comment) and #1791 (comment) contradict each other, what you mean @piskvorky?

Copy link
Owner

@piskvorky piskvorky Jan 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compare floats for equality, use abs(x - y) < eps. Not x == y, nor x - y < eps.

@menshikh-iv menshikh-iv changed the title Add smart information retrieval system for TFIDF fix #1785 [MRG] Add smart information retrieval system for TfidfModel. Fix #1785 [MRG] Jan 11, 2018
@menshikh-iv menshikh-iv changed the title Add smart information retrieval system for TfidfModel. Fix #1785 [MRG] Add smart information retrieval system for TfidfModel. Fix #1785 Jan 11, 2018
@menshikh-iv menshikh-iv merged commit 0bfb9da into piskvorky:develop Jan 15, 2018
@markroxor markroxor deleted the smartirs branch January 15, 2018 17:36
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
…y#1785 (piskvorky#1791)

* fixing appveyor

* verify weights

* verify weights

* smartirs ready

* change old tests

* remove lambdas

* address suggestions

* minor fix

* pep8 fix

* fix pickle problem

* flake8 fix

* fix bug in docstring

* added few tests

* fix normalize issue for pickling

* fix normalize issue for pickling

* test without sklearn api

* hanging idents and new tests

* add docstring

* add docstring

* better way cmparing floats

* old way of cmp floats

* doc fix[1]

* doc fix[2]

* fix description TODOs

* fix irksome comparision
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants