
Complete the implementation of SMART #2420

Merged: 20 commits into piskvorky:develop on Jul 7, 2019

Conversation

@Witiko (Contributor) commented Mar 17, 2019

This PR continues #1791 by completing the implementation of Salton's SMART Information Retrieval System in models.tfidf, and follows up on the discussion in the Gensim Google group from February. See the list of changes:

  • Make t an alias for the n term frequency method.
  • Make x an alias for the n document frequency method.
  • Rename the existing t document frequency method to f, and implement a new t document frequency method.
  • Make x an alias for the n document length normalization method.
  • Implement the u and b pivoted document length normalization methods.
  • Produce a helpful error message when a SMART scheme in the ddd.qqq format is requested:
>>> from gensim.models import TfidfModel
>>> TfidfModel(smartirs='Lnu.nnn')
ValueError: The notation Lnu.nnn specifies two term-weighting schemes,
            one for collection documents (Lnu) and one for queries (nnn).
            You must train two separate tf-idf models.
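Under the proposed behaviour, the check that produces this error can be sketched in a few lines of pure Python (`check_smartirs` is a hypothetical name; gensim's actual validation lives in models.tfidf and may differ):

```python
def check_smartirs(smartirs):
    """Reject composite SMART notations such as 'Lnu.nnn'.

    A minimal sketch of the validation described in this PR: a dot
    separates the document scheme (ddd) from the query scheme (qqq),
    and a single TfidfModel can only implement one of them.
    """
    if "." in smartirs:
        ddd, qqq = smartirs.split(".", 1)
        raise ValueError(
            "The notation %s specifies two term-weighting schemes, "
            "one for collection documents (%s) and one for queries (%s). "
            "You must train two separate tf-idf models." % (smartirs, ddd, qqq)
        )
    return smartirs
```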

@Witiko force-pushed the complete-smart branch 2 times, most recently from ebe67e7 to 8a41566 on March 17, 2019 02:56
These are our additions:

* Make `t` an alias for the `n` term frequency method.

* Implement the `f` document frequency method.

* Rename `t` document frequency method to `f`.

* Make `x` an alias for the `n` document frequency method.

* Make `x` an alias for the `n` document length normalization method.

* Implement the `u` pivoted document length normalization method.

* Add the `unique` vector norm to matutils.unitvec.

* Produce a helpful error message when a SMART scheme in the `ddd.qqq`
  format is requested.
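As a rough illustration of the `unique` norm added to matutils.unitvec above, here is a pure-Python sketch, assuming the 'unique' norm of a document is its number of distinct nonzero terms (as used by the pivoted 'u' scheme); the real implementation in gensim.matutils may differ:

```python
def unitvec_unique(bow):
    """Scale a bag-of-words vector by its 'unique' norm.

    Sketch only: the 'unique' norm is assumed to be the number of
    distinct terms with a nonzero weight. `bow` is a list of
    (term_id, weight) pairs, as in gensim's sparse format.
    """
    length = float(sum(1 for _, weight in bow if weight != 0))
    if length == 0.0:
        return list(bow)
    return [(term_id, weight / length) for term_id, weight in bow]

unitvec_unique([(0, 2.0), (3, 4.0)])  # -> [(0, 1.0), (3, 2.0)]
```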
@Witiko changed the title from "Extend implementation of SMART in models.tfidf" to "Complete the implementation of SMART in models.tfidf" on Mar 17, 2019
@Witiko changed the title from "Complete the implementation of SMART in models.tfidf" to "Complete the implementation of SMART" on Mar 17, 2019
@Witiko force-pushed the complete-smart branch 4 times, most recently from d9d27e1 to 09c8e36 on March 24, 2019 19:39
@Witiko (Contributor, author) commented Mar 24, 2019

This PR should be ready for a review.

@markroxor (Contributor) commented:

@Witiko thanks for the changes; I have wanted to add them for a long time. :)

@mpenkov (Collaborator) left a comment:

Sorry for the late review. I'm not as familiar with this part of gensim, so my first review is mainly cosmetic. Please have a look and let me know if you have questions.

(Review comments on gensim/matutils.py and gensim/models/tfidfmodel.py, all resolved.)
@Witiko (Contributor, author) commented May 5, 2019

Are there any other changes that you'd like to see, @piskvorky?

@Witiko (Contributor, author) commented May 8, 2019

What I'd like to see is changing the default SMART scheme of TfidfModel from nfc and ntc to something more recent, such as the dtb scheme suggested by Singhal; see our previous discussion. This would help eliminate weak baselines, but it would also break user code.

@piskvorky (Owner) commented May 8, 2019

We can add a fat warning / recommendation box into the documentation, so people know that (hopefully) superior alternatives exist. But I'd be wary of changing defaults, especially for well-established methods like TFIDF.

Although the "pivoted normalization" project in particular was a massive flop in my estimation. I'd need to see a much more coherent discussion of its merits and modes of use, along with benchmarks, before doing anything with it. To be honest, I'm more leaning toward ripping it out of Gensim, because both the code and its tutorial documentation were seriously weak IIRC.

@Witiko (Contributor, author) commented May 8, 2019

But I'd be wary of changing defaults, especially for well-established methods like TFIDF.

TFIDF is a blanket term that covers various weighting schemes, including pivoted normalization. I agree that the TfidfModel class is established and that changing its default weighting scheme may lead to confusion.

Although the "pivoted normalization" project in particular was a massive flop in my estimation. I'd need to see a much more coherent discussion of its merits and modes of use, along with benchmarks, before doing anything with it. To be honest, I'm more leaning toward ripping it out of Gensim, because both the code and its documentation (tutorial) were seriously weak IIRC.

Pivoted normalization consistently improves the performance of TFIDF on the information retrieval task, as shown in the TREC SMART papers. See for example Table 3 in the TREC 8 paper, where the XXu and XXb weighting schemes use pivoted normalization:

[Screenshot of Table 3 from the TREC 8 paper]

If the published results are not persuasive, we can run our own benchmarks. If the original implementation was weak, we can improve it. The latter is one of the aims of this pull request.

@piskvorky (Owner) commented May 8, 2019

Thanks for the detailed follow-up @Witiko.

Yes, I always take academic results with a grain of salt, because of the dreaded "publish-or-perish". SOTA tables typically don't factor in additional complexity and algorithmic robustness, which is critical for real-world inputs (and that's only when they don't directly overfit to the SOTA dataset, which they often do). "Simplicity and sanity" ≫ "A few percent lift in accuracy".

But I believe the main issue with the original project was its lack of documentation and workflow motivation, more than the code. IIRC the pivoted normalization required the user to supply some barely-documented parameter. I asked the author about this back then, and they had no answer, rendering the whole project rather useless and academic. I doubt anyone's used that implementation in Gensim since.

If you could fix it (the docs, possibly code), that'd be great. The other option is removing it.

@Witiko (Contributor, author) commented May 8, 2019

Note that this pull request is a large step forward, because the pivoted normalization is now hidden behind the SMART API. Just specifying smartirs='dtb' uses pivoted normalization behind the scenes, although the user is free to tweak the slope parameter.

Manual pivoted normalization, where the user specifies both the pivot and slope parameters, seems rarely used in the wild, which is why the original implementation may have flopped.

@piskvorky (Owner) commented May 8, 2019

OK, great. I'll have to re-read your updated docs, because our current docs:

[Screenshot of the current pivot and slope documentation]

are utterly impotent (and the linked blog post helps nothing). I wouldn't know what to set these parameters to myself => useless mode.
Just looking at it, I'm angry we ever merged that PR 😠 Gentler docs are needed.

@Witiko (Contributor, author) commented May 8, 2019

Currently, the updated docs specify that these parameters are overridden by SMART. I will give the documentation another read and see how we can improve it.

I can also write a follow-up article to rare-technologies.com/pivoted-document-length-normalisation, so that users see how to produce a strong baseline using the SMART API.

@Witiko force-pushed the complete-smart branch 2 times, most recently from dfd6de5 to 0144332 on May 17, 2019 23:12
@Witiko (Contributor, author) commented May 17, 2019

@piskvorky I suggest the following text:

pivot (float, optional)

In pivoted document length normalization, the effective norm of a document is the weighted average of the old norm and a pivot: slope × old norm + [1.0 − slope] × pivot.

When pivot is None, smartirs specifies either the unique (u) or the character-length (b) pivoted document length normalization scheme, and either corpus or dictionary are specified, then the pivot will be determined automatically. Otherwise when pivot is None, pivoted document length normalization will be disabled. Default is None.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

slope (float, optional)

In pivoted document length normalization, the effective norm of a document is the weighted average of the old norm and a pivot: slope × old norm + [1.0 − slope] × pivot.

Setting slope to 0.0 uses only the pivot as the norm, and setting slope to 1.0 disables pivoted document length normalization. Singhal suggests setting slope between 0.2 and 0.3 for best results. Default is 0.25.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
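To make the formula in this proposed text concrete, the effective norm can be sketched as a pure-Python one-liner (illustrative only, not gensim's actual code; the function name is hypothetical):

```python
def pivoted_norm(old_norm, pivot, slope=0.25):
    """Effective document norm under pivoted document length normalization.

    A direct transcription of the formula quoted in the proposed docs:
    slope * old_norm + (1.0 - slope) * pivot.
    """
    return slope * old_norm + (1.0 - slope) * pivot

# slope=1.0 leaves the old norm untouched (pivoting disabled),
# slope=0.0 replaces it entirely with the pivot:
pivoted_norm(10.0, 4.0, slope=1.0)  # -> 10.0
pivoted_norm(10.0, 4.0, slope=0.0)  # -> 4.0
pivoted_norm(10.0, 4.0)             # default slope=0.25 -> 5.5
```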

@Witiko force-pushed the complete-smart branch 4 times, most recently from 2e96965 to 71f55e3 on May 18, 2019 08:37
@piskvorky (Owner) commented May 18, 2019

Thanks @Witiko ! Much better. Can you please expand two things:

  1. What is pivoted length normalization (not only the formula, but what purpose does it serve, where does it fit conceptually in the processing pipeline, who should care & why – a short sentence or two max, upfront, for context).

  2. This paragraph:

When pivot is None, smartirs specifies either the unique (u) or the character-length (b) pivoted document length normalization scheme, and either corpus or dictionary are specified, then the pivot will be determined automatically. Otherwise when pivot is None, pivoted document length normalization will be disabled. Default is None.

is hard to parse. I think I got what you mean; how about this instead:

You can either specify the pivot value directly as a float, or leave it to be determined automatically.

To set pivot automatically, do all of the following three steps:

  1. Leave pivot=None (default value)
  2. Set either u or b normalization mode in the smartirs parameter.
  3. Set either corpus or dictionary parameter (the pivot value will be determined automatically from the properties of that particular corpus or dictionary).

If pivot=None (default value) but steps 2. and 3. above are not met, pivot normalization is turned off (no pivot normalization).

If my reading of your explanation here is correct (is it?), the API seems unfortunate. pivot=None meaning both "ignore pivot" and "pivot is active and determined automatically", based on values of some other parameters, is not good API design. Can you think of a way to make that cleaner?

Is setting pivot=0.25 enough to activate the "manual" pivoted normalization, or do I still have to do step 2?

I can also write a follow-up article to rare-technologies.com/pivoted-document-length-normalisation, so that users see how to produce a strong baseline using the SMART API.

That'd be great indeed! Then I'd link the two through, and promote this "blog series", once it offers clearer value to readers (ideally with some actionable, "obviously useful" example).

@Witiko (Contributor, author) commented May 24, 2019

@piskvorky:

  1. What is pivoted length normalization (not only the formula, but what purpose does it serve, where does it fit conceptually in the processing pipeline, who should care & why – a short sentence or two max, upfront, for context).

We can add the following short explanation: “In information retrieval, TF-IDF is biased against long documents. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.”

If my reading of your explanation here is correct (is it?), the API seems unfortunate. pivot=None meaning both "ignore pivot" and "pivot is active and determined automatically", based on values of some other parameters, is not good API design. Can you think of a way to make that cleaner?

It is. We can make smartirs always override pivot, i.e. the value of pivot will be ignored if smartirs is specified. This is much more intuitive, since smartirs is a higher-level API. It is also consistent, because smartirs already overrides normalize, and it allows us to skip step 1 in the documentation:

  1. Set either u or b document normalization mode in the smartirs parameter.
  2. Set either corpus or dictionary parameter. The pivot will be determined automatically from the properties of that particular corpus or dictionary.

Is setting pivot=0.25 enough to activate the "manual" pivoted normalization, or do I still have to do step 2?

Yes, you only need corpus or dictionary to automatically determine pivot. When you specify pivot manually, corpus or dictionary is not required. This is also the current behavior of Gensim, nothing changes here.
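As an illustration of the automatic case, here is a pure-Python sketch assuming (per Singhal's papers) that a natural pivot is the mean document norm over the corpus; `auto_pivot` is a hypothetical helper, and gensim's actual rule may differ:

```python
def auto_pivot(corpus, norm):
    """Estimate a pivot as the mean document norm over a corpus.

    Sketch only: `corpus` is a list of bag-of-words documents
    (lists of (term_id, count) pairs) and `norm` maps a document
    to its norm, e.g. `len` for the 'u' (unique terms) scheme.
    """
    norms = [norm(doc) for doc in corpus]
    return sum(norms) / float(len(norms))

corpus = [
    [(0, 1), (1, 2)],          # 2 unique terms
    [(0, 3)],                  # 1 unique term
    [(1, 1), (2, 1), (3, 1)],  # 3 unique terms
]
pivot = auto_pivot(corpus, norm=len)  # -> 2.0
```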

This is the current suggested text of the documentation:

pivot (float or None, optional)

In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

You can either set the pivot by hand, or you can let Gensim figure it out automatically with the following two steps:

  1. Set either the u or b document normalization in the smartirs parameter.
  2. Set either the corpus or dictionary parameter. The pivot will be automatically determined from the properties of the corpus or dictionary.

If pivot is None and you don't follow steps 1 and 2, then pivoted document length normalization will be disabled. Default is None.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

slope (float, optional)

In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results. Default is 0.25.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

References

  1. Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted Document Length Normalization. SIGIR Forum, 51, 176-184.
  2. Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4), 35-43.

@piskvorky (Owner) commented May 24, 2019

@mpenkov WDYT? Does this documentation upgrade make more sense to you, as a user?

@Witiko (Contributor, author) commented May 27, 2019

@piskvorky I think the new text strikes a nice balance: it is both brief and informative. If you or @mpenkov have no other suggestions, then I will commit the changes, so that the PR is ready for the 3.8.0 release.

That'd be great indeed! Then I'd link the two through, and promote this "blog series", once it offers clearer value to readers (ideally with some actionable, "obviously useful" example).

SMART was developed for information retrieval, but text classification is easier to discuss, so I have yet to decide if we should use one or the other (or both) for the examples. The idea is to start with BOW as a baseline and work our way through a couple of SMART schemes, ending up with a much stronger vector space model. We can top off by introducing non-orthogonality (i.e. word mover's distance and soft cosine measure), which will make the model even stronger.

Please let me know when you'd like the article to be finished, what markup language I should use for the submission (and anything else that comes to mind), either here or on Slack. If there are no objections, I would cross-post the article to Medium, so that the new SMART API gets more exposure.

@mpenkov (Collaborator) commented May 30, 2019

Looks good to me.

Trying to work out why the Appveyor builds are failing. Once that's sorted, we can merge.

@Witiko (Contributor, author) commented May 30, 2019

@mpenkov Thank you for the feedback. We discussed the failing AppVeyor builds in #2497.

@Witiko (Contributor, author) commented May 31, 2019

@piskvorky @mpenkov I just pushed the documentation changes. The rendered documentation looks as follows. With the AppVeyor build fixed, this PR should be ready for merge.

[Screenshot of the rendered documentation]

@mpenkov merged commit 11eb5df into piskvorky:develop on Jul 7, 2019
@mpenkov (Collaborator) commented Jul 7, 2019

OK, finally merged. @Witiko Thank you for your contribution!

Labels: feature (Issue described a new feature)
4 participants