
Complete the implementation of SMART #2420

Merged: 20 commits into piskvorky:develop on Jul 7, 2019

Conversation

@Witiko (Contributor) commented Mar 17, 2019

This PR continues #1791 by completing the implementation of Salton's SMART Information Retrieval System in models.tfidf, and follows up on the discussion in the Gensim Google group from February. See the list of changes:

  • Make t an alias for the n term frequency method.
  • Make x an alias for the n document frequency method.
  • Rename the existing t document frequency method to f, and implement a new t document frequency method.
  • Make x an alias for the n document length normalization method.
  • Implement the u and b pivoted document length normalization methods.
  • Produce a helpful error message when a SMART scheme in the ddd.qqq format is requested:
>>> from gensim.models import TfidfModel
>>> TfidfModel(smartirs='Lnu.nnn')
ValueError: The notation Lnu.nnn specifies two term-weighting schemes,
            one for collection documents (Lnu) and one for queries (nnn).
            You must train two separate tf-idf models.
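Under the proposed behaviour, the check that produces this error can be sketched in a few lines of pure Python (`check_smartirs` is a hypothetical name; gensim's actual validation lives in models.tfidf and may differ):

```python
def check_smartirs(smartirs):
    """Reject composite SMART notations such as 'Lnu.nnn'.

    A minimal sketch of the validation described in this PR: a dot
    separates the document scheme (ddd) from the query scheme (qqq),
    and a single TfidfModel can only implement one of them.
    """
    if "." in smartirs:
        ddd, qqq = smartirs.split(".", 1)
        raise ValueError(
            "The notation %s specifies two term-weighting schemes, "
            "one for collection documents (%s) and one for queries (%s). "
            "You must train two separate tf-idf models." % (smartirs, ddd, qqq)
        )
    return smartirs
```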

@Witiko force-pushed the complete-smart branch 2 times, most recently from ebe67e7 to 8a41566 on March 17, 2019 02:56
These are our additions:

* Make `t` an alias for the `n` term frequency method.

* Implement the `f` document frequency method.

* Rename `t` document frequency method to `f`.

* Make `x` an alias for the `n` document frequency method.

* Make `x` an alias for the `n` document length normalization method.

* Implement the `u` pivoted document length normalization method.

* Add the `unique` vector norm to matutils.unitvec.

* Produce a helpful error message when a SMART scheme in the `ddd.qqq`
  format is requested.
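As a rough illustration of the `unique` norm added to matutils.unitvec above, here is a pure-Python sketch, assuming the 'unique' norm of a document is its number of distinct nonzero terms (as used by the pivoted 'u' scheme); the real implementation in gensim.matutils may differ:

```python
def unitvec_unique(bow):
    """Scale a bag-of-words vector by its 'unique' norm.

    Sketch only: the 'unique' norm is assumed to be the number of
    distinct terms with a nonzero weight. `bow` is a list of
    (term_id, weight) pairs, as in gensim's sparse format.
    """
    length = float(sum(1 for _, weight in bow if weight != 0))
    if length == 0.0:
        return list(bow)
    return [(term_id, weight / length) for term_id, weight in bow]

unitvec_unique([(0, 2.0), (3, 4.0)])  # -> [(0, 1.0), (3, 2.0)]
```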
@Witiko changed the title from "Extend implementation of SMART in models.tfidf" to "Complete the implementation of SMART in models.tfidf" on Mar 17, 2019
@Witiko changed the title from "Complete the implementation of SMART in models.tfidf" to "Complete the implementation of SMART" on Mar 17, 2019
@Witiko force-pushed the complete-smart branch 4 times, most recently from d9d27e1 to 09c8e36 on March 24, 2019 19:39
@Witiko (Contributor, author) commented Mar 24, 2019

This PR should be ready for a review.

@markroxor (Contributor) commented:

@Witiko thanks for the changes; I have wanted to add them for a long time. :)

@mpenkov (Collaborator) left a comment:

Sorry for the late review. I'm not as familiar with this part of gensim, so my first review is mainly cosmetic. Please have a look and let me know if you have questions.

(Review comments on gensim/matutils.py and gensim/models/tfidfmodel.py, all resolved.)
@Witiko (Contributor, author) commented May 5, 2019

Are there any other changes that you'd like to see, @piskvorky?

@Witiko (Contributor, author) commented May 8, 2019

What I'd like to see is changing the default SMART scheme of TfidfModel from nfc and ntc to something more recent, such as the dtb scheme suggested by Singhal; see our previous discussion. This would help eliminate weak baselines, but it would also break user code.

@piskvorky (Owner) commented May 8, 2019

We can add a fat warning / recommendation box into the documentation, so people know that (hopefully) superior alternatives exist. But I'd be wary of changing defaults, especially for well-established methods like TFIDF.

Although the "pivoted normalization" project in particular was a massive flop in my estimation. I'd need to see a much more coherent discussion of its merits and modes of use, along with benchmarks, before doing anything with it. To be honest, I'm more leaning toward ripping it out of Gensim, because both the code and its tutorial documentation were seriously weak IIRC.

@Witiko (Contributor, author) commented May 8, 2019

But I'd be wary of changing defaults, especially for well-established methods like TFIDF.

TFIDF is a blanket term that covers various weighting schemes, including pivoted normalization. I agree that the TfidfModel class is established and that changing its default weighting scheme may lead to confusion.

Although the "pivoted normalization" project in particular was a massive flop in my estimation. I'd need to see a much more coherent discussion of its merits and modes of use, along with benchmarks, before doing anything with it. To be honest, I'm more leaning toward ripping it out of Gensim, because both the code and its documentation (tutorial) were seriously weak IIRC.

Pivoted normalization consistently improves the performance of TFIDF on the information retrieval task, as shown in the TREC SMART papers. See for example Table 3 in the TREC 8 paper, where the XXu and XXb weighting schemes use pivoted normalization:

[Screenshot of Table 3 from the TREC 8 paper]

If the published results are not persuasive, we can run our own benchmarks. If the original implementation was weak, we can improve it. The latter is one of the aims of this pull request.

@piskvorky (Owner) commented May 8, 2019

Thanks for the detailed follow-up @Witiko.

Yes, I always take academic results with a grain of salt, because of the dreaded "publish-or-perish". SOTA tables typically don't factor in additional complexity and algorithmic robustness, which is critical for real-world inputs (and that's only when they don't directly overfit to the SOTA dataset, which they often do). "Simplicity and sanity" ≫ "A few percent lift in accuracy".

But I believe the main issue with the original project was its lack of documentation and workflow motivation, more than the code. IIRC the pivoted normalization required the user to supply some barely-documented parameter. I asked the author about this back then, and they had no answer, rendering the whole project rather useless and academic. I doubt anyone's used that implementation in Gensim since.

If you could fix it (the docs, possibly code), that'd be great. The other option is removing it.

@Witiko (Contributor, author) commented May 8, 2019

Note that this pull request is a large step forward, because the pivoted normalization is now hidden behind the SMART API. Just specifying smartirs='dtb' uses pivoted normalization behind the scenes, although the user is free to tweak the slope parameter.

Manual pivoted normalization, where the user specifies both the pivot and slope parameters, seems rarely used in the wild, which is why the original implementation may have flopped.

@piskvorky (Owner) commented May 8, 2019

OK, great. I'll have to re-read your updated docs, because our current docs:

[Screenshot of the current pivot and slope documentation]

are utterly impotent (and the linked blog post helps nothing). I wouldn't know what to set these parameters to myself => useless mode.
Just looking at it, I'm angry we ever merged that PR 😠 Gentler docs are needed.

@Witiko (Contributor, author) commented May 8, 2019

Currently, the updated docs specify that these parameters are overridden by SMART. I will give the documentation another read and see how we can improve it.

I can also write a follow-up article to rare-technologies.com/pivoted-document-length-normalisation, so that users see how to produce a strong baseline using the SMART API.

@Witiko force-pushed the complete-smart branch 2 times, most recently from dfd6de5 to 0144332 on May 17, 2019 23:12
@Witiko (Contributor, author) commented May 17, 2019

@piskvorky I suggest the following text:

pivot (float, optional)

In pivoted document length normalization, the effective norm of a document is the weighted average of the old norm and a pivot: slope × old norm + [1.0 − slope] × pivot.

When pivot is None, smartirs specifies either the unique (u) or the character-length (b) pivoted document length normalization scheme, and either corpus or dictionary are specified, then the pivot will be determined automatically. Otherwise when pivot is None, pivoted document length normalization will be disabled. Default is None.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

slope (float, optional)

In pivoted document length normalization, the effective norm of a document is the weighted average of the old norm and a pivot: slope × old norm + [1.0 − slope] × pivot.

Setting slope to 0.0 uses only the pivot as the norm, and setting slope to 1.0 disables pivoted document length normalization. Singhal suggests setting slope between 0.2 and 0.3 for best results. Default is 0.25.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
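To make the formula in this proposed text concrete, the effective norm can be sketched as a pure-Python one-liner (illustrative only, not gensim's actual code; the function name is hypothetical):

```python
def pivoted_norm(old_norm, pivot, slope=0.25):
    """Effective document norm under pivoted document length normalization.

    A direct transcription of the formula quoted in the proposed docs:
    slope * old_norm + (1.0 - slope) * pivot.
    """
    return slope * old_norm + (1.0 - slope) * pivot

# slope=1.0 leaves the old norm untouched (pivoting disabled),
# slope=0.0 replaces it entirely with the pivot:
pivoted_norm(10.0, 4.0, slope=1.0)  # -> 10.0
pivoted_norm(10.0, 4.0, slope=0.0)  # -> 4.0
pivoted_norm(10.0, 4.0)             # default slope=0.25 -> 5.5
```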

@Witiko force-pushed the complete-smart branch 4 times, most recently from 2e96965 to 71f55e3 on May 18, 2019 08:37
@piskvorky (Owner) commented May 18, 2019

Thanks @Witiko ! Much better. Can you please expand two things:

  1. What is pivoted length normalization (not only the formula, but what purpose does it serve, where does it fit conceptually in the processing pipeline, who should care & why – a short sentence or two max, upfront, for context).

  2. This paragraph:

When pivot is None, smartirs specifies either the unique (u) or the character-length (b) pivoted document length normalization scheme, and either corpus or dictionary are specified, then the pivot will be determined automatically. Otherwise when pivot is None, pivoted document length normalization will be disabled. Default is None.

is hard to parse. I think I got what you mean; how about this instead:

You can either specify the pivot value directly as a float, or leave it to be determined automatically.

To set pivot automatically, do all of the following three steps:

  1. Leave pivot=None (default value)
  2. Set either u or b normalization mode in the smartirs parameter.
  3. Set either corpus or dictionary parameter (the pivot value will be determined automatically from the properties of that particular corpus or dictionary).

If pivot=None (default value) but steps 2. and 3. above are not met, pivot normalization is turned off (no pivot normalization).

If my reading of your explanation here is correct (is it?), the API seems unfortunate. pivot=None meaning both "ignore pivot" and "pivot is active and determined automatically", based on values of some other parameters, is not good API design. Can you think of a way to make that cleaner?

Is setting pivot=0.25 enough to activate the "manual" pivoted normalization, or do I still have to do step 2?

I can also write a follow-up article to rare-technologies.com/pivoted-document-length-normalisation, so that users see how to produce a strong baseline using the SMART API.

That'd be great indeed! Then I'd link the two through, and promote this "blog series", once it offers clearer value to readers (ideally with some actionable, "obviously useful" example).

@Witiko (Contributor, author) commented May 24, 2019

@piskvorky:

  1. What is pivoted length normalization (not only the formula, but what purpose does it serve, where does it fit conceptually in the processing pipeline, who should care & why – a short sentence or two max, upfront, for context).

We can add the following short explanation: “In information retrieval, TF-IDF is biased against long documents. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.”

If my reading of your explanation here is correct (is it?), the API seems unfortunate. pivot=None meaning both "ignore pivot" and "pivot is active and determined automatically", based on values of some other parameters, is not good API design. Can you think of a way to make that cleaner?

It is. We can make smartirs always override pivot, i.e. the value of pivot will be ignored if smartirs is specified. This is much more intuitive, since smartirs is a higher-level API. It is also consistent, because smartirs already overrides normalize, and it allows us to skip step 1 in the documentation:

  1. Set either u or b document normalization mode in the smartirs parameter.
  2. Set either corpus or dictionary parameter. The pivot will be determined automatically from the properties of that particular corpus or dictionary.

Is setting pivot=0.25 enough to activate the "manual" pivoted normalization, or do I still have to do step 2?

Yes, you only need corpus or dictionary to automatically determine pivot. When you specify pivot manually, corpus or dictionary is not required. This is also the current behavior of Gensim, nothing changes here.
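As an illustration of the automatic case, here is a pure-Python sketch assuming (per Singhal's papers) that a natural pivot is the mean document norm over the corpus; `auto_pivot` is a hypothetical helper, and gensim's actual rule may differ:

```python
def auto_pivot(corpus, norm):
    """Estimate a pivot as the mean document norm over a corpus.

    Sketch only: `corpus` is a list of bag-of-words documents
    (lists of (term_id, count) pairs) and `norm` maps a document
    to its norm, e.g. `len` for the 'u' (unique terms) scheme.
    """
    norms = [norm(doc) for doc in corpus]
    return sum(norms) / float(len(norms))

corpus = [
    [(0, 1), (1, 2)],          # 2 unique terms
    [(0, 3)],                  # 1 unique term
    [(1, 1), (2, 1), (3, 1)],  # 3 unique terms
]
pivot = auto_pivot(corpus, norm=len)  # -> 2.0
```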

This is the current suggested text of the documentation:

pivot (float or None, optional)

In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

You can either set the pivot by hand, or you can let Gensim figure it out automatically with the following two steps:

  1. Set either the u or b document normalization in the smartirs parameter.
  2. Set either the corpus or dictionary parameter. The pivot will be automatically determined from the properties of the corpus or dictionary.

If pivot is None and you don't follow steps 1 and 2, then pivoted document length normalization will be disabled. Default is None.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

slope (float, optional)

In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.

Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results. Default is 0.25.

See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.

References

  1. Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted Document Length Normalization. SIGIR Forum, 51, 176-184.
  2. Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4), 35-43.

@piskvorky (Owner) commented May 24, 2019

@mpenkov WDYT? Does this documentation upgrade make more sense to you, as a user?

@Witiko (Contributor, author) commented May 27, 2019

@piskvorky I think the new text strikes a nice balance: it is both brief and informative. If you or @mpenkov have no other suggestions, then I will commit the changes, so that the PR is ready for the 3.8.0 release.

That'd be great indeed! Then I'd link the two through, and promote this "blog series", once it offers clearer value to readers (ideally with some actionable, "obviously useful" example).

SMART was developed for information retrieval, but text classification is easier to discuss, so I have yet to decide if we should use one or the other (or both) for the examples. The idea is to start with BOW as a baseline and work our way through a couple of SMART schemes, ending up with a much stronger vector space model. We can top off by introducing non-orthogonality (i.e. word mover's distance and soft cosine measure), which will make the model even stronger.

Please let me know when you'd like the article to be finished, what markup language I should use for the submission (and anything else that comes to mind), either here or on Slack. If there are no objections, I would cross-post the article to Medium, so that the new SMART API gets more exposure.

@mpenkov (Collaborator) commented May 30, 2019

Looks good to me.

Trying to work out why the Appveyor builds are failing. Once that's sorted, we can merge.

@Witiko (Contributor, author) commented May 30, 2019

@mpenkov Thank you for the feedback. We discussed the failing AppVeyor builds in #2497.

@Witiko (Contributor, author) commented May 31, 2019

@piskvorky @mpenkov I just pushed the documentation changes. The rendered documentation looks as follows. With the AppVeyor build fixed, this PR should be ready for merge.

[Screenshot of the rendered documentation]

@mpenkov merged commit 11eb5df into piskvorky:develop on Jul 7, 2019
@mpenkov (Collaborator) commented Jul 7, 2019

OK, finally merged. @Witiko Thank you for your contribution!

Labels: feature (Issue described a new feature)
4 participants