-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complete the implementation of SMART #2420
Conversation
ebe67e7
to
8a41566
Compare
These are our additions: * Make `t` an alias for the `n` term frequency method. * Implement the `f` document frequency method. * Rename `t` document frequency method to `f`. * Make `x` an alias for the `n` document frequency method. * Make `x` an alias for the `n` document length normalization method. * Implement the `u` pivoted document length normalization method. * Implement the `unique` vector norm to matutils.unitvec. * Produce a helpful error message when a SMART scheme in the `ddd.qqq` format is requested.
d9d27e1
to
09c8e36
Compare
This PR should be ready for a review. |
@Witiko thanks for the changes, I wanted to add them since a long time. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. I'm not as familiar with this part of gensim, so my first review is mainly cosmetic. Please have a look and let me know if you have questions.
Are there any other changes that you'd like to see, @piskvorky? |
What I'd like to see is changing the default SMART scheme of |
We can add a fat warning / recommendation box into the documentation, so people know that (hopefully) superior alternatives exist. But I'd be wary of changing defaults, especially for well-established methods like TFIDF. Although the "pivoted normalization" project in particular was a massive flop in my estimation. I'd need to see a much more coherent discussion of its merits and modes of use, along with benchmarks, before doing anything with it. To be honest, I'm more leaning toward ripping it out of Gensim, because both the code and its tutorial documentation were seriously weak IIRC. |
TFIDF is a blanket term, which covers various weighting schemes including pivoted normalization. I agree that the
Pivoted normalization unanimously improves the performance of TFIDF on the information retrieval task, as shown in the TREC SMART papers. See for example Table 3 in the TREC 8 paper, where the If the published results are not persuasive, we can run our own benchmarks. If the original implementaton was weak, we can improve it. The latter is one of the aims of this pull request. |
Thanks for the detailed follow-up @Witiko. Yes, I always take academic results with a grain of salt, because of the dreaded "publish-or-perish". SOTA tables typically don't factor in additional complexity and algorithmic robustness, which is critical for real-world inputs (and that's only when they don't directly overfit to the SOTA dataset, which they often do). "Simplicity and sanity" ≫ "A few percent lift in accuracy". But I believe the main issue with the original project was its lack of documentation and workflow motivation, more than the code. IIRC the pivoted normalization required the user to supply some barely-documented parameter. I asked the author about this back then, and they had no answer, rendering the whole project rather useless and academic. I doubt anyone's used that implementation in Gensim since. If you could fix it (the docs, possibly code), that'd be great. The other option is removing it. |
Note that this pull request is a large step forward, because the pivoted normalization is now hidden behind the SMART API. Just specifying Manual pivoted normalization, where the user specifies both the |
OK, great. I'll have to re-read your updated docs, because our current docs: are utterly impotent (and the linked blog post helps nothing). I wouldn't know what to set these parameters to myself => useless mode. |
Currently, the updated docs specify that these parameters are overriden by SMART. I will give the documentation another read and see how we can improve it. I can also write a follow-up article to rare-technologies.com/pivoted-document-length-normalisation, so that users see how to produce a strong baseline using the SMART API. |
dfd6de5
to
0144332
Compare
@piskvorky I suggest the following text:
|
2e96965
to
71f55e3
Compare
Thanks @Witiko ! Much better. Can you please expand two things:
is hard to parse. I think I got what you mean; how about this instead:
If my reading of your explanation here is correct (is it?), the API seems unfortunate. Is setting
That'd be great indeed! Then I'd link the two through, and promote this "blog series", once it offers clearer value to readers (ideally with some actionable, "obviously useful" example). |
We can add the following short explanation: “In information retrieval, TF-IDF is biased against long documents. Pivoted document length normalization solves this problem by changing the norm of a document to
It is. We can make
Yes, you only need This is the current suggested text of the documentation:
|
@mpenkov WDYT? Does this documentation upgrade make more sense to you, as a user? |
@piskvorky I think the new text strikes a nice balance: it is both brief and informative. If you or @mpenkov have no other suggestions, then I will commit the changes, so that the PR is ready for the 3.8.0 release.
SMART was developed for information retrieval, but text classification is easier to discuss, so I have yet to decide if we should use one or the other (or both) for the examples. The idea is to start with BOW as a baseline and work our way through a couple of SMART schemes, ending up with a much stronger vector space model. We can top off by introducing non-orthogonality (i.e. word mover's distance and soft cosine measure), which will make the model even stronger. Please, let me know when you'd like the article to be finished, what markup language I should use for the submission (and anything else that comes to mind) either here, or on Slack. If there are no objections, I would cross-post the article to Medium, so that the new SMART API gets more exposure. |
Looks good to me. Trying to work out why the Appveyor builds are failing. Once that's sorted, we can merge. |
@piskvorky @mpenkov I just pushed the documentation changes. The rendered documentation looks as follows. With the AppVeyor build fixed, this PR should be ready for merge. |
OK, finally merged. @Witiko Thank you for your contribution! |
This PR continues #1791 by completing the implementation of Salton's SMART Information Retrieval System in
models.tfidf
, and follows up on the discussion in the Gensim Google group from February. See the list of changes:t
an alias for then
term frequency method.x
an alias for then
document frequency method.t
document frequency method, and rename the existingt
document frequency method tof
.x
an alias for then
document length normalization method.u
andb
pivoted document length normalization methods.ddd.qqq
format is requested: