
[WIP] Add sent2vec in Gensim #1458

Closed
wants to merge 2 commits into from

Conversation

souravsingh
Contributor

Adds sent2vec algorithm as a wrapper.

Fixes #1376


def word_vec(self, word, use_norm=False):
"""
Accept a single word as input.
Contributor

Please use Google docstring format (everywhere)
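For reference, a Google-style docstring for the `word_vec` signature shown above might look like this. This is only a sketch; the parameter semantics are assumed from the word2vec/fastText convention, not taken from the PR:

```python
def word_vec(self, word, use_norm=False):
    """Accept a single word as input and return its vector.

    Args:
        word (str): The input word.
        use_norm (bool): If True, return the L2-normalised vector
            instead of the raw one.

    Returns:
        numpy.ndarray: The vector representation of `word`.

    Raises:
        KeyError: If `word` is not in the vocabulary.
    """
    raise NotImplementedError  # body omitted in this sketch
```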

import logging
import tempfile
import os
import struct
Contributor

unused import

import numpy as np
from numpy import float32 as REAL, sqrt, newaxis
from gensim import utils
from gensim.models.keyedvectors import KeyedVectors, Vocab
Contributor

unused import Vocab

from six import string_types

logger = logging.getLogger(__name__)

Contributor

Two blank lines before class definitions (everywhere)

Note that you **cannot continue training** after doing a replace. The model becomes
effectively read-only: you can only call `most_similar`, `similarity`, etc.
"""
super(FastTextKeyedVectors, self).init_sims(replace)
Contributor

undefined FastTextKeyedVectors
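(For context on the snippet above: what `init_sims(replace=True)` effectively does, in any of these KeyedVectors classes, is L2-normalise each row of the vector matrix in place, discarding the raw vectors; that is why the model becomes read-only afterwards. A numpy sketch of the idea, not code from the PR:)

```python
import numpy as np

# Toy vector matrix standing in for the model's word-vector array.
vectors = np.random.rand(5, 4).astype(np.float32)

# replace=True: normalise in place, so the raw (trainable) vectors are gone.
vectors /= np.sqrt((vectors ** 2).sum(axis=1))[:, np.newaxis]

# Every row now has unit L2 norm, so dot products are cosine similarities.
```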

cmd.append("-%s" % option)
cmd.append(str(value))

output = utils.check_output(args=cmd)
Contributor

@menshikh-iv menshikh-iv Jul 5, 2017

output - unused variable
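(The command-building pattern in the snippet above can be sketched as follows; capturing and logging the tool's output also resolves the unused-variable comment. The binary name and option names here are illustrative assumptions, not the PR's actual values:)

```python
import subprocess

def build_cmd(binary, output_path, **options):
    """Build a fasttext-style command line from keyword options."""
    cmd = [binary, "sent2vec", "-output", output_path]
    for option, value in options.items():
        cmd.append("-%s" % option)
        cmd.append(str(value))
    return cmd

cmd = build_cmd("./fasttext", "my_model", dim=100, epoch=5)
# Run it and log stdout instead of discarding it, e.g.:
# output = subprocess.check_output(cmd)
# logger.info("sent2vec training output: %s", output)
```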

@menshikh-iv
Contributor

I have a question: what's the difference between this PR and the existing fasttext wrapper?

Also, please do several things:

  • Write tests for your wrapper
  • Create a notebook with usage/comparison
  • Fix the review comments

@souravsingh
Contributor Author

As mentioned in the link here: https://github.com/epfml/sent2vec

The algorithm builds on FastText to create features and representations for short texts and sentences. You can say it is an extension of Word2Vec.

@menshikh-iv
Contributor

@souravsingh But we already have Doc2Vec (aka ParagraphVectors) as an "extension of w2v for texts"; what are the advantages of this approach?

@souravsingh
Contributor Author

@menshikh-iv I will be conducting a benchmark between sent2vec and Doc2Vec on Wikipedia data.

Maybe @martinjaggi can have a better answer to your question?

@martinjaggi

@menshikh-iv
doc2vec doesn't perform well; see for example these papers comparing it to sent2vec: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval001.pdf and
https://arxiv.org/pdf/1703.02507

@souravsingh
I agree with @menshikh-iv that it would likely be best to keep this very similar to the fasttext wrapper, and reuse as much as possible; the only difference is the training algorithm used. Note that we have just updated our code to be compatible with the newest version of fastText models.

@piskvorky
Owner

This algo looks great, and the results seem far superior to doc2vec.

If these results can be replicated, I'm in favour of supplanting doc2vec with sent2vec in gensim. As with fastText, we ideally want a fast native implementation (not just a wrapper for C++).

@gojomo
Collaborator

gojomo commented Jul 9, 2017

I have to take the indications of results "far superior to doc2vec" with some grains of salt. For example, in the original sent2vec paper, they only train their PV-DBOW/PV-DM models on a 900 megaword Toronto Books corpus, with unclear metaparameters/metaoptimization – then reuse those models across all tasks. (And yet across all the tables, few of the models besides skip-thought are good performers when using Toronto Books as training data.) Meanwhile they evaluate sent2vec with a 2 gigaword Twitter corpus and then a 20 gigaword Wikipedia corpus, with (presumably) careful choice of their own model's parameters. How well would PV (or other algorithms!) perform given the same training data and level of meta-optimization? There isn't any evidence.

(Similarly, the SemEval results for PV are limited to "PV-DBOW that uses the model from Lau & Baldwin [2016]" – a single Wikipedia-based model, which only loads in Lau's gensim fork, with unclear metaparameters/metaoptimization. That downloadable model is also suspiciously small - 1.4GB suggests less than all of Wikipedia may have been used for training.)

All that said, the innovations of sent2vec all seem useful and intuitively likely to help improve doc-vectors. As I understand the paper, the key changes seem to be (1) n-grams are also trained; (2) the window is always the full doc; (3) a small amount of dropout applied to just the n-grams.

These could be layered into the existing or a future unified gensim Doc2Vec model as preprocessing steps or new optional parameters. N-grams could be simulated today via preprocessing (that inserts synthesized n-grams into texts); setting a super-large window would approximate a full-document window. Drop-out might be a helpful new option even for word2vec and vanilla PV options. (There was another single-author paper, which I can't locate at the moment but which was mentioned in a gensim GitHub request, that like FastText & sent2vec was training doc-vecs as sums-of-word-vecs-with-dropout to lessen memory overhead in large corpora.)
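The n-gram-via-preprocessing idea can be sketched concretely. The token-joining convention ("w1_w2") is an illustrative assumption, not anything from the PR or paper:

```python
def add_bigrams(tokens):
    """Return tokens plus synthesized bigram pseudo-tokens (sketch)."""
    return tokens + ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]

docs = [["the", "cat", "sat"], ["dogs", "bark"]]
augmented = [add_bigrams(d) for d in docs]
# augmented[0] == ["the", "cat", "sat", "the_cat", "cat_sat"]
```

These augmented texts could then be fed to a Doc2Vec model with a very large `window` value to approximate sent2vec's full-document context.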

I have a hunch there's a meta-model out of which these are all parameterized instances. It may be useful to more precisely refer to the existing gensim implementation as "PV" so that "Doc2Vec" can be a generic umbrella name for the techniques.

@piskvorky
Owner

piskvorky commented Jul 10, 2017

Agreed. A practical unbiased evaluation is a big part of the challenge here -- we definitely don't want to replace proven, optimized algorithms with a bird in the bush.

Chiseling a common structure out of these related methods sounds non-trivial but potentially very beneficial (as long as these abstractions don't compromise the performance). It will also allow some sanity in maintaining all that stuff.

@martinjaggi

@gojomo Indeed, our first version of the paper didn't train on Toronto Books yet, but we have fixed this. The performance is very robust over all 3 corpora (Wikipedia, Twitter, Toronto), and we have published all 3 pretrained models for comparison. We are not affiliated with the SemEval 2017 organizers, so their evaluation of sent2vec is an independent confirmation, with zero parameter tuning and even without using an optimized tokenizer. As for PV not performing well, this seems to be a consistent picture by now, as PV is one of the standard baselines in many applications (despite the downside that inference is non-trivial, in contrast to sent2vec).
It would be nice if you could share the other paper you mentioned; could it be the Siamese CBOW, maybe?

@gojomo
Collaborator

gojomo commented Jul 10, 2017

@martinjaggi But what PV metaparameters did you choose, and did you use as much effort in picking those as was used in picking the values in Table 5?

I can believe your techniques have helped – ngrams & larger windows & dropout all seem like good ideas. But without more details, I can't trust the magnitudes-of-improvement. (And further, without other measures of overhead - for example the effect of ngram expansion and giant windows on memory and training time – also hard to know if vanilla Doc2Vec might not still be preferable for some projects.)

That you've done a Toronto-Toronto apples-to-apples (corpus) comparison helps a little, but the metaoptimization issue remains. And, it seems sent2vec did really well across all evaluations when trained on the Twitter data... so why not give every other algorithm a chance to train on that data, too, for that apples-to-apples comparison?

My issue with the SemEval paper isn't one of affiliation. They downloaded one oddish pre-trained model; it's mentioned as one of the only 2 in the whole setup NOT from the algorithm's originators. The size of the file looks incomplete to me, given the description of its origin. And in other discussions, I've highlighted several areas where the Lau & Baldwin evaluation of PV seems inconsistent.

(The latest SemEval paper you linked to, http://nlp.arizona.edu/SemEval-2017/pdf/SemEval001.pdf, on multilingual comparisons doesn't even seem to have sent2vec or PV-DBOW scores, so not sure what its relevance is.)

While I appreciate the attempt to benchmark off-the-shelf models, trained on generic data, against a variety of other specific datasets, that's not a typical way of using PV (or many similar methods); it is more typical to train on exactly the text domain in which you intend to compare, so that its specific vocabulary/meanings are learned.

I found that other paper. It seems similar to 'Siamese CBOW', but is titled "Efficient Vector Representation for Documents Through Corruption", https://openreview.net/pdf?id=B1Igu2ogg, by Minmin Chen. (Some code, apparently based on a "-sentence-vectors" patch once released by Mikolov, is at https://github.com/mchen24/iclr2017.)

@souravsingh
Contributor Author

So do we wait until we have native FastText in Gensim before proceeding with the PR?

@martinjaggi

@gojomo Thanks a lot for the pointer! SemEval results are in their Table 14. Our reported PV results are from [1]. A very large window for CBOW is a good idea, and it's included in our code and experiments, but it is not enough to get the improvements of sent2vec (hyperparams for CBOW were carefully tuned as well: dim = 600, ws = 10, ep = 5, lr = 0.07, and t = 10^-5; see Section 4 of the arXiv v1).

[1] Hill, Felix, Cho, Kyunghyun, and Korhonen, Anna. Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT, February 2016.

@souravsingh Is the wrapper here now mostly compatible with the fasttext wrapper? (The two codebases are extremely similar.)

@gojomo
Collaborator

gojomo commented Jul 11, 2017

@martinjaggi Aha, I'd overlooked SemEval's Table 14 as an "en-en" comparison. Still, it's evaluating a single underdocumented/unoptimized/suspiciously small downloaded PV-DBOW model not from the originators (or even heavy users) of the method. It also seems like participants were generally encouraged to use the STS-specific training data to prepare their models – but there's no evidence the PV-DBOW model used anything but generic Wikipedia article texts. So it looks to me like an unfair and unreliable evaluation.

Looking at Hill/Cho/Korhonen's "Learning Distributed Representations of Sentences from Unlabelled Data", it also has serious errors in evaluation. They only used 100 dimensions for their PV tests – very constrained, especially for modeling 70 million sentences, while other models they compared against were allowed 500-4800 dimensions. There's no mention of searching for optimal parameters other than dimension size. But most seriously – fatally, in my opinion – they only used one epoch over the 70M sentences. I'm frankly surprised the model did anything at all with that little training. PV papers use 10-20 epochs or more.

I have further reservations with any "Toronto Books"-based evaluations: that corpus is ordered sentences from about 7000 mostly-fiction books from unpublished authors, with the largest single category mentioned being "Romance", with 2,865 amateur romance novels included. To use this data for semantic-similarity testing among other news/non-fiction sentences seems really fishy to me.

So, sent2vec with 600+ dimensions, 3-13 epochs, other creator-tuned metaparameters, and (in some cases) much more appropriate training data is compared against PV with just 100 dimensions, 1 epoch, no apparent other tuning, and only trained on a dataset of amateur fiction. That's hardly a fair comparison.

Given these problems, I'm unable to draw any conclusions about PV's relative performance based on your referenced works.

@martinjaggi

It's always best to just do some benchmarking. Luckily, you have all the algorithms in gensim already, so a simple benchmark should be easy to set up, as @souravsingh has suggested. Here is a convenient one, for example: https://github.com/facebookresearch/SentEval (it doesn't have STS 2017 yet, but most from previous years)
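An STS-style comparison of the kind SentEval runs boils down to correlating model-predicted sentence similarities with gold human scores. A minimal numpy sketch with toy vectors (not SentEval's actual API; any sentence encoder producing vectors could plug in here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_score(pairs, gold):
    """Pearson correlation between cosine similarities and gold scores."""
    sims = [cosine(u, v) for u, v in pairs]
    return float(np.corrcoef(sims, gold)[0, 1])

# Toy sentence vectors standing in for sent2vec / Doc2Vec embeddings.
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.1])),
         (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
         (np.array([1.0, 1.0]), np.array([1.0, 0.9]))]
gold = [5.0, 0.5, 4.5]
score = sts_score(pairs, gold)  # close to 1.0 for this toy data
```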

@souravsingh
Contributor Author

We can wait for #1482 to finish before proceeding with this PR.

@menshikh-iv
Contributor

menshikh-iv commented Sep 14, 2017

@souravsingh #1482 was never finished; the current fasttext PR is #1525.
What's the status of this PR?

@souravsingh
Contributor Author

I am waiting on the FastText model at #1525 to be merged (which should happen soon). Once that is done, we can inherit the class from FastText and make some fixes.

Contributor

@menshikh-iv menshikh-iv left a comment

Please add tests for this wrapper

import logging

import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL
Contributor

A lot of unused imports, looks like copy-paste

import numpy as np
from numpy import zeros, ones, vstack, sum as np_sum, empty, float32 as REAL

from gensim.models.word2vec import Word2Vec, train_sg_pair, train_cbow_pair
Contributor

unused imports too

"""

def initialize_word_vectors(self):
self.wv = Sent2VecKeyedVectors()
Contributor

What is Sent2VecKeyedVectors? I don't see this class anywhere.

@menshikh-iv
Contributor

What's the status here, @souravsingh?

@souravsingh
Contributor Author

I will revisit the issue later once I have a concrete idea on the model. Closing the issue for now.

@souravsingh souravsingh closed this Oct 5, 2017
@souravsingh souravsingh deleted the add-sent2vec branch October 5, 2017 12:43