
New models trained with Google Word2Vec not processed correctly #2

Open

cjmcmurtrie opened this issue Feb 9, 2016 · 4 comments

@cjmcmurtrie
Hi there, thanks for this very useful tool.

This seems to work perfectly with the pre-trained Google Word2Vec model, but I am having issues processing new models that I trained using that code.

The models trained with word2vec.c (saved as binary) work correctly in the demos implemented and provided by Mikolov in the package, e.g.:

Enter word or sentence (EXIT to break): hello

Word: hello  Position in vocabulary: 3560

                                              Word       Cosine distance
------------------------------------------------------------------------
                                                hi      0.538164
                                               hey      0.469036
                                               *(?      0.401341
                                           pedants      0.396846

However, when I try to port the models into my Torch programs, I get a dictionary of vectors such as the following:

  D?+?<u?? : FloatTensor - size: 200
  xT?<? : FloatTensor - size: 200
  ????G>???? : FloatTensor - size: 200

It seems to me that the code in bintot7.lua is processing the binary strings as ASCII rather than UTF-8. In your code, are you explicitly decoding the binary strings as ASCII rather than UTF-8/Unicode? Do you know anything about this and how we could fix it?
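For reference, the word2vec binary layout is a text header ("vocab_size dim\n") followed by, for each entry, the word's bytes, a space, and dim raw float32 values. A minimal Python sketch of a reader that decodes words as UTF-8 (illustrative only; this is not the code in bintot7.lua, and the function name is made up):

```python
import io
import struct

def load_word2vec_bin(stream):
    """Parse the word2vec binary layout, decoding each word as UTF-8."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Collect the word's bytes up to the separating space.
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b' ':
                break
            if ch not in (b'\n', b''):  # skip newlines between entries
                word_bytes.extend(ch)
        word = word_bytes.decode('utf-8')  # UTF-8, not ASCII
        vec = struct.unpack('%df' % dim, stream.read(4 * dim))
        vectors[word] = vec
    return vectors

# Demo with an in-memory file containing a Cyrillic word.
blob = (b'2 3\n'
        + u'привет'.encode('utf-8') + b' ' + struct.pack('3f', 1, 2, 3) + b'\n'
        + b'hello ' + struct.pack('3f', 4, 5, 6) + b'\n')
vectors = load_word2vec_bin(io.BytesIO(blob))
```

The key point is that the word bytes must be decoded as a whole UTF-8 sequence; treating each byte as a standalone character produces exactly the mojibake shown above.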

@rotmanmi
Owner

rotmanmi commented Feb 9, 2016

Hi,

Thank you for the input; it seems that it only loads ASCII. I'll try to detect UTF-8 encoding and implement support for it in the next few days.

Best,

Michael.


@cjmcmurtrie
Author

Hey, thanks for getting back to me, Michael. I'll also be working on this and will let you know if I make any progress.

@cjmcmurtrie
Author

I have an update regarding this.

The models trained straight from the Google C codebase did not read correctly.

However, the following steps made it possible to load them into Torch using your code, with a utf-8 unicode vocabulary:

1. Train a model with the Google C script word2vec.c and save it as binary.
2. Load the model with the Python package Gensim and save it again with Gensim:

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('full-russian//test-russian-vectors.bin', binary=True)
model.save_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)

3. Load the model in Torch with bintot7.lua.
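For intuition on why the round-trip can help: a word2vec-style binary writer emits each vocabulary word as a byte string followed by a space and the raw float32 vector, so re-saving through gensim normalizes the word bytes to clean UTF-8. A minimal sketch of that output layout (illustrative; the function name is made up, and this is not gensim's actual implementation):

```python
import io
import struct

def save_word2vec_bin(vectors, stream):
    """Write {word: [floats]} in word2vec binary layout, words as UTF-8."""
    dim = len(next(iter(vectors.values())))
    stream.write(('%d %d\n' % (len(vectors), dim)).encode('utf-8'))
    for word, vec in vectors.items():
        stream.write(word.encode('utf-8') + b' ')  # UTF-8 bytes, then a space
        stream.write(struct.pack('%df' % dim, *vec))
        stream.write(b'\n')

# Usage: write a tiny two-word model to an in-memory buffer.
buf = io.BytesIO()
save_word2vec_bin({u'второму': [0.1, 0.2], u'hello': [0.3, 0.4]}, buf)
```

Any reader that then decodes those word bytes as UTF-8 (rather than byte-by-byte ASCII) recovers the original vocabulary.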

Following this procedure loads the word vectors correctly, for example:

  второму : FloatTensor - size: 200
  уклонялся : FloatTensor - size: 200
  прозаического : FloatTensor - size: 200
  горьких : FloatTensor - size: 200

Furthermore, inspecting the contents of the Google-trained model shows that the vocabulary entries are Unicode strings (displayed as escaped character codes) rather than raw byte strings:

print model.most_similar(['софьи'.decode('utf8')])
>>> [(u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b', 0.5405951738357544), (u'\u0435\u0432\u0434\u043e\u043a\u0438\u044f', 0.4162743091583252), ...
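(Those u'\u...' sequences are just how Python 2's repr displays Unicode strings; the escapes decode to ordinary Russian words. For example:)

```python
# -*- coding: utf-8 -*-
# The first result above, decoded: the escapes spell a normal Russian word.
word = u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b'
print(word)  # царевны
```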

What do you think? Does this clarify anything at all?

@rotmanmi
Owner

rotmanmi commented Mar 7, 2016
rotmanmi commented Mar 7, 2016

Hi cjmcmurtrie,

This is great. Do you mind if I add your solution to the README?

Thanks,

Michael.

2 participants