
New models trained with Google Word2Vec not processed correctly #2

Open

cjmcmurtrie opened this issue Feb 9, 2016 · 4 comments

@cjmcmurtrie
Hi there, thanks for this very useful tool.

This seems to work perfectly with the pre-trained Google Word2Vec model, but I am having issues processing new models that I trained using that code.

The models trained with word2vec.c (saved as binary) work correctly in the demos implemented and provided by Mikolov in the package, e.g.:

Enter word or sentence (EXIT to break): hello

Word: hello  Position in vocabulary: 3560

                                              Word       Cosine distance
------------------------------------------------------------------------
                                                hi      0.538164
                                               hey      0.469036
                                               *(?      0.401341
                                           pedants      0.396846

However, when I try to port the models into my Torch programs, I get a dictionary of vectors such as the following:

  D?+?<u?? : FloatTensor - size: 200
  xT?<? : FloatTensor - size: 200
  ????G>???? : FloatTensor - size: 200

It seems to me that the code in bintot7.lua is processing the binary strings as ASCII rather than UTF-8. In your code, are you explicitly decoding the binary strings as ASCII rather than UTF-8/Unicode? Do you know anything about this and how we could fix it?
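For reference, the word2vec binary layout is a text header ("vocab_size dim\n") followed by, for each entry, the word's bytes, a space, and dim raw float32 values. A minimal Python sketch of a reader that decodes words as UTF-8 (illustrative only; this is not the code in bintot7.lua, and the function name is made up):

```python
import io
import struct

def load_word2vec_bin(stream):
    """Parse the word2vec binary layout, decoding each word as UTF-8."""
    vocab_size, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Collect the word's bytes up to the separating space.
        word_bytes = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b' ':
                break
            if ch not in (b'\n', b''):  # skip newlines between entries
                word_bytes.extend(ch)
        word = word_bytes.decode('utf-8')  # UTF-8, not ASCII
        vec = struct.unpack('%df' % dim, stream.read(4 * dim))
        vectors[word] = vec
    return vectors

# Demo with an in-memory file containing a Cyrillic word.
blob = (b'2 3\n'
        + u'привет'.encode('utf-8') + b' ' + struct.pack('3f', 1, 2, 3) + b'\n'
        + b'hello ' + struct.pack('3f', 4, 5, 6) + b'\n')
vectors = load_word2vec_bin(io.BytesIO(blob))
```

The key point is that the word bytes must be decoded as a whole UTF-8 sequence; treating each byte as a standalone character produces exactly the mojibake shown above.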

@rotmanmi
Owner

rotmanmi commented Feb 9, 2016

Hi,

Thank you for the input; it seems that it only loads ASCII. I'll try to detect UTF-8 encoding and implement support for it in the next few days.

Best,

Michael.


@cjmcmurtrie
Author

Hey, thanks for getting back to me, Michael. I'll also be working on this and will let you know if I make any progress.

@cjmcmurtrie
Author

I have an update regarding this.

The models trained straight from the Google C codebase did not read correctly.

However, the following steps made it possible to load them into Torch using your code, with a utf-8 unicode vocabulary:

1. Train a model with the Google C script word2vec.c and save it as binary.
2. Load the model with the Python package Gensim and save it again with Gensim:

from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('full-russian//test-russian-vectors.bin', binary=True)
model.save_word2vec_format('full-russian//test-russian-vectors-gensimsaved.bin', binary=True)

3. Load the model in Torch with bintot7.lua.
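For intuition on why the round-trip can help: a word2vec-style binary writer emits each vocabulary word as a byte string followed by a space and the raw float32 vector, so re-saving through gensim normalizes the word bytes to clean UTF-8. A minimal sketch of that output layout (illustrative; the function name is made up, and this is not gensim's actual implementation):

```python
import io
import struct

def save_word2vec_bin(vectors, stream):
    """Write {word: [floats]} in word2vec binary layout, words as UTF-8."""
    dim = len(next(iter(vectors.values())))
    stream.write(('%d %d\n' % (len(vectors), dim)).encode('utf-8'))
    for word, vec in vectors.items():
        stream.write(word.encode('utf-8') + b' ')  # UTF-8 bytes, then a space
        stream.write(struct.pack('%df' % dim, *vec))
        stream.write(b'\n')

# Usage: write a tiny two-word model to an in-memory buffer.
buf = io.BytesIO()
save_word2vec_bin({u'второму': [0.1, 0.2], u'hello': [0.3, 0.4]}, buf)
```

Any reader that then decodes those word bytes as UTF-8 (rather than byte-by-byte ASCII) recovers the original vocabulary.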

Following this procedure loads the word vectors correctly, for example:

  второму : FloatTensor - size: 200
  уклонялся : FloatTensor - size: 200
  прозаического : FloatTensor - size: 200
  горьких : FloatTensor - size: 200

Furthermore, inspecting the contents of the Google-trained model shows that the vocabulary entries are Unicode strings (displayed as escaped character codes) rather than raw byte strings:

print model.most_similar(['софьи'.decode('utf8')])
>>> [(u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b', 0.5405951738357544), (u'\u0435\u0432\u0434\u043e\u043a\u0438\u044f', 0.4162743091583252), ...
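(Those u'\u...' sequences are just how Python 2's repr displays Unicode strings; the escapes decode to ordinary Russian words. For example:)

```python
# -*- coding: utf-8 -*-
# The first result above, decoded: the escapes spell a normal Russian word.
word = u'\u0446\u0430\u0440\u0435\u0432\u043d\u044b'
print(word)  # царевны
```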

What do you think? Does this clarify anything at all?

@rotmanmi
Owner

rotmanmi commented Mar 7, 2016
rotmanmi commented Mar 7, 2016

Hi cjmcmurtrie,

This is great. Do you mind if I add your solution to the README?

Thanks,

Michael.

2 participants