New models trained with Google Word2Vec not processed correctly #2
Comments
Hi, thank you for the input. It seems that it only loads ASCII. I'll try to
Best, Michael.
On Tue, Feb 9, 2016 at 11:52 AM, cjmcmurtrie notifications@github.com
Hey, thanks for getting back, Michael. I'll also be working on this and will let you know if I make any progress.
I have an update regarding this. The models trained straight from the Google C codebase did not read correctly. However, the following steps made it possible to load them into Torch using your code, with a UTF-8 Unicode vocabulary:
1. Train the model with the Google C script word2vec.c and save it as binary.
3. Load the model in Torch with bintot7.lua.
Following this procedure loads the word vectors correctly, for example:
Furthermore, inspecting the contents of the Google-trained model shows that the vocabulary consists of lists of Unicode character codes, rather than byte strings:
What do you think? Does this clarify anything at all?
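To make the distinction between character codes and byte strings concrete, here is a small Python sketch (not Torch code, and not taken from bintot7.lua). The sample word is made up for illustration, and the sketch assumes the training corpus was UTF-8 encoded, so that the bytes on disk are UTF-8.

```python
# Illustration only: the sample word is an assumption, not from the thread.
word = "naïve"

# The bytes as they would appear on disk for a UTF-8 corpus.
raw = word.encode("utf-8")

# Two different views of the same word:
code_points = [ord(c) for c in word]  # one integer per character
byte_values = list(raw)               # one integer per byte

# Treating every byte as its own character splits the two-byte
# sequence for the accented letter into two unrelated characters.
per_byte = "".join(chr(b) for b in raw)

# Decoding the whole byte string as UTF-8 recovers the original word.
decoded = raw.decode("utf-8")

print(code_points)
print(byte_values)
print(per_byte == word, decoded == word)
```

A loader that maps bytes to characters one at a time would produce the `per_byte` form for any non-ASCII word, which may be what a garbled dictionary looks like.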
Hi cjmcmurtrie, this is great. Do you mind if I add your solution to the README? Thanks, Michael.
Hi there, thanks for this very useful tool.
This seems to work perfectly with the pre-trained Google Word2Vec model, but I am having issues processing new models that I trained using that code.
The models trained with word2vec.c (saved as binary) work correctly in the demos implemented and provided by Mikolov in the package, e.g.:
However, when I try to port the models into my Torch programs, I get a dictionary of vectors such as the following:
It seems to me that the code in bintot7.lua is trying to process the binary strings as ASCII rather than UTF-8. In your code, are you explicitly decoding the binary strings to ASCII, rather than UTF-8/Unicode? Do you know anything about this and how we could fix it?
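For reference, a minimal Python sketch of reading the word2vec.c binary layout with the vocabulary decoded as UTF-8. This is not the bintot7.lua implementation. The function name `read_word2vec_binary` and the helper `_entry` are hypothetical, the sample words and values are made up, and the sketch assumes 4-byte little-endian floats, which is what word2vec.c writes on typical x86 machines.

```python
import io
import struct


def read_word2vec_binary(stream, encoding="utf-8"):
    """Read a word2vec.c-style binary stream into a {word: vector} dict.

    Layout assumed: an ASCII header line "<vocab_size> <dim>", then for
    each word its raw bytes, a space, <dim> float32 values and a newline.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        raw = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                raw.extend(ch)
        # Decode the whole byte string at once, not byte by byte.
        word = raw.decode(encoding)
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word] = struct.unpack("<%df" % dim, data)
    return vectors


def _entry(word, values):
    """Hypothetical helper that builds one binary entry for the demo."""
    packed = struct.pack("<%df" % len(values), *values)
    return word.encode("utf-8") + b" " + packed + b"\n"


# In-memory sample standing in for a real model file.
sample = (
    b"2 3\n"
    + _entry("café", [1.0, 2.0, 3.0])
    + _entry("naïve", [0.5, -0.5, 0.25])
)
model = read_word2vec_binary(io.BytesIO(sample))
print(sorted(model))
```

The key design choice is collecting all bytes of a word before decoding, so multi-byte UTF-8 sequences stay intact. The vector is read as a fixed-size block, so bytes inside it that happen to equal a space or newline are never mistaken for delimiters.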