Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Str to byte fix for DTM wrapper. #768

Merged
merged 3 commits into from
Jul 1, 2016
Merged

Conversation

bhargavvader
Copy link
Contributor

@piskvorky , this is with respect to #698.

You were right, all that had to be done was use utils.to_utf8 before writing to file.
Works fine with both python2 and 3.

@bhargavvader
Copy link
Contributor Author

Note: would ideally want to convert to bytes from the list itself instead of converting to string and then to byte.

@bhargavvader
Copy link
Contributor Author

@tmylk , I can't directly convert from integer to bytes while writing to file, it has to be to string first.
For e.g, bytes(4) is b'\x00\x00\x00\x00', but str(4).encode('utf-8') is b'4' as expected.

Is there any idea for a workaround to this?

@tmylk tmylk merged commit 6980987 into piskvorky:develop Jul 1, 2016
@@ -173,9 +173,9 @@ def convert_input(self, corpus, time_slices):
corpora.BleiCorpus.save_corpus(self.fcorpustxt(), corpus)

with utils.smart_open(self.ftimeslices(), 'wb') as fout:
fout.write(six.u(str(len(self.time_slices)) + "\n"))
fout.write(six.u(utils.to_utf8(str(len(self.time_slices)) + "\n")))
Copy link
Owner

@piskvorky piskvorky Jul 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this. Doesn't six.u return unicode? How does this work @tmylk ?

@piskvorky
Copy link
Owner

piskvorky commented Jul 2, 2016

@bhargavvader we want to store binary strings, not "bytes" as such. We don't want to be storing numbers as raw bytes here, as in b'\x00\x00\x00\x04'. We want to store their (binary) string representation, as in b'4'. We just don't want to be storing these strings as unicode, as in u'4'.

Integers are always representable in ASCII encoding (subset of utf8), so the bytes/unicode conversion is no problem.

@bhargavvader bhargavvader deleted the DTM_bug branch July 2, 2016 08:26
@bhargavvader
Copy link
Contributor Author

bhargavvader commented Jul 2, 2016

@piskvorky , here it says that six.u returns unicode in python 2 and string in python 3.

I understand we'd want to store it in binary strings, yeah.
So can the six.u be removed? Can it just be utils.to_utf8(str()) in all places?

Edit: Put in a PR and removed six.u. Works fine.

@piskvorky
Copy link
Owner

I don't understand how this version with storing unicode to binary files even worked. It means our unit tests must be faulty / incomplete.

@klvbdmh
Copy link

klvbdmh commented Aug 10, 2016

Were the changes pushed to PyPI? Pip installs version 0.13.1. and the bug still persists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants