getvocab issue using french text #30

Closed
ctoffo opened this issue Oct 24, 2019 · 0 comments

ctoffo commented Oct 24, 2019

Hello guys,

I ran getvocab on a French text with the following command:
./fast getvocab marie_claire.txt > new_vocab

However, I noticed what may be a bug: some tokens are duplicated, and the second copy is written with a line break. Here is an example (just an excerpt of the full vocab output):

[Screenshot: Capture d’écran 2019-10-24 à 17 00 32]

You can see this for et and de in the example above. Furthermore, the vocab file itself starts the same way: a line break, a space, and then the frequency (2439). Is that a bug too?

Here is the French text:
wget -O marie_claire.txt http://www.gutenberg.org/cache/epub/58501/pg58501.txt
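
One possible explanation, not confirmed anywhere in this thread, is that the downloaded file uses Windows-style CRLF line endings. If the tokenizer splits only on spaces and `\n`, a word at the end of a line keeps a trailing `\r` and is counted as a separate token, and empty lines become a token made of `\r` alone. The sketch below only illustrates that effect on a small made-up sample; `count_tokens` and `normalize_newlines` are hypothetical helpers, not part of the tool, and the splitting rule is an assumption about how such a counter might behave.

```python
from collections import Counter


def count_tokens(text: str) -> Counter:
    # Hypothetical stand-in for a vocab counter: splits on spaces and
    # line feeds only, so a carriage return stays attached to the token.
    counts = Counter()
    for line in text.split("\n"):
        for token in line.split(" "):
            if token:
                counts[token] += 1
    return counts


def normalize_newlines(text: str) -> str:
    # Convert CRLF (and any stray CR) to plain LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Made-up sample with CRLF line endings and one empty line.
sample = "le chat et\r\nle chien et la souris\r\n\r\n"

raw = count_tokens(sample)
clean = count_tokens(normalize_newlines(sample))

# repr() makes the hidden carriage returns visible.
print(sorted(repr(token) for token in raw))
print(sorted(repr(token) for token in clean))
```

In the raw counts, `et` and `et\r` are separate entries and the empty line yields a token that is just `\r`; after normalization there is a single `et` entry. If the real file does contain carriage returns, normalizing it before running getvocab would be one thing to try.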

Any ideas?

Thanks a lot for your help :)
