getvocab issue using french text #30

ctoffo · 2019-10-24T15:03:07Z

Hello guys,

I ran getvocab on a French text with the following command:
./fast getvocab marie_claire.txt > new_vocab

However, I noticed what may be a bug: some tokens appear twice, and the second copy is written with a line break. Here is an example (just a short extract from the start of the full vocab output):

You can see this for et and de in the example above. Furthermore, the vocab file starts exactly as shown: a line break, a space, and the frequency (2439). Is this also a bug?

Here is the French text:
wget -O marie_claire.txt http://www.gutenberg.org/cache/epub/58501/pg58501.txt

Any ideas?

Thanks a lot for your help :)
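
One way to narrow down a symptom like this is to check whether the input text contains whitespace other than plain spaces and line feeds (for example carriage returns or no-break spaces). The sketch below is only a diagnostic, not a confirmed explanation of the getvocab output. It assumes the file is UTF-8 encoded; the helper names `unusual_whitespace`, `describe`, and `scan_file` are hypothetical and not part of the tool.

```python
import unicodedata
from collections import Counter


def unusual_whitespace(text: str) -> Counter:
    """Count whitespace characters other than plain space and line feed."""
    counts = Counter()
    for ch in text:
        if ch.isspace() and ch not in (" ", "\n"):
            counts[ch] += 1
    return counts


def describe(counts: Counter) -> dict:
    """Label each character with its code point and Unicode name.

    Control characters such as carriage return have no Unicode name,
    so their Python repr is used as a fallback.
    """
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, repr(ch))}": n
        for ch, n in counts.items()
    }


def scan_file(path: str) -> dict:
    """Scan a text file (assumed UTF-8) without translating line endings."""
    # newline="" keeps any "\r" characters instead of normalising them away.
    with open(path, encoding="utf-8", errors="replace", newline="") as fh:
        return describe(unusual_whitespace(fh.read()))
```

Calling `scan_file("marie_claire.txt")` returns a dictionary mapping each unusual whitespace character to its count. An empty dictionary would suggest that the input whitespace is not the cause.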

ctoffo closed this as completed Oct 29, 2019
