Segmentation faults with a small corpus #37

Closed
nricke opened this issue Jul 10, 2015 · 4 comments

nricke commented Jul 10, 2015

Hi,

I can't get KenLM working on my corpus.

I've followed the usual steps:
./bin/lmplz -T /tmp/ --text corpus.txt --arpa myarpa.arpa
./bin/build_binary myarpa.arpa my_probing_model.mmap

Then I tried the snippet from here:
https://kheafield.com/code/kenlm/developers/

With a TrieModel, it always ends with a segfault, regardless of MAX_ORDER. The error occurs here:

lm::ngram::trie::TrieSearch<lm::ngram::DontQuantize, lm::ngram::trie::DontBhiksha>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) ()

With a ProbingModel, I get a segfault only for MAX_ORDER < 5:

lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::ResumeScore(unsigned int const*, unsigned int const*, unsigned char, unsigned long&, float*, unsigned char&, lm::FullScoreReturn&)

For MAX_ORDER = 5, the C++ program runs to completion, but Valgrind reports a couple of errors:

==3445== Invalid write of size 8
==3445==    at 0x411B1A: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x409920: lm::ngram::ProbingModel::ProbingModel(char const*, lm::ngram::Config const&) (model.hh:136)

Invalid write of size 8
==3445==    at 0x43A06B: lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411515: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::SetupMemory(void*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411FC0: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)

However, a JNA wrapper around the same snippet fails with "malloc(): memory corruption" when loading the model.

I tried with and without pruning, with order 2 and 3, and with both the KenLM from the download section and the version on GitHub. The corpus is about 1 GB.
One peculiarity of the vocabulary is that it contains A LOT of words that are substrings of other words in the vocabulary.

I'm aware that this is probably not enough information for proper debugging, but I would like to know whether the Valgrind errors are expected and whether you can suggest anything to help me find the problem.

My system is Mint 17. The compilation succeeded with no warnings.

Owner

kpu commented Jul 10, 2015

Does the query program work as expected?

There are a lot of unaligned accesses by design; are you using x86_64?

Owner

kpu commented Jul 10, 2015

Oh also if you specify a vocab id that's out of range, it reserves the right to segfault.
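
To make the out-of-range point concrete, here is a minimal sketch of a guard that could run before a vocab id is handed to the scorer. It is self-contained and does not include the KenLM headers: WordIndex is a local stand-in for KenLM's word id type, and IsValidWordIndex and CheckedWordIndex are hypothetical helper names, not part of KenLM. The bound argument is assumed to be one past the largest valid id; how a real program obtains it from the loaded model's vocabulary is left as an assumption here.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Stand-in for KenLM's word id type, declared locally so this sketch
// compiles without the KenLM headers.
typedef std::uint32_t WordIndex;

// Hypothetical helper: true when `id` is a usable vocabulary id.
// `bound` is assumed to be one past the largest valid id.
inline bool IsValidWordIndex(WordIndex id, WordIndex bound) {
  return id < bound;
}

// Hypothetical helper: throw instead of passing a bad id on to the scorer,
// where it could cause an invalid memory access.
inline WordIndex CheckedWordIndex(WordIndex id, WordIndex bound) {
  if (!IsValidWordIndex(id, bound)) {
    throw std::out_of_range("vocab id out of range");
  }
  return id;
}
```

Note that comparing an id against NotFound() only tells you whether a word was unknown; it does not catch an id that came from somewhere else (another model, a stale cache, a wrapper layer) and exceeds the vocabulary size.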

Author

nricke commented Jul 13, 2015

Thanks!

I fixed my issue with TrieModel. Calling RecognizeBinary() before new TrieModel() did the trick.

As for ProbingModel, if Valgrind's invalid writes are expected, then it's fine. I guess JNA or the JVM is to blame for the segfaults.

FYI, on the TrieModel, the JVM complains here:

lm::ngram::trie::BitPackedMiddle<lm::ngram::trie::DontBhiksha>::Find(unsigned int, lm::ngram::trie::NodeRange&, unsigned long&)

I am checking the vocab id with index != model->GetVocabulary().NotFound().

I am indeed on x86_64. The query program works well.

nricke closed this as completed Jul 13, 2015
Author

nricke commented Jul 13, 2015

Actually, it was just luck that RecognizeBinary() changed anything.
I got rid of all the issues (including the JVM complaints) by compiling with MAX_ORDER=6, as in KenLM's default build configuration.
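
The thread does not say why matching the default order fixed things. One plausible reading, offered only as an assumption, is that the calling program and the library were compiled with different MAX_ORDER values, so the two sides disagreed about the size of order-dependent structures. The sketch below illustrates that general hazard with a stand-in struct; StateSketch and CallerBufferTooSmall are hypothetical names and this is not KenLM's actual definition.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in, NOT KenLM's real type: a scoring state whose
// arrays are sized by a compile-time maximum n-gram order.
template <unsigned MaxOrder> struct StateSketch {
  unsigned int words[MaxOrder - 1];  // context word ids
  float backoff[MaxOrder - 1];       // backoff weights for that context
  unsigned char length;              // number of valid context entries
};

// True when a buffer sized for CallerOrder is smaller than what code
// compiled for LibraryOrder would treat it as, i.e. writes could overrun.
template <unsigned CallerOrder, unsigned LibraryOrder>
constexpr bool CallerBufferTooSmall() {
  return sizeof(StateSketch<CallerOrder>) < sizeof(StateSketch<LibraryOrder>);
}

// A caller built for order 3 allocates less than a library built for order 6
// expects; building both sides with the same order removes the mismatch.
static_assert(CallerBufferTooSmall<3, 6>(), "order 3 state is smaller");
static_assert(!CallerBufferTooSmall<6, 6>(), "matching orders agree");
```

If this is what happened, the invalid writes and the "malloc(): memory corruption" message would be consistent with one side writing past a buffer the other side allocated, which is why building everything with the same order is worth checking first.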
