Segmentation faults with a small corpus #37

nricke · 2015-07-10T14:22:46Z

Hi,

I can't get KenLM working on my corpus.

I've followed the usual steps:
./bin/lmplz -T /tmp/ --text corpus.txt --arpa myarpa.arpa
./bin/build_binary myarpa.arpa my_probing_model.mmap

Then I tried the snippet from here:
https://kheafield.com/code/kenlm/developers/

With a TrieModel, it always ends with a segfault, regardless of MAX_ORDER. The error occurs here:

lm::ngram::trie::TrieSearch<lm::ngram::DontQuantize, lm::ngram::trie::DontBhiksha>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) ()

With a ProbingModel, I get a segfault only for MAX_ORDER < 5:

lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::ResumeScore(unsigned int const*, unsigned int const*, unsigned char, unsigned long&, float*, unsigned char&, lm::FullScoreReturn&)

For MAX_ORDER = 5, the C++ program runs, but Valgrind reports a couple of errors:

==3445== Invalid write of size 8
==3445==    at 0x411B1A: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x409920: lm::ngram::ProbingModel::ProbingModel(char const*, lm::ngram::Config const&) (model.hh:136)

Invalid write of size 8
==3445==    at 0x43A06B: lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>::SetupMemory(unsigned char*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411515: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::SetupMemory(void*, std::vector<unsigned long, std::allocator<unsigned long> > const&, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)
==3445==    by 0x411FC0: lm::ngram::detail::GenericModel<lm::ngram::detail::HashedSearch<lm::ngram::BackoffValue>, lm::ngram::ProbingVocabulary>::GenericModel(char const*, lm::ngram::Config const&) (in /home/romain/dev/keukeyVoice/corpora/kenlm/ktest)

However, a JNA wrapper around the same snippet fails with "malloc(): memory corruption" when loading the model.

I tried with and without pruning, with orders 2 and 3, and with both the KenLM from the download section and the one on GitHub. The corpus is about 1 GB.
One peculiarity of the vocabulary is that it contains A LOT of words that are substrings of other words in the vocabulary.

I'm aware that this is probably not enough information for proper debugging, but I would like to know whether the Valgrind errors are expected, and whether you can suggest anything that would help me find the problem.

My system is Mint 17. The compilation succeeded with no warning.


kpu · 2015-07-10T14:49:12Z

Does the query program work as expected?

There are a lot of unaligned accesses by design; are you using x86_64?
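To make the remark about unaligned accesses concrete, here is a minimal sketch of reading and writing a 64-bit value at an arbitrary byte offset. The helper names `ReadUnaligned64` and `WriteUnaligned64` are hypothetical and are not taken from KenLM; the sketch only illustrates the kind of access that packed data structures tend to perform and that x86_64 tolerates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not KenLM code): read a 64-bit value that starts at an
// arbitrary byte offset. memcpy keeps the access well-defined even when
// base + offset is not 8-byte aligned.
inline std::uint64_t ReadUnaligned64(const unsigned char *base, std::size_t offset) {
  std::uint64_t value;
  std::memcpy(&value, base + offset, sizeof(value));
  return value;
}

// Hypothetical helper (not KenLM code): write a 64-bit value at an arbitrary
// byte offset. The caller must guarantee offset + 8 <= buffer size.
inline void WriteUnaligned64(unsigned char *base, std::size_t offset, std::uint64_t value) {
  std::memcpy(base + offset, &value, sizeof(value));
}
```

An unaligned access of this kind is harmless as long as it stays inside the allocated buffer. An 8-byte write that lands past the end of the buffer is a different matter, and that is the case a memory checker flags.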

kpu · 2015-07-10T14:52:34Z

Oh also if you specify a vocab id that's out of range, it reserves the right to segfault.

nricke · 2015-07-13T03:13:30Z

Thanks!

I fixed my issue with TrieModel. Calling RecognizeBinary() before new TrieModel() did the trick.

As for ProbingModel, if Valgrind's invalid writes are expected, then that's fine. I guess JNA or the JVM is to blame for the segfaults.

FYI, on the TrieModel, the JVM complains here:

lm::ngram::trie::BitPackedMiddle<lm::ngram::trie::DontBhiksha>::Find(unsigned int, lm::ngram::trie::NodeRange&, unsigned long&)

I am checking the vocab id with index != model->GetVocabulary().NotFound().
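For reference, a range check on word ids could look like the sketch below. It is deliberately self-contained: `ToyVocabulary`, its `Bound()` method, and `IsUsableId` are hypothetical stand-ins written for illustration and are not KenLM's classes. The point is only that comparing against the not-found id tells you whether a word was unknown, while a separate comparison against the vocabulary size tells you whether an id (for example one that crossed a JNA boundary) is in range at all.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

typedef std::uint32_t WordIndex;  // Stand-in type, assumed 32-bit for this sketch.

// Hypothetical toy vocabulary, NOT the KenLM class. Id 0 is reserved for
// unknown words, and valid ids are 0 .. Bound() - 1.
class ToyVocabulary {
 public:
  ToyVocabulary() { Add("<unk>"); }

  // Returns the existing id for a word, or assigns the next free id.
  WordIndex Add(const std::string &word) {
    std::unordered_map<std::string, WordIndex>::const_iterator it = ids_.find(word);
    if (it != ids_.end()) return it->second;
    WordIndex id = static_cast<WordIndex>(ids_.size());
    ids_[word] = id;
    return id;
  }

  WordIndex NotFound() const { return 0; }
  WordIndex Bound() const { return static_cast<WordIndex>(ids_.size()); }

  // Looks a word up and falls back to the unknown-word id.
  WordIndex Index(const std::string &word) const {
    std::unordered_map<std::string, WordIndex>::const_iterator it = ids_.find(word);
    return it == ids_.end() ? NotFound() : it->second;
  }

 private:
  std::unordered_map<std::string, WordIndex> ids_;
};

// True when the id can safely be used to index per-word tables.
inline bool IsUsableId(const ToyVocabulary &vocab, WordIndex id) {
  return id < vocab.Bound();
}
```

In this toy setup an id returned by `Index()` is always usable, so the extra check matters mainly for ids that were stored, serialized, or passed through a foreign-function layer before reaching the scoring call.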

I am indeed on x86_64, and the query program works well.

nricke · 2015-07-13T05:02:09Z

Actually, it was just luck that calling RecognizeBinary() changed anything.
I got rid of all the issues, including the JVM complaints, by compiling with MAX_ORDER=6, which is the default in KenLM's build configuration.
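One plausible reading of this fix is that the library and the calling program were compiled with different MAX_ORDER values, so the two sides disagreed about the size of order-dependent structures. The sketch below illustrates that failure mode in isolation. `ToyState` is a hypothetical struct whose arrays are sized by the maximum order; it is an assumption made for illustration and is not copied from KenLM's headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical state object whose size depends on a compile-time maximum
// order. This is an illustrative stand-in, not KenLM's definition.
template <unsigned MaxOrder> struct ToyState {
  std::uint32_t words[MaxOrder - 1];  // context word ids
  float backoff[MaxOrder - 1];        // backoff weights for the context
  unsigned char length;               // number of valid context entries
};

// True when a buffer laid out for CallerOrder can hold what code compiled
// for LibraryOrder will write into it.
template <unsigned LibraryOrder, unsigned CallerOrder>
constexpr bool CallerBufferIsLargeEnough() {
  return sizeof(ToyState<CallerOrder>) >= sizeof(ToyState<LibraryOrder>);
}
```

If the caller allocates `ToyState<3>` while the library writes a `ToyState<6>`, the write runs past the caller's object. That would be consistent with symptoms such as invalid writes, heap corruption, or a segfault that depends on the order chosen, although the thread itself does not confirm the root cause.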

nricke closed this as completed Jul 13, 2015

Gavin90s mentioned this issue Mar 11, 2021

core dump occurred when load LM model #328

Closed
