Spellcheck does not recognise words containing hyphens #184

sbraz · 2021-03-10T22:44:27Z

Hi,
I'm not sure how to work around this issue, but I see that the spell checker's tokenizer splits words on hyphens:

Line 255 in 30a2ed8

word = re.split(r"(?!')\W+", self.text[i:])[0]

This can break spellchecking for the following French subtitle:

1
00:00:00,000 --> 00:00:03,000
twin-set talkie-walkie
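
For reference, the split can be reproduced outside of aeidon with just the regex from the line quoted above. This is only a sketch: the helper name first_token is made up for illustration and is not part of aeidon.

```python
import re

def first_token(text, i=0):
    # Same expression as the tokenizer line quoted above: split on runs of
    # non-word characters, except where the run would start at an apostrophe.
    return re.split(r"(?!')\W+", text[i:])[0]

text = "twin-set talkie-walkie"
print(first_token(text))     # twin
print(first_token(text, 9))  # talkie
```

So the checker only ever sees "twin" and "talkie", never the hyphenated forms.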

Although both words are present in the dictionary, they won't be recognised because neither twin nor talkie is listed on its own:

$ grep -P '^(twin|talkie)' /usr/share/myspell/dicts/fr_FR.dic
talkies-walkies/D'Q' po:nom is:mas is:pl
talkie-walkie/L'D'Q' po:nom is:mas is:sg
twin-set/S.() po:nom is:mas

I think splitting on hyphens is a good idea in general, but maybe we could expand the selection to the full word when the checker reports a mistake. It doesn't seem very straightforward; do you suppose it's worth it?
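
A rough sketch of what "expanding the selection" could look like, assuming the tokenizer gives us the start and end offsets of the rejected token. The function name expand_over_hyphens and the offset-based interface are assumptions for illustration, not existing aeidon code.

```python
import re

def expand_over_hyphens(text, start, end):
    """Return (start, end) widened to the whole hyphen-joined word."""
    # Extend left while the token is directly preceded by "<word chars>-".
    while True:
        m = re.search(r"\w+-\Z", text[:start])
        if m is None:
            break
        start = m.start()
    # Extend right while the token is directly followed by "-<word chars>".
    while True:
        m = re.match(r"-\w+", text[end:])
        if m is None:
            break
        end += m.end()
    return start, end

text = "un vol-au-vent chaud"
start, end = expand_over_hyphens(text, 7, 9)  # offsets of "au"
print(text[start:end])  # vol-au-vent
```

The expanded word would only need to be checked after the plain token has failed, so text without hyphens would take the same path as today.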

Off-topic question: does the name "aeidon" mean anything?

otsaloma · 2021-03-12T20:33:01Z

It's not fixable in the tokenizer, but in the SpellChecker class we do already have some special cases. Currently one character of leading and trailing context is used. Technically that could be extended to cover this, but I'm not really convinced it's worth it. I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

https://github.com/otsaloma/gaupol/blob/master/aeidon/spell.py#L61

Off-topic question: does the name "aeidon" mean anything?

I had to pick a name when separating that user-interface independent module from the codebase. Gaupol doesn't mean anything either, so I continued in the same style. I also wanted the length to match, so that I could do a search-and-replace across the codebase without needing to manually fix hanging indents.

sbraz · 2021-03-12T23:11:56Z

I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

Here are some ugly one-liners that seem to show words that would not be recognised:

$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/fr_FR.dic -o | tr '-' '\n' | aspell --list -l fr | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/fr_FR.dic;done | sort -u | wc -l
1315
$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/en_GB.dic -o | tr '-' '\n' | aspell --list -l en | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/en_GB.dic;done | sort -u | wc -l
892

There are probably some false positives but it's still not negligible IMO. Do you think it would affect performance a lot to take those into account?

If we do, we also need to take into account words like vol-au-vent for which we'd need to add context in both directions.
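
For anyone without aspell at hand, the counting above can be approximated in Python. Note that this is a different technique from the one-liners: it treats the .dic file itself as the list of known words instead of asking aspell, and it ignores affix rules, so the numbers will not match exactly. The function name is made up, and the sample lines are the three entries from the grep output in the first post.

```python
def unreachable_hyphenated(dic_lines):
    """Return hyphenated entries with at least one part that is not an entry itself."""
    # Strip affix flags and morphological data:
    # "twin-set/S.() po:nom is:mas" -> "twin-set"
    entries = set()
    for line in dic_lines:
        fields = line.split("/")[0].split()
        if fields:
            entries.add(fields[0])
    missing = set()
    for entry in entries:
        parts = entry.split("-")
        if len(parts) > 1 and any(part not in entries for part in parts):
            missing.add(entry)
    return sorted(missing)

sample = [
    "talkies-walkies/D'Q' po:nom is:mas is:pl",
    "talkie-walkie/L'D'Q' po:nom is:mas is:sg",
    "twin-set/S.() po:nom is:mas",
]
print(unreachable_hyphenated(sample))
# ['talkie-walkie', 'talkies-walkies', 'twin-set']
```

Unlike the greps, this also picks up entries with more than one hyphen, such as vol-au-vent.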

otsaloma · 2021-03-14T20:25:37Z

I can't run those greps: my Debian doesn't seem to have myspell, only hunspell files, and they probably have a different format.

I think maybe we could make the function signature

def check(self, word, extended_word="", leading_context="", trailing_context=""):

extended_word would then extend the word in both directions, at least across hyphens. It's doable of course. I don't see a performance issue there, just a question of how much to complicate the code for special cases.
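
To make the idea concrete, here is a toy version of that signature. It is only a sketch: ToyChecker and its set-based word list stand in for the real dictionary backend, the fallback order (plain word first, then extended_word) is an assumption, and the two context parameters are accepted but ignored because their current handling is not shown in this thread.

```python
class ToyChecker:
    """Stand-in for a real spell checker; the word list is a plain set."""

    def __init__(self, words):
        self.words = set(words)

    def check(self, word, extended_word="", leading_context="", trailing_context=""):
        # Signature as proposed above; the fallback logic is an assumption.
        if word in self.words:
            return True
        # Fall back to the hyphen-extended form only when one was given.
        return bool(extended_word) and extended_word in self.words

checker = ToyChecker(["twin-set", "talkie-walkie", "vol-au-vent"])
print(checker.check("twin"))                            # False
print(checker.check("twin", extended_word="twin-set"))  # True
print(checker.check("au", extended_word="vol-au-vent")) # True
```

Callers that don't pass extended_word keep the existing behaviour, since the default is an empty string.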
