Spellcheck does not recognise words containing hyphens #184

Open
sbraz opened this issue Mar 10, 2021 · 3 comments

Comments

@sbraz
Contributor

sbraz commented Mar 10, 2021

Hi,
I'm not sure how to work around this issue but I see that the spell checker tokenizer splits words on hyphens:

word = re.split(r"(?!')\W+", self.text[i:])[0]

This can break spellchecking for the following French subtitle:

1
00:00:00,000 --> 00:00:03,000
twin-set talkie-walkie

Although both words are present in the dictionary, they won't be recognised because neither twin nor talkie is listed on its own:

$ grep -P '^(twin|talkie)' /usr/share/myspell/dicts/fr_FR.dic
talkies-walkies/D'Q' po:nom is:mas is:pl
talkie-walkie/L'D'Q' po:nom is:mas is:sg
twin-set/S.() po:nom is:mas
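To make the splitting concrete, here is a minimal sketch that applies the same split pattern to the subtitle text above. The `tokenize` helper is hypothetical and only mimics scanning the text position by position; it is not the actual aeidon tokenizer.

```python
import re

def tokenize(text):
    """Yield (start, word) pairs using the split pattern quoted above.

    Hypothetical helper for illustration, not aeidon's real tokenizer.
    """
    i = 0
    while i < len(text):
        # Same expression as in the issue: take everything up to the
        # first run of non-word characters (hyphens included).
        word = re.split(r"(?!')\W+", text[i:])[0]
        if word:
            yield i, word
            i += len(word)
        else:
            # The text at i starts with a separator; skip one character.
            i += 1

text = "twin-set talkie-walkie"
words = [word for _, word in tokenize(text)]
print(words)  # ['twin', 'set', 'talkie', 'walkie']
```

Each half is then checked separately, which is why twin and talkie get flagged.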

I think splitting on hyphens is a good idea in general, but maybe we could expand the selection to the full word when the checker reports a mistake. It doesn't seem very straightforward; do you suppose it's worth it?

Off-topic question: does the name "aeidon" mean anything?

@otsaloma
Owner

otsaloma commented Mar 12, 2021

It's not fixable in the tokenizer, but in the SpellChecker class we do already have some special cases. Currently there's one character of leading and trailing context that's used. Technically that could be extended to cover this, but I'm not really convinced this is worth it. I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

https://github.com/otsaloma/gaupol/blob/master/aeidon/spell.py#L61

Off-topic question: does the name "aeidon" mean anything?

I had to pick a name when separating that user-interface independent module from the codebase. Gaupol doesn't mean anything either, so I continued in the same style. I also wanted the length to match, so that I could do a search and replace across the codebase without needing to manually fix hanging indents.

@sbraz
Contributor Author

sbraz commented Mar 12, 2021

I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

Here are some ugly one-liners that seem to show words that would not be recognised:

$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/fr_FR.dic -o | tr '-' '\n' | aspell --list -l fr | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/fr_FR.dic;done | sort -u | wc -l
1315
$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/en_GB.dic -o | tr '-' '\n' | aspell --list -l en | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/en_GB.dic;done | sort -u | wc -l
892

There are probably some false positives but it's still not negligible IMO. Do you think it would affect performance a lot to take those into account?

If we do, we also need to take into account words like vol-au-vent for which we'd need to add context in both directions.

@otsaloma
Owner

I can't run those greps: my Debian doesn't seem to have myspell, only hunspell files, and they probably have a different format.

I think maybe we could make the function signature

def check(self, word, extended_word="", leading_context="", trailing_context=""):

extended_word would then extend the word in both directions, at least across dashes. It's doable of course. I don't see a performance issue there, just a question of how much to complicate the code for special cases.
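A rough sketch of how that fallback could behave, assuming a plain set as a stand-in for the real dictionary. `KNOWN_WORDS` and `extend_word` are hypothetical names, `check` is written as a standalone function rather than the SpellChecker method, and the leading and trailing context parameters are left out for brevity.

```python
# Stand-in for a real dictionary lookup; the word list is illustrative only.
KNOWN_WORDS = {"twin-set", "talkie-walkie", "vol-au-vent"}

def extend_word(text, start, end):
    """Grow text[start:end] in both directions across word characters and hyphens."""
    def joinable(char):
        return char.isalnum() or char in "-_"
    while start > 0 and joinable(text[start - 1]):
        start -= 1
    while end < len(text) and joinable(text[end]):
        end += 1
    # Drop stray hyphens at the edges, e.g. from "word- ".
    return text[start:end].strip("-")

def check(word, extended_word=""):
    """Accept the word if it, or its hyphen-extended form, is known."""
    if word in KNOWN_WORDS:
        return True
    return bool(extended_word) and extended_word in KNOWN_WORDS

text = "un vol-au-vent"
start = text.index("au")
extended = extend_word(text, start, start + len("au"))
print(extended)               # vol-au-vent
print(check("au"))            # False
print(check("au", extended))  # True
```

Extending in both directions is what covers the vol-au-vent case mentioned earlier, where the flagged fragment sits in the middle of the word.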
