Check for numbers in sentences can be switched off #100
If the CorporaCreator is used with data in which sentences may validly contain numbers, there should be a way to allow them. With the optional command line parameter "-c", this check can now be skipped.
Usage: create-corpora [other args] -c {true, false, t, f, 0, 1, y, n, yes, no}
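As a rough illustration of how such a flag could accept the boolean spellings listed in the usage line, here is a minimal argparse sketch. It is not the code from this PR: the names `str2bool`, `build_parser`, `is_valid` and the `check_numbers` destination are hypothetical, and it assumes that the value passed to "-c" states whether the number check runs (defaulting to on).

```python
import argparse
import re

# Accepted spellings, taken from the usage line of the PR description.
TRUE_VALUES = {"true", "t", "1", "y", "yes"}
FALSE_VALUES = {"false", "f", "0", "n", "no"}


def str2bool(value: str) -> bool:
    """Convert one of the accepted spellings to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the "-c" option (all other args omitted)."""
    parser = argparse.ArgumentParser(prog="create-corpora")
    parser.add_argument(
        "-c",
        dest="check_numbers",  # assumed name, not from the PR
        type=str2bool,
        default=True,  # assumption: the check stays on unless switched off
        help="whether sentences containing digits are rejected",
    )
    return parser


def is_valid(sentence: str, check_numbers: bool) -> bool:
    """Reject sentences containing digits only when the check is enabled."""
    if check_numbers and re.search(r"\d", sentence):
        return False
    return True


if __name__ == "__main__":
    args = build_parser().parse_args(["-c", "no"])
    print(args.check_numbers, is_valid("a1 b2", args.check_numbers))
```

Using a `type=` converter keeps the accepted spellings in one place and lets argparse report anything else as a usage error.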
When is it valid to have numbers?
I am currently working with a phonetic transcription of the CommonVoice data set. For most speech recognition tools, each symbol can only be one character long, which is why, for example, I encoded symbols like "a:" as numbers and specified them in the alphabet.
What about unicode?
The phonetic characters are already unicode characters. However, in phonetic transcription there are symbols that are composed of several characters. Of course I could simply replace the numbers with other characters, but I have chosen this representation in my scripts, which preprocess the data, and filtered out invalid sentences in advance. If the CorporaCreator is only meant to process orthographic sentences, this feature does not have to be merged. But if you want to be able to process data in other forms, I think this should be a feature.
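To make the point about composed symbols concrete, the following small sketch lists the code points of a long consonant such as "kː". The helper name `code_points` is made up for illustration. The length mark is its own code point (U+02D0) and, as far as I know, has no precomposed form, so normalization does not merge it with the preceding letter.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


symbol = "k\u02d0"  # "k" followed by the IPA length mark

# The symbol looks like one unit but is stored as two code points.
print(len(symbol), code_points(symbol))

# NFC normalization leaves it as two code points.
print(len(unicodedata.normalize("NFC", symbol)))
```

This is why a tool that requires one character per symbol needs some single-character stand-in for such sequences.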
Ugh. For some reason I thought things like "kː" were single code points. Generally, however, the CorporaCreator is designed to process orthographic sentences. So beyond your use case I'm not sure the suggested command line option would be of use. And
You might want to issue a warning to users that they should only use this feature if they are sure what they are doing. After all, it is an optional parameter.
However, if you have any concerns, I can fully understand that. If you want to close the PR, you are welcome to do so.