Add support for MasakhaPOS Dataset #3247

Merged
merged 10 commits into from
Aug 11, 2023

stefan-it
Member

@stefan-it stefan-it commented May 23, 2023

Hi,

this PR adds support for the recently proposed MasakhaPOS Dataset.

Details can be found in this tweet.

The dataset is available in this repo: https://github.com/masakhane-io/masakhane-pos

I received a preprint of the paper and wrote unit tests that check the number of parsed sentences in each dataset split for each language.
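
The kind of check described above can be sketched without Flair. This is a minimal, hypothetical example, not the tests from this PR: it assumes a two-column layout (token and tag separated by whitespace, with a blank line between sentences), and the helper `count_sentences`, the sample data, and the expected counts are all made up for illustration.

```python
from typing import Dict


def count_sentences(column_text: str) -> int:
    """Count sentences in column-formatted text where blank lines separate sentences."""
    count = 0
    in_sentence = False
    for line in column_text.splitlines():
        if line.strip():
            # The first non-blank line after a gap starts a new sentence.
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


# Made-up sample splits (NOT real MasakhaPOS data).
SAMPLE_SPLITS: Dict[str, str] = {
    "train": "a DET\ndog NOUN\n\nit PRON\nruns VERB\n\nok INTJ\n",
    "dev": "a DET\ncat NOUN\n",
    "test": "hi INTJ\n\n\nbye INTJ\n",
}

# Hypothetical expected sentence counts per split (illustrative only).
EXPECTED_COUNTS: Dict[str, int] = {"train": 3, "dev": 1, "test": 2}


def check_split_sizes() -> None:
    """Fail loudly if any split parses to an unexpected number of sentences."""
    for split, text in SAMPLE_SPLITS.items():
        actual = count_sentences(text)
        expected = EXPECTED_COUNTS[split]
        assert actual == expected, f"{split}: expected {expected}, got {actual}"


check_split_sizes()
```

A real test would compare against the split sizes reported for each language instead of the invented numbers above.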

Example usage of MasakhaPOS:

from flair.datasets import MASAKHA_POS

# Load the Bambara ("bam") portion of MasakhaPOS
corpus = MASAKHA_POS(languages="bam")

@stefan-it
Member Author

/cc @dadelani

@stefan-it
Member Author

@alanakbik Please let me know if the dataset name is ok: it does not quite fit the UD_ naming scheme, because the dataset is not in Universal Dependencies format; it only has token and upos columns.
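
To illustrate the format gap mentioned above: CoNLL-U, the Universal Dependencies file format, has ten columns per token, while a token/upos dataset fills only two of them. The sketch below pads (token, upos) pairs to ten columns with `_` placeholders. The helper `to_conllu_lines` and the example sentence are hypothetical and are not part of this PR or of Flair.

```python
from typing import List, Tuple

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def to_conllu_lines(sentence: List[Tuple[str, str]]) -> List[str]:
    """Pad (token, upos) pairs to ten tab-separated columns, using "_" for missing fields."""
    lines = []
    for index, (token, upos) in enumerate(sentence, start=1):
        row = {field: "_" for field in CONLLU_FIELDS}
        row["ID"] = str(index)  # 1-based token index within the sentence
        row["FORM"] = token
        row["UPOS"] = upos
        lines.append("\t".join(row[field] for field in CONLLU_FIELDS))
    return lines


# Made-up example sentence (not taken from MasakhaPOS).
example = [("She", "PRON"), ("reads", "VERB"), (".", "PUNCT")]
for line in to_conllu_lines(example):
    print(line)
```

Eight of the ten columns stay empty here, which is why a UD_ prefix would overstate what the dataset provides.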

@dadelani

@stefan-it, the dataset name is MasakhaPOS; the arXiv paper will be out tomorrow.

@stefan-it stefan-it changed the title from "Add support for AfricaPOS Dataset" to "Add support for MasakhaPOS Dataset" May 23, 2023
@stefan-it
Member Author

Thanks @dadelani for the feedback, I have corrected the dataset name now :)

@stefan-it
Member Author

The preprint is now available here 🤗

@stefan-it
Member Author

Hi @helpmefindaname, do you happen to have an idea why the poetry stage gets stuck and times out:

https://github.com/flairNLP/flair/actions/runs/5062453919/jobs/9099744358?pr=3247

This was already the case yesterday. I've just re-run the build, but got the same error.

@helpmefindaname
Collaborator

The dependency resolution took needlessly long, as it tried out all 300+ boto3 versions with an incompatible dependency before taking the right one.
#3249 should fix the problems.
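
As a rough illustration of the behaviour described above, and explicitly not Poetry's actual resolver, here is a toy newest-first search. The package, its version numbers, and the `requires` mapping are all hypothetical; the point is only that when every recent release conflicts with a pinned dependency, each one is tried and rejected before the compatible release is reached.

```python
from typing import Dict, List, Optional, Tuple


def resolve_newest_first(
    candidates: List[int], requires: Dict[int, int], pinned: int
) -> Tuple[Optional[int], int]:
    """Try candidate versions from newest to oldest.

    Returns the first version whose required dependency matches `pinned`,
    together with the number of versions that had to be tried.
    """
    attempts = 0
    for version in sorted(candidates, reverse=True):
        attempts += 1
        if requires[version] == pinned:
            return version, attempts
    return None, attempts


# Hypothetical package with versions 1..301: only version 1 works with the
# pinned dependency (major version 1); every newer release needs major version 2.
candidates = list(range(1, 302))
requires = {v: (1 if v == 1 else 2) for v in candidates}

chosen, attempts = resolve_newest_first(candidates, requires, pinned=1)
print(chosen, attempts)  # 1 301
```

Narrowing the allowed version range up front shrinks `candidates`, which is one general way to cut such a search short.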

@stefan-it
Member Author

After #3258 I will do a rebase now :)

@stefan-it stefan-it marked this pull request as ready for review August 11, 2023 10:07
@alanakbik
Collaborator

@stefan-it thanks for adding this! And thanks @dadelani for creating this dataset!

@alanakbik alanakbik merged commit 10a63dd into master Aug 11, 2023
1 check passed
@alanakbik alanakbik deleted the add-africa-pos-dataset branch August 11, 2023 13:44