Add support for v2.0 of HIPE-2022 data #2684
@alanakbik All HIPE-related tests pass now, but CI fails (unrelated) due to:
I've just found a label parsing bug when loading German AJMC:
2022-03-22 14:57:53,488 Corpus contains the labels: ner (#587)
2022-03-22 14:57:53,488 1562 instances in dict, 0 instances are UNK'ed
2022-03-22 14:57:53,488 Most commonly observed 'ner'-labels are [('scope', 677), ('pers', 549), ('work', 298), ('loc', 29), ('object', 6), ('date', 2), ('δοα', 1)]
2022-03-22 14:57:53,488 Created (for label 'ner') Dictionary with 8 tags: <unk>, scope, pers, work, loc, object, date, δοα
So I'm going to prepare some test cases to check the label dictionary! The PR is marked as draft again.
Found the root cause:
There's a leading space for the token
I implemented two bugfixes: the tab character is now used as the column delimiter (because in one case there is a two-token word in the AJMC dataset), and leading and trailing spaces are now removed from each line in the dataset. Additionally, I added test cases for the expected label set (the reference label set is taken from the hipe2022-datasets-stats.ipynb notebook).
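For illustration, here is a minimal sketch of the two fixes, not the actual flair implementation. The helper name `parse_column_line` and the sample lines are hypothetical and are not taken from the AJMC files; the sketch only shows why stripping each line and splitting on tabs keeps a two-word token and its label in the right columns.

```python
from typing import List


def parse_column_line(line: str, delimiter: str = "\t") -> List[str]:
    """Strip surrounding whitespace from the line, then split it on the delimiter."""
    # str.strip() removes all leading/trailing whitespace, including the newline.
    return line.strip().split(delimiter)


# Hypothetical sample lines (assumption: token in column 0, label in column 1).
with_leading_space = " Ajax\tB-pers\n"
two_word_token = "De officiis\tB-work\n"

print(parse_column_line(with_leading_space))  # token and label, no stray space
print(parse_column_line(two_word_token))      # the two-word token stays in one column

# For contrast: splitting on any whitespace breaks the two-word token apart,
# so the second column is no longer the label.
print(two_word_token.split())
```

With whitespace splitting, the second line yields three fields instead of two, which is the kind of column shift that can put a token into the label dictionary.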
CI fails again for unrelated reasons; all HIPE-related tests are good now, @alanakbik:
flair/datasets/sequence_labeling.py
@@ -4180,19 +4182,25 @@ def __init__(
}
}

# v2.0 only adds new language and splits for AJMC dataset
hipe_available_splits["v2.0"] = hipe_available_splits.get("v1.0").copy()
@stefan-it mypy complains due to this line, causing the unit tests to fail. The 'error' is printed at the end of the test output:
mypy exited with status 1.
_____________________ flair/datasets/sequence_labeling.py ______________________
4186: error: Item "None" of "Optional[Dict[str, Dict[str, List[str]]]]" has no attribute "copy"
===================================== mypy =====================================
It seems there is some problem with the copy() here. Perhaps it can be removed?
Ah, thanks! copy is necessary here, because otherwise the v1.0 values would change whenever the v2.0 values are modified. The .get() method returns an Optional, which causes mypy to fail here, so I now use normal index access. CI is green then 🤗
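A small sketch of the point discussed in this thread, assuming a heavily shortened, hypothetical stand-in for the real split table (the variable name `available_splits` and the split values are made up for illustration):

```python
from typing import Dict, List

# Hypothetical stand-in for the split table; the values are illustrative only.
available_splits: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "v1.0": {"ajmc": {"de": ["sample"]}},
}

# dict.get() is typed as returning Optional[...], so mypy rejects
# available_splits.get("v1.0").copy(). Index access is typed as the value
# type itself (and raises KeyError if the key is missing).
available_splits["v2.0"] = available_splits["v1.0"].copy()

# Reassigning a top-level key on the copy leaves the v1.0 entry untouched.
available_splits["v2.0"]["ajmc"] = {"de": ["train", "dev"]}

print(available_splits["v1.0"]["ajmc"])
print(available_splits["v2.0"]["ajmc"])
```

Note that `dict.copy()` is a shallow copy: it protects v1.0 when top-level keys are reassigned, as above, but mutating a nested dict in place would still affect both versions. `copy.deepcopy` would avoid that if nested in-place edits were needed.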
@stefan-it thanks for adding this!
Hi,
this PR adds support for v2.0 of the HIPE-2022 dataset.
The following changes were introduced:
~/.flair/datasets/ner_hipe_2022 gets the version number appended, e.g. ~/.flair/datasets/ner_hipe_2022/v2.0.
NER_HIPE_2022 constructor: it is now possible to specify the desired branch name for the HIPE upstream data repo.
PR in the HIPE upstream data repo: hipe-eval/HIPE-2022-data#3 and release notes here.
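As a rough illustration of the versioned cache folder, here is a minimal sketch. The helper `versioned_dataset_dir` is hypothetical and is not part of flair; it only shows how appending the version keeps the data for different HIPE-2022 releases in separate directories.

```python
from pathlib import Path


def versioned_dataset_dir(base_path: Path, dataset_name: str, version: str) -> Path:
    """Return the dataset folder with the version appended as a subdirectory."""
    return base_path / dataset_name / version


# Assumption: the cache root is ~/.flair/datasets, as in the paths quoted above.
cache_root = Path.home() / ".flair" / "datasets"

print(versioned_dataset_dir(cache_root, "ner_hipe_2022", "v1.0"))
print(versioned_dataset_dir(cache_root, "ner_hipe_2022", "v2.0"))
```

Because each version resolves to its own folder, previously downloaded v1.0 files are not overwritten or mixed with v2.0 files.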