Add support for v2.0 of HIPE-2022 data #2684

Merged
merged 10 commits on Mar 25, 2022

Conversation

stefan-it
Member

@stefan-it stefan-it commented Mar 21, 2022

Hi,

this PR adds support for v2.0 of the HIPE-2022 dataset.

The following changes were introduced:

  • The version number is now part of the (internal, cached) dataset path: ~/.flair/datasets/ner_hipe_2022 gets the version number appended, e.g. ~/.flair/datasets/ner_hipe_2022/v2.0.
  • The branch name argument is added to the NER_HIPE_2022 constructor. It is now possible to specify the desired branch name for the HIPE upstream data repo.
  • v2.0 introduces new splits (train and dev) for German, English and French in the AJMC dataset.
  • SONAR Dev split is updated.
  • NewsEye German train split is filtered (removal of unannotated documents).
  • Extensive tests for label dictionary (all datasets/languages/splits) added.
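
The versioned cache layout from the first bullet can be sketched as follows. This is only an illustration: the helper name `versioned_dataset_path` is hypothetical and is not taken from the flair code base; only the directory names come from the description above.

```python
from pathlib import Path


def versioned_dataset_path(cache_root: Path, dataset_name: str, version: str) -> Path:
    """Hypothetical helper: append the dataset version to the cache directory."""
    return cache_root / dataset_name / version


# Cache root as named in the PR description.
cache_root = Path.home() / ".flair" / "datasets"

# Each version gets its own directory, so v1.0 and v2.0 files never collide.
v1_path = versioned_dataset_path(cache_root, "ner_hipe_2022", "v1.0")
v2_path = versioned_dataset_path(cache_root, "ner_hipe_2022", "v2.0")
print(v2_path)
```

Keeping the version in the path means a previously cached v1.0 download is not silently reused when v2.0 is requested.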

PR in the HIPE upstream data repo: hipe-eval/HIPE-2022-data#3 and release notes here.

@stefan-it stefan-it marked this pull request as ready for review March 22, 2022 10:41
@stefan-it
Member Author

stefan-it commented Mar 22, 2022

@alanakbik All HIPE-related tests pass now, but CI fails (unrelated) due to:

/home/runner/work/_temp/fc2b2d89-438b-499e-a436-3bfb6271da5c.sh: line 1:  1930 Killed
pytest --runintegration -vv
tests/test_hyperparameter.py::test_text_classifier_param_selector 
Error: Process completed with exit code 137.

@stefan-it stefan-it marked this pull request as draft March 22, 2022 13:58
@stefan-it
Member Author

stefan-it commented Mar 22, 2022

I've just found a label parsing bug when loading German AJMC:

2022-03-22 14:57:53,488 Corpus contains the labels: ner (#587)
2022-03-22 14:57:53,488 1562 instances in dict, 0 instances are UNK'ed
2022-03-22 14:57:53,488 Most commonly observed 'ner'-labels are [('scope', 677), ('pers', 549), ('work', 298), ('loc', 29), ('object', 6), ('date', 2), ('δοα', 1)]
2022-03-22 14:57:53,488 Created (for label 'ner') Dictionary with 8 tags: <unk>, scope, pers, work, loc, object, date, δοα

So ('δοα', 1) should not appear in that dictionary.

I'm going to prepare some test cases to check the label dictionary!

PR is marked as draft again.

@stefan-it
Member Author

Found the root cause:

68	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	_
μηδὲ	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	_
 ἄνδοα	O	_	O	_	_	O	_	_	NoSpaceAfter
,	O	_	O	_	_	O	_	_	_
ſieh	O	_	O	_	_	O	_	_	_
niht	O	_	O	_	_	O	_	_	_
dem	O	_	O	_	_	O	_	_	_
Manne	O	_	O	_	_	O	_	_	_
als	O	_	O	_	_	O	_	_	_
einem	O	_	O	_	_	O	_	_	_
Unheil	O	_	O	_	_	O	_	_	_
für	O	_	O	_	_	O	_	_	_
dich	O	_	O	_	_	O	_	_	_
entgegen	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	EndOfSentence
69	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	_

There's a leading space for the token ἄνδοα in line 5.
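
Why a single leading space matters can be shown with a small sketch. It assumes the reader splits each line on runs of whitespace; the thread does not show the actual parsing code, so this is an illustration of the failure mode rather than the original implementation.

```python
import re

# The problematic line from the excerpt above, with its leading space.
raw_line = " ἄνδοα\tO\t_\tO\t_\t_\tO\t_\t_\tNoSpaceAfter"

# Assumption: a whitespace-based split, as a generic column reader might do.
naive_columns = re.split(r"\s+", raw_line)

# The leading space produces an empty first field, so every column shifts
# one position to the right and the token lands where a tag is expected.
print(naive_columns[0] == "")   # True
print(naive_columns[1])         # the token, now sitting in the tag column
```

With the columns shifted like this, a Greek token can end up being read as an entity tag, which is consistent with the stray dictionary entry reported above.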

@stefan-it
Member Author

stefan-it commented Mar 22, 2022

I implemented two bugfixes: the tab character is now used as the column delimiter (because in one case a token in the AJMC dataset consists of two space-separated words), and leading and trailing spaces are now stripped from each line in the dataset.
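
A minimal sketch of the two fixes, under the assumption that each data line is a tab-separated record. The function name `parse_columns` and the two-word sample token are hypothetical and only serve as illustration; they are not taken from the flair code or the dataset.

```python
from typing import List


def parse_columns(line: str) -> List[str]:
    """Strip surrounding spaces, then split on tabs only."""
    # Remove the newline and any leading/trailing spaces, but keep tabs intact
    # so the column count is preserved.
    cleaned = line.rstrip("\n").strip(" ")
    return cleaned.split("\t")


# Leading space, as in the AJMC excerpt: the token is back in column 0.
with_leading_space = parse_columns(" ἄνδοα\tO\t_\tO\t_\t_\tO\t_\t_\tNoSpaceAfter\n")

# Hypothetical token containing an inner space: it stays a single column.
with_inner_space = parse_columns("two words\tO\t_\tO\t_\t_\tO\t_\t_\t_\n")

print(with_leading_space[:2])
print(with_inner_space[:2])
```

Splitting on tabs alone keeps a token with an inner space in one column, and stripping the spaces first prevents the empty leading field.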

Additionally, I added test cases for the expected label set (reference label set is taken from the hipe2022-datasets-stats.ipynb notebook).
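
A label-set check of this kind could look roughly like the sketch below. The expected set is taken from the German AJMC log output earlier in this thread (without the spurious entry); the real reference lives in the notebook mentioned above, and the helper names are hypothetical.

```python
from typing import Iterable, Set

# Assumption: expected labels for German AJMC, read off the log output above.
EXPECTED_AJMC_DE_LABELS = {"scope", "pers", "work", "loc", "object", "date"}


def collect_labels(tags: Iterable[str]) -> Set[str]:
    """Collect entity labels from BIO-style tags, ignoring the 'O' tag."""
    labels = set()
    for tag in tags:
        if tag == "O":
            continue
        # Drop a leading "B-" or "I-" prefix if present.
        labels.add(tag[2:] if tag[:2] in ("B-", "I-") else tag)
    return labels


def unexpected_labels(tags: Iterable[str], expected: Set[str]) -> Set[str]:
    """Return every observed label that is not in the reference set."""
    return collect_labels(tags) - expected


clean = unexpected_labels(["B-pers", "I-pers", "O", "B-scope"], EXPECTED_AJMC_DE_LABELS)
broken = unexpected_labels(["B-pers", "O", "δοα"], EXPECTED_AJMC_DE_LABELS)
print(clean, broken)
```

A test then only needs to assert that the set of unexpected labels is empty for every dataset, language and split.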

@stefan-it stefan-it marked this pull request as ready for review March 22, 2022 17:02
@stefan-it
Member Author

CI fails again for an unrelated reason; all HIPE-related tests pass now, @alanakbik:

tests/test_datasets.py::ISORT PASSED                                     [ 34%]
tests/test_datasets.py::FLAKE8 PASSED                                    [ 34%]
tests/test_datasets.py::mypy PASSED                                      [ 34%]
tests/test_datasets.py::test_hipe_2022_corpus PASSED                     [ 38%]

@@ -4180,19 +4182,25 @@ def __init__(
}
}

# v2.0 only adds new language and splits for AJMC dataset
hipe_available_splits["v2.0"] = hipe_available_splits.get("v1.0").copy()
Collaborator


@stefan-it mypy complains due to this line, causing the unit tests to fail. The 'error' is printed at the end of the test output:

mypy exited with status 1.
_____________________ flair/datasets/sequence_labeling.py ______________________
4186: error: Item "None" of "Optional[Dict[str, Dict[str, List[str]]]]" has no attribute "copy"
===================================== mypy =====================================

It seems there is some problem with the copy() here. Perhaps it can be removed?

Member Author


Ah, thanks! The copy() is necessary here, because otherwise the v1.0 values would change whenever the v2.0 entries are modified. The .get() method returns an Optional, which causes mypy to fail, so I now use normal index access instead. CI is green now 🤗
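
The point about `.get()` versus index access, and about why the copy is kept, can be sketched like this. The split values are placeholders chosen for illustration and do not reproduce the real contents of `hipe_available_splits`.

```python
from typing import Dict, List

SplitMap = Dict[str, Dict[str, List[str]]]

# Placeholder contents; the real dictionary in flair is larger.
hipe_available_splits: Dict[str, SplitMap] = {
    "v1.0": {"ajmc": {"de": ["sample"], "en": ["sample"]}},
}

# dict.get() is typed as Optional[SplitMap], so mypy rejects calling .copy() on it.
# Index access is typed as SplitMap (and raises KeyError if the key is missing).
hipe_available_splits["v2.0"] = hipe_available_splits["v1.0"].copy()

# Replacing a top-level key in the copy leaves the v1.0 mapping untouched.
hipe_available_splits["v2.0"]["ajmc"] = {
    "de": ["train", "dev"],
    "en": ["train", "dev"],
    "fr": ["train", "dev"],
}

print(hipe_available_splits["v1.0"]["ajmc"])
print(hipe_available_splits["v2.0"]["ajmc"])
```

Note that `dict.copy()` is shallow: replacing a top-level key as above is safe, but mutating a nested dictionary in place would still affect both versions, in which case `copy.deepcopy` would be the safer choice.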

@alanakbik
Collaborator

@stefan-it thanks for adding this!

@alanakbik alanakbik merged commit ec5cce0 into master Mar 25, 2022
@alanakbik alanakbik deleted the hipe-2022-v2-update branch March 25, 2022 15:07