Add EUROPARL_NER_GERMAN #1849

Merged
merged 2 commits into from
Sep 6, 2020

Conversation

stolzenp
Contributor

@stolzenp stolzenp commented Sep 4, 2020

No description provided.

@alanakbik
Collaborator

@stolzenp thanks for adding this! There is one problem with the code. If I do:

from flair.datasets import EUROPARL_NER_GERMAN

# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print a test sentence (index 1, i.e. the second one)
print(corpus.test[1])

I get an error. The problem is that the corpus is not in IOB format: the entity tags in the last column have no prefix. I.e. the sentence:

Ich ich PPER I-NC O
erkläre erklären VVFIN I-VC O
die d ART I-NC O
am am APPRART I-PC O
Donnerstag Donnerstag NN I-PC O
, , $, O O
den d ART I-NC O
28. 28. ADJA I-NC O
März März NN I-NC O
1996 1996 CARD B-NC O
unterbrochene unterbrochen ADJA I-NC O
Sitzungsperiode Sitzungsperiode NN I-NC O
des d ART B-NC O
Europäischen europäisch ADJA I-NC ORG
Parlaments Parlament NN I-NC ORG
für für APPR I-PC O
wiederaufgenommen wiederaufnehmen VVPP I-VC O
. . $. O O

should in fact be

Ich ich PPER I-NC O
erkläre erklären VVFIN I-VC O
die d ART I-NC O
am am APPRART I-PC O
Donnerstag Donnerstag NN I-PC O
, , $, O O
den d ART I-NC O
28. 28. ADJA I-NC O
März März NN I-NC O
1996 1996 CARD B-NC O
unterbrochene unterbrochen ADJA I-NC O
Sitzungsperiode Sitzungsperiode NN I-NC O
des d ART B-NC O
Europäischen europäisch ADJA I-NC I-ORG
Parlaments Parlament NN I-NC I-ORG
für für APPR I-PC O
wiederaufgenommen wiederaufnehmen VVPP I-VC O
. . $. O O

So after download you need to reformat the corpus such that all entity tags are prefixed by an "I-". So

  • "ORG" -> "I-ORG"
  • "PER" -> "I-PER"
  • "LOC" -> "I-LOC"
  • "MISC" -> "I-MISC"

Make sure that the Corpus object gets the reformatted files and not the original ones. To check that everything works correctly, train an NER model with this corpus. If the training completes without error, the corpus is loaded correctly!
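
For illustration, the reformatting step described above could be sketched roughly as follows. This is only a minimal sketch, not the code that was merged: the names add_iob_prefix and reformat_corpus are hypothetical, and it assumes whitespace-separated columns with the entity tag in the last column, blank lines between sentences, and a file encoding that is passed in by the caller (UTF-8 here is a guess and may need to be changed for the real download).

```python
from pathlib import Path

# Bare entity tags used in the original file (taken from the list above).
ENTITY_TAGS = {"ORG", "PER", "LOC", "MISC"}


def add_iob_prefix(line: str) -> str:
    """Return the line with a bare entity tag in the last column prefixed by "I-"."""
    columns = line.split()
    if not columns:
        # Blank line: sentence boundary, keep it empty.
        return ""
    if columns[-1] in ENTITY_TAGS:
        columns[-1] = "I-" + columns[-1]
    return " ".join(columns)


def reformat_corpus(source: Path, target: Path, encoding: str = "utf-8") -> None:
    """Write a copy of `source` to `target` with prefixed entity tags.

    The encoding default is an assumption; adjust it to the actual file.
    """
    with source.open(encoding=encoding) as src, \
            target.open("w", encoding=encoding) as dst:
        for line in src:
            dst.write(add_iob_prefix(line) + "\n")
```

The Corpus object would then be pointed at the file written by reformat_corpus rather than at the downloaded original. Tokens tagged "O" and tags that already carry a prefix are left unchanged, so running the function twice on the same file does no harm.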

… it in EUROPARL_NER_GERMAN corpus class to add the IOB format to the dataset
@alanakbik
Collaborator

Thanks @stolzenp - everything works now!

@alanakbik alanakbik merged commit ff94fb5 into flairNLP:master Sep 6, 2020