Major refactoring of internal label logic #2645

Merged 15 commits on Feb 25, 2022

alanakbik
Collaborator

@alanakbik alanakbik commented Feb 22, 2022

This PR refactors Flair's internal label logic.

In detail:

  • complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of labels
  • each Label now has a pointer to the data point to which it belongs. This means that labels cannot be instantiated without a DataPoint object
  • the Token, Span and Relation data points now inherit from _PartOfSentence, a new special DataPoint subtype. They now require a pointer to the Sentence object from which they stem. With the new logic, every label added to a _PartOfSentence is also registered on the Sentence. So instead of previously:
sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# create span
span = Span(sentence[0:4])
# make a Span-label
span_label = SpanLabel(span=span, value='University')
# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)

you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.

# get Span
span = sentence[0:4]
# add label (will automatically register to Sentence as well)
span.add_label("ner", "Organization")
  • this in turn simplifies the signature of the forward_pass method of DefaultClassifier, which now returns 3 values instead of 4 (the sentences are no longer needed). It also does away with the unintuitive spawn logic, which is no longer needed.
  • a number of fields have been added or moved up to the DataPoint class for convenience, including properties for the start_position and end_position of data points, their text, their tag and score (if they have only one tag), and an unlabeled_identifier
  • a number of methods like get_tag and add_tag have been removed from Token in favor of the get_label and add_label methods of the parent DataPoint class
  • the get_spans method of Sentence is back, and a similar get_relations method has been added
  • the Tokenizer classes no longer return lists of Token but lists of strings, which the Sentence object converts to tokens. This centralizes the offset and whitespace_after detection in one place
  • many unit tests added
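
To illustrate the registration logic described above, here is a minimal, self-contained sketch. It does not import Flair and is not the actual implementation: ToySentence, ToySpan, PartOfSentence and the annotation_layers dictionary are hypothetical stand-ins that only mimic the idea that a Label points back to its data point and that adding a label to a part of a sentence also registers it on the sentence.

```python
from typing import Dict, List


class Label:
    """A single label type; it always points back to the data point it belongs to."""

    def __init__(self, data_point: "DataPoint", value: str, score: float = 1.0) -> None:
        self.data_point = data_point
        self.value = value
        self.score = score


class DataPoint:
    """Toy stand-in (assumption): stores labels per annotation type."""

    def __init__(self) -> None:
        self.annotation_layers: Dict[str, List[Label]] = {}

    def add_label(self, typename: str, value: str, score: float = 1.0) -> Label:
        label = Label(self, value, score)
        self.annotation_layers.setdefault(typename, []).append(label)
        return label

    def get_labels(self, typename: str) -> List[Label]:
        return self.annotation_layers.get(typename, [])


class PartOfSentence(DataPoint):
    """Toy stand-in for the idea behind _PartOfSentence: it knows its sentence."""

    def __init__(self, sentence: "ToySentence") -> None:
        super().__init__()
        self.sentence = sentence

    def add_label(self, typename: str, value: str, score: float = 1.0) -> Label:
        label = super().add_label(typename, value, score)
        # Register the very same Label object on the parent sentence as well.
        self.sentence.annotation_layers.setdefault(typename, []).append(label)
        return label


class ToySpan(PartOfSentence):
    def __init__(self, sentence: "ToySentence", start, stop) -> None:
        super().__init__(sentence)
        self.tokens = sentence.tokens[start:stop]

    @property
    def text(self) -> str:
        return " ".join(self.tokens)


class ToySentence(DataPoint):
    def __init__(self, text: str) -> None:
        super().__init__()
        self.tokens = text.split()  # naive whitespace tokenization, for the sketch only

    def __getitem__(self, item: slice) -> ToySpan:
        return ToySpan(self, item.start, item.stop)


sentence = ToySentence("Humboldt Universität zu Berlin is located in Berlin .")
span = sentence[0:4]
span.add_label("ner", "Organization")

# The label is reachable from the sentence and still knows its span.
label = sentence.get_labels("ner")[0]
print(label.value, "->", label.data_point.text)
```

Because the sentence and the span hold the same Label object, no separate span-specific label class is needed to recover which span a sentence-level label refers to.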

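The tokenizer change can be sketched in a similar way. The helper below is hypothetical (align_tokens is not a Flair function) and only shows, under simplified assumptions, how start offsets and a whitespace_after flag could be derived in one place from the original text and a plain list of token strings.

```python
from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[str, int, bool]]:
    """Return (token, start_position, whitespace_after) for each token string.

    Assumes every token occurs verbatim in `text`, in order.
    """
    aligned = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token; raises ValueError if missing.
        start = text.index(token, cursor)
        end = start + len(token)
        whitespace_after = end < len(text) and text[end].isspace()
        aligned.append((token, start, whitespace_after))
        cursor = end
    return aligned


text = "Berlin is nice."
tokens = ["Berlin", "is", "nice", "."]  # what a tokenizer would now return
aligned = align_tokens(text, tokens)
print(aligned)
```

With this split of responsibilities, a tokenizer only has to decide where the token boundaries are, while offset bookkeeping lives in a single place.
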
@alanakbik alanakbik changed the title WIP: Major refactoring of internal label logic Major refactoring of internal label logic Feb 23, 2022
@alanakbik alanakbik merged commit 016cd52 into master Feb 25, 2022
@alanakbik alanakbik deleted the refactor_annotations branch February 25, 2022 14:32