Major refactoring of internal label logic #2645

alanakbik · 2022-02-22T15:42:11Z

This PR refactors Flair's internal label logic.

In detail:

complex label classes like SpanLabel, RelationLabel etc. are removed in favor of a single Label class for all types of labels
each Label now has a pointer to the data point to which it belongs. This means that labels cannot be instantiated without a DataPoint object
the Token, Span and Relation data points now inherit from _PartOfSentence, a new special DataPoint subtype. They now require a pointer to the Sentence object from which they stem. The new logic causes all labels added to a _PartOfSentence to also be registered on the Sentence. So instead of previously:

sentence = Sentence("Humboldt Universität zu Berlin is located in Berlin .")

# create span
span = Span(sentence[0:4])
# make a Span-label
span_label = SpanLabel(span=span, value='University')
# add Span-label to sentence
sentence.add_complex_label(typename='ner', label=span_label)

you can now just get a span from the sentence and add a label to it directly. It will get registered on the sentence as well.

# get Span
span = sentence[0:4]
# add label (will automatically register to Sentence as well)
span.add_label("ner", "Organization")

the new registration logic in turn simplifies the signature of the forward_pass method of DefaultClassifier, which now returns 3 values instead of 4 (the sentences are no longer needed). It also does away with the unintuitive spawn logic, which is no longer needed.
a number of fields have been added or moved up to the DataPoint class, for convenience, including properties to get start_position and end_position of datapoints, their text, their tag and score (if they have only one tag) and an unlabeled_identifier
a number of methods like get_tag and add_tag have been removed from Token in favor of the get_label and add_label methods of the parent DataPoint class
the get_spans method of Sentence is back, and a similar get_relations method has been added
the Tokenizer classes no longer return lists of Token, but rather lists of strings that the Sentence object converts to tokens, centralizing the offset and whitespace_after detection in one place
many unit tests added
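
To make the back-pointer and registration behaviour described above concrete, here is a minimal, self-contained sketch. It is not Flair's actual implementation: the Toy* class names and the annotation_layers attribute are hypothetical and only model the idea that a label always knows its data point, and that labelling a part of a sentence also registers the same label on the sentence.

```python
from typing import Dict, List


class ToyLabel:
    """One label class for everything; it always points back at its data point."""

    def __init__(self, data_point: "ToyDataPoint", value: str, score: float = 1.0):
        self.data_point = data_point
        self.value = value
        self.score = score


class ToyDataPoint:
    """Base class: stores labels per annotation layer (e.g. 'ner')."""

    def __init__(self) -> None:
        self.annotation_layers: Dict[str, List[ToyLabel]] = {}

    def add_label(self, typename: str, value: str, score: float = 1.0) -> ToyLabel:
        label = ToyLabel(self, value, score)
        self.annotation_layers.setdefault(typename, []).append(label)
        return label

    def get_labels(self, typename: str) -> List[ToyLabel]:
        return self.annotation_layers.get(typename, [])


class ToySentence(ToyDataPoint):
    def __init__(self, tokens: List[str]) -> None:
        super().__init__()
        self.tokens = tokens


class ToyPartOfSentence(ToyDataPoint):
    """Anything that lives inside a sentence keeps a pointer to that sentence."""

    def __init__(self, sentence: ToySentence) -> None:
        super().__init__()
        self.sentence = sentence

    def add_label(self, typename: str, value: str, score: float = 1.0) -> ToyLabel:
        label = super().add_label(typename, value, score)
        # Register the very same label object on the parent sentence.
        self.sentence.annotation_layers.setdefault(typename, []).append(label)
        return label


class ToySpan(ToyPartOfSentence):
    def __init__(self, sentence: ToySentence, start: int, end: int) -> None:
        super().__init__(sentence)
        self.start, self.end = start, end

    @property
    def text(self) -> str:
        return " ".join(self.sentence.tokens[self.start:self.end])


sentence = ToySentence("Humboldt Universität zu Berlin is located in Berlin .".split())
span = ToySpan(sentence, 0, 4)
span.add_label("ner", "Organization")

label = sentence.get_labels("ner")[0]
print(label.value, "->", label.data_point.text)
# prints: Organization -> Humboldt Universität zu Berlin
```

Because the label carries its own data point, code that consumes labels from the sentence does not need a separate list of the objects they refer to.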
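
The tokenizer change can be pictured in the same spirit. The helper below is a hypothetical sketch, not Flair code: it assumes the tokenizer hands back plain strings that occur in order in the original text, and shows how a single place can then derive each token's start offset and whether whitespace follows it.

```python
from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[str, int, bool]]:
    """Return (token, start_position, whitespace_after) for each token string."""
    aligned = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated words align correctly.
        start = text.index(token, cursor)  # raises ValueError if the token is not found
        end = start + len(token)
        whitespace_after = end < len(text) and text[end].isspace()
        aligned.append((token, start, whitespace_after))
        cursor = end
    return aligned


print(align_tokens("Berlin, Germany.", ["Berlin", ",", "Germany", "."]))
# prints: [('Berlin', 0, False), (',', 6, True), ('Germany', 8, False), ('.', 15, False)]
```

With this split of responsibilities, a tokenizer only decides where the token boundaries are, and the offset bookkeeping lives in one function.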

alanakbik added 15 commits February 11, 2022 15:36

First start of refactoring

ea29151

Refactoring of annotation logic

ce47251

Refactoring of forward_pass

8283803

Update eq operation for Label

3eaf110

Remove commented out code

fdf8395

Remove unused code

626bdc6

Fix TARS models

538a531

Remove unused printline

5199af0

Label logic and new unit tests

660b300

Tokenizers return lists of string

e644914

Fix mypy and unit tests

1d0a21b

Remove printline and unused import

acf172a

Black formatting

21c103c

Fix flake errors

e9a5edc

More mypy fixes

6882cb5

alanakbik changed the title ~~WIP: Major refactoring of internal label logic~~ Major refactoring of internal label logic Feb 23, 2022

alanakbik merged commit 016cd52 into master Feb 25, 2022

alanakbik deleted the refactor_annotations branch February 25, 2022 14:32

tadejmagajna mentioned this pull request Apr 7, 2022

SciSpacyTokenizer.tokenize() function is broken #2710

Closed
