GH-2720: handle consecutive whitespaces #2721

mauryaland · 2022-04-10T16:05:14Z

Related to issue #2720.

One quick question: I do not fully understand the purpose of lines 789-791 in data.py:

if token.start_position == 0 and len(self) > 0:
    token.start_pos = len(self.to_original_text()) + self[-1].whitespace_after
    token.end_pos = token.start_pos + len(token.text)

Why do you take the length of the sentence (self refers to the sentence here) to compute start_pos?

alanakbik · 2022-05-04T12:14:22Z

@mauryaland thanks a lot for adding this and sorry for reviewing so late!

Regarding your question: I guess taking the sentence length is a bit inefficient. The idea is that tokens are added to the sentence one after another, so each time a token is added, the sentence provides its current length as the start position of the new token. With your change, though, one could probably also use the position of the last token plus its whitespace_after information to get the start position of the new token.
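
For illustration, here is a minimal sketch comparing the two ways of computing the start position discussed above. The Token and Sentence classes are simplified stand-ins written for this example, not the actual classes in data.py, and the helpers add_token and build are hypothetical names. It also assumes that "the position of the last token" means its end position, and that to_original_text() does not include whitespace after the last token.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    # Simplified stand-in for a token (assumption, not the library class).
    text: str
    whitespace_after: int = 1  # number of spaces that follow this token
    start_pos: int = 0
    end_pos: int = 0


@dataclass
class Sentence:
    # Simplified stand-in for a sentence (assumption, not the library class).
    tokens: List[Token] = field(default_factory=list)

    def to_original_text(self) -> str:
        # Rebuild the text from the tokens; whitespace after the last
        # token is deliberately left out.
        parts = []
        for i, tok in enumerate(self.tokens):
            parts.append(tok.text)
            if i < len(self.tokens) - 1:
                parts.append(" " * tok.whitespace_after)
        return "".join(parts)


def add_token(sentence: Sentence, token: Token, use_last_token: bool) -> None:
    # Hypothetical helper mirroring the logic quoted in the question.
    if token.start_pos == 0 and sentence.tokens:
        last = sentence.tokens[-1]
        if use_last_token:
            # Alternative: constant-time lookup on the previous token.
            token.start_pos = last.end_pos + last.whitespace_after
        else:
            # Current approach: rebuild the text so far and take its length.
            token.start_pos = len(sentence.to_original_text()) + last.whitespace_after
    token.end_pos = token.start_pos + len(token.text)
    sentence.tokens.append(token)


def build(words: List[Tuple[str, int]], use_last_token: bool) -> List[Tuple[int, int]]:
    # Add (text, whitespace_after) pairs one by one and return the offsets.
    sentence = Sentence()
    for text, ws in words:
        add_token(sentence, Token(text, whitespace_after=ws), use_last_token)
    return [(t.start_pos, t.end_pos) for t in sentence.tokens]


# "Hello  world !" with two spaces after "Hello"
words = [("Hello", 2), ("world", 1), ("!", 0)]
print(build(words, use_last_token=False))
print(build(words, use_last_token=True))
```

Under these assumptions both variants produce the same offsets, but the second avoids rebuilding the sentence text every time a token is added.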

GH-2720: handle consecutive whitespaces

0dbd72f

alanakbik merged commit 33b72e6 into flairNLP:master May 4, 2022
