Known issue with section splitting in heuristic_tokenize.py #4

Open · EmilyAlsentzer (Owner) opened this issue Jun 29, 2019 · 0 comments
There are two bugs in the `sent_tokenize_rules` function in `heuristic_tokenize.py`.

We have not fixed them in this repo because we want to preserve the reproducibility of our code as it was when the work was published. However, anyone extending this work should make the following changes in `heuristic_tokenize.py`:

1. Fix the bug on line 168, where `.` should be replaced with `\.`; i.e. the line should read `while re.search('\n\s*%d\.'%n,segment):`. A minimal demonstration follows this list.
2. Add an else statement (`else: new_segments.append(segments[i])`) to the if statement on line 287, `if (i == N-1) or is_title(segments[i+1]):`. This fixes a bug where lists that have a title header lose their first entry; a sketch of that failure mode also follows below.
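For the first fix, here is a minimal, self-contained check of the two patterns; the sample `segment` and the value of `n` are made up for illustration and are not from the repo:

```python
import re

n = 1
segment = 'PLAN:\n  1) follow up in clinic'  # contains "1)" but no literal "1."

# Published pattern (line 168): the unescaped '.' matches any character,
# so "1)" satisfies it and the enclosing while-loop misfires on this text.
print(bool(re.search(r'\n\s*%d.' % n, segment)))   # True  (false positive)

# Fixed pattern from the issue: '\.' matches only a literal period.
# (Raw strings are used here; the repo writes the same pattern unprefixed.)
print(bool(re.search(r'\n\s*%d\.' % n, segment)))  # False (correct)
```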
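For the second fix, the sketch below shows the failure mode in miniature: a loop that rebuilds `new_segments` but only appends inside the `if` silently drops any segment that fails the condition. Only the condition and the added else-line are quoted from the issue; the toy `is_title` and the sample segments are assumptions, and the repo's surrounding logic differs from this stripped-down loop.

```python
def rebuild(segments, with_fix):
    # Toy stand-in for the repo's is_title heuristic (assumption).
    def is_title(s):
        return s.isupper() or s.endswith(':')

    new_segments = []
    N = len(segments)
    for i in range(N):
        if (i == N - 1) or is_title(segments[i + 1]):
            new_segments.append(segments[i])
        elif with_fix:
            # The missing else-branch: without it, segments[i] is dropped
            # whenever the next segment is not a title.
            new_segments.append(segments[i])
    return new_segments

segments = ['MEDICATIONS:', '1. aspirin', '2. lisinopril', 'PLAN:']
print(rebuild(segments, with_fix=False))  # drops 'MEDICATIONS:' and '1. aspirin'
print(rebuild(segments, with_fix=True))   # keeps all four segments
```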