Known issue with section splitting in heuristic_tokenize.py #4

Open · EmilyAlsentzer (Owner) opened this issue Jun 29, 2019 · 0 comments
There are two bugs in the `sent_tokenize_rules` function in `heuristic_tokenize.py`.

We have not fixed them in this repo because we want to preserve the reproducibility of our code as it was when the work was published. However, anyone extending this work should make the following changes in `heuristic_tokenize.py`:

1. Fix the bug on line 168, where `.` should be replaced with `\.`; i.e. the line should read `while re.search('\n\s*%d\.'%n,segment):`. A minimal demonstration follows this list.
2. Add an else statement (`else: new_segments.append(segments[i])`) to the if statement on line 287, `if (i == N-1) or is_title(segments[i+1]):`. This fixes a bug where lists that have a title header lose their first entry; a sketch of that failure mode also follows below.
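For the first fix, here is a minimal, self-contained check of the two patterns; the sample `segment` and the value of `n` are made up for illustration and are not from the repo:

```python
import re

n = 1
segment = 'PLAN:\n  1) follow up in clinic'  # contains "1)" but no literal "1."

# Published pattern (line 168): the unescaped '.' matches any character,
# so "1)" satisfies it and the enclosing while-loop misfires on this text.
print(bool(re.search(r'\n\s*%d.' % n, segment)))   # True  (false positive)

# Fixed pattern from the issue: '\.' matches only a literal period.
# (Raw strings are used here; the repo writes the same pattern unprefixed.)
print(bool(re.search(r'\n\s*%d\.' % n, segment)))  # False (correct)
```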
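For the second fix, the sketch below shows the failure mode in miniature: a loop that rebuilds `new_segments` but only appends inside the `if` silently drops any segment that fails the condition. Only the condition and the added else-line are quoted from the issue; the toy `is_title` and the sample segments are assumptions, and the repo's surrounding logic differs from this stripped-down loop.

```python
def rebuild(segments, with_fix):
    # Toy stand-in for the repo's is_title heuristic (assumption).
    def is_title(s):
        return s.isupper() or s.endswith(':')

    new_segments = []
    N = len(segments)
    for i in range(N):
        if (i == N - 1) or is_title(segments[i + 1]):
            new_segments.append(segments[i])
        elif with_fix:
            # The missing else-branch: without it, segments[i] is dropped
            # whenever the next segment is not a title.
            new_segments.append(segments[i])
    return new_segments

segments = ['MEDICATIONS:', '1. aspirin', '2. lisinopril', 'PLAN:']
print(rebuild(segments, with_fix=False))  # drops 'MEDICATIONS:' and '1. aspirin'
print(rebuild(segments, with_fix=True))   # keeps all four segments
```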