Letter-precise html tokenization #49
Conversation
Codecov Report

    @@            Coverage Diff             @@
    ##           master      #49      +/-   ##
    ==========================================
    + Coverage   80.86%   80.92%   +0.05%
    ==========================================
      Files          37       39       +2
      Lines        1866     1950      +84
    ==========================================
    + Hits         1509     1578      +69
    - Misses        357      372      +15
webstruct/html_tokenizer.py (outdated):

    if is_tail:
        source = elem.tail

    modded = ''
Could you please use a list here and join it at the end? Repeated string concatenation avoids O(N^2) behavior only in CPython; other Python implementations don't guarantee that optimization.
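A minimal sketch of the suggested pattern (the names `build_text` and `pieces` are illustrative, not from the PR):

```python
def build_text(pieces):
    # Accumulate fragments in a list and join once at the end.
    # This is O(N) in the total length on any Python implementation,
    # unlike ``result += piece`` in a loop, which CPython special-cases
    # but other implementations may execute in O(N^2) time.
    parts = []
    for piece in pieces:
        parts.append(piece)
    return ''.join(parts)
```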
webstruct/html_tokenizer.py (outdated):

    @@ -41,6 +45,8 @@ class HtmlToken(_HtmlToken):
        * :attr:`elem` is the current html block (as lxml's Element) - most
          likely you want :attr:`parent` instead of it
        * :attr:`is_tail` flag indicates that token belongs to element tail
    +   * :attr:`position` is the position of the token start in the parent text
    +   * :attr:`length` is the length of the token in the parent text
Let's clarify whether we're talking about byte positions or unicode (character) positions.
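A quick illustration of why the distinction matters (the example string is arbitrary):

```python
text = u'café token'
# unicode (character) position of 'token'
assert text.index(u'token') == 5
# byte position in the UTF-8 encoding differs, because 'é' is two bytes
assert text.encode('utf-8').index(b'token') == 6
```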
webstruct/text_tokenizers.py (outdated):

    class DefaultTokenizer(WordTokenizer):
        def tokenize(self, text):
            tokens = super(DefaultTokenizer, self).tokenize(text)
            # remove standalone commas and semicolons
    -       # as they broke tag sets, e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
    +       # as they broke tag sets
    +       # , e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
a nitpick: the comma should be on the line above :)
webstruct/text_tokenizers.py (outdated):

                break
            i += shift

        def tokenize(self, text):
    -       return [t for t in self._tokenize(text) if t]
    +       return [t for t in self._tokenize(text) if t.chars]
Does it make https://github.com/scrapinghub/webstruct/pull/36/files obsolete?
Also, it seems we can easily make this backwards compatible by using another name for the new method and leaving the existing tokenize method as-is.
Yes, it does the same thing the span change wanted to do, but pull request #36 also contains some regexp modifications.
You want me to move the new code to a separate method, e.g. span_tokenize, and keep the old tokenize method calling span_tokenize internally and returning only the token text? That is possible, but there are no consumers of the old method.
I think we can keep the regexp modifications out of this PR; #36 was not merged because quality decreased a bit after the tokenization changes.
As for consumers, it's true there are none in webstruct, but we never know, as it is open source :) Also, the tokenizer is meant to be compatible with nltk's tokenizer; returning a different data type from an overridden method is not good design.
webstruct/sequence_encoding.py (outdated):

        return [t[0] for t in tokens], [t[1] for t in tokens]

    @classmethod
    def from_indicies(Cls, indicies, input_tokens):
- a typo: it should be from_indices;
- the argument name should be cls, per PEP 8.
webstruct/sequence_encoding.py (outdated):

    @@ -11,23 +11,31 @@ class IobEncoder(object):

        >>> iob_encoder = IobEncoder()
        >>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
    -   >>> iob_encoder.encode(input_tokens)
    +   >>> [p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)]
        [('John', 'B-PER'), ('said', 'O')]

        Get the result in another format using ``encode_split`` method::
The encode_split method is removed.
webstruct/sequence_encoding.py (outdated):

        >>> tokens, tags
        (['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

        Note that IobEncoder is stateful. This means you can encode incomplete
        stream and continue the encoding later::

        >>> iob_encoder = IobEncoder()
    -   >>> iob_encoder.encode(["__START_PER__", "John"])
    +   >>> input_tokens_partial = ["__START_PER__", "John"]
    +   >>> tokens = iob_encoder.encode(input_tokens_partial)
The .encode method no longer returns tokens, so it would be better to rename the variable.
webstruct/sequence_encoding.py (outdated):

    @@ -36,7 +44,7 @@ class IobEncoder(object):

        Group results to entities::

    -   >>> iob_encoder.group(iob_encoder.encode(input_tokens))
    +   >>> iob_encoder.group([p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)])
IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens) is repeated in all the examples. I wonder if we should keep the encode method backwards compatible and introduce another method that uses indices; encode's implementation would then be exactly that from_indicies(...) expression.
This pattern is repeated in test cases only; we can define a helper function in the test code.
Keeping positions of text and html tokens is a great addition 👍
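A sketch of such a helper (the name encode_to_pairs is hypothetical; from_indicies follows the PR's current spelling):

```python
from webstruct.sequence_encoding import IobEncoder

def encode_to_pairs(iob_encoder, input_tokens):
    # test helper: run the index-based encode and materialize the
    # (token, tag) pairs, so tests and doctests stay concise
    indices = iob_encoder.encode(input_tokens)
    return list(IobEncoder.from_indicies(indices, input_tokens))
```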
webstruct/text_tokenizers.py (outdated):

    @@ -12,9 +12,9 @@ class WordTokenizer(object):

        >>> from nltk.tokenize.treebank import TreebankWordTokenizer  # doctest: +SKIP
        >>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
    -   >>> TreebankWordTokenizer().tokenize(s)  # doctest: +SKIP
    +   >>> TreebankWordTokenizer().span_tokenize(s)  # doctest: +SKIP
span_tokenize returns a different kind of output; this could be a good time to either fix these tests or remove them.
webstruct/text_tokenizers.py (outdated):

        ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
    -   >>> WordTokenizer().tokenize(s)
    +   >>> WordTokenizer().span_tokenize(s)
In nltk, span_tokenize returns (start, end) tuples; it would be better to either implement a compatible API or use a different name.
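For reference, the nltk convention looks like this (using WhitespaceTokenizer, which also implements TokenizerI.span_tokenize):

```python
from nltk.tokenize import WhitespaceTokenizer

spans = list(WhitespaceTokenizer().span_tokenize('Good muffins cost $3.88'))
# (start, end) character offsets, not (chars, position, length) records:
# [(0, 4), (5, 12), (13, 17), (18, 23)]
```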
webstruct/html_tokenizer.py (outdated):

    text = text or ''
    input_tokens = [t for t in self.text_tokenize_func(text)]
    input_tokens = self._limit_tags(input_tokens)
    input_tokens = [TextToken(chars=six.text_type(t.chars),
unicode doesn't look right here; if t.chars is unicode (str in Python 3) then the conversion is not needed, and if t.chars is bytes, the conversion should use a proper encoding, not sys.getdefaultencoding(), which is often ascii.
This conversion is the same as it was before.
One of the tests expects unicode.
All real encoding/decoding is handled by lxml, which uses UTF-8 as its internal representation. I think we can add a test with real unicode and safely remove this conversion.
Ah, I finally recalled why we have such code! In Python 2.x lxml returns bytes for ASCII-only data and unicode for non-ASCII data; this code ensures everything is unicode. It only does real work for ASCII-only bytes in Python 2.x and is a no-op in all other cases, so it works as intended. Sorry for the false alarm.
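A small sketch of the behavior being discussed (a hedged illustration, not the PR's code; ensure_text is a hypothetical name):

```python
import six

def ensure_text(value):
    # In Python 2, lxml may hand back ``str`` (bytes) for ASCII-only text
    # and ``unicode`` otherwise; six.text_type() normalizes both to unicode.
    # The implicit ASCII decode is safe here precisely because lxml only
    # returns bytes when the data is ASCII-only. On Python 3 (and for
    # unicode input on Python 2) this call is effectively a no-op.
    return six.text_type(value)

assert ensure_text('ascii only') == u'ascii only'
```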
webstruct/text_tokenizers.py (outdated):

        >>> WordTokenizer().tokenize("Saudi Arabia-")  # doctest: +SKIP
        ['Saudi', 'Arabia', '-']
        >>> WordTokenizer().segment_words("Phone:855-349-1914")
I don't think we should enable these tests; they are like @xfail, i.e. the result is not what we want.
So we should remove them altogether?
I think they are nice test cases, so I'd prefer to keep them in some form, maybe converted to pytest.mark.xfail tests.
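A sketch of that conversion (the expected token split is an assumption about the desired behavior, not the current output):

```python
import pytest
from webstruct.text_tokenizers import WordTokenizer

@pytest.mark.xfail(reason="phone numbers are not split yet; kept as a "
                          "known-failing case rather than deleted")
def test_phone_is_split():
    tokens = WordTokenizer().tokenize("Phone:855-349-1914")
    assert tokens == ['Phone', ':', '855-349-1914']
```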
It looks good, aside from a minor comment about tests.
I don't know how speed is affected. Do we have some kind of benchmark?
On the proposed benchmark (loading ~300 html pages 3 times), the new version is 10% slower than the old one.
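A rough sketch of that kind of benchmark (the names and the tokenize callable are placeholders; the actual benchmark script isn't shown in this thread):

```python
import time

def benchmark(tokenize, documents, repeats=3):
    # time ``repeats`` full passes of the tokenizer over loaded documents
    start = time.time()
    for _ in range(repeats):
        for doc in documents:
            tokenize(doc)
    return time.time() - start
```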
This reverts commit 75a9698.
    def test_phone(self):
        return self.do_tokenize(
            "Phone:855-349-1914",
            [TextToken(chars='Phone:855-349-1914', position=0, length=18)]
This is not the output we're expecting; the phone number should be separated (as it was in the old doctests).
Thanks @whalebot-helmsman! I think the slowdown is tolerable, and having exact positions provides a foundation for other features: e.g. we can recover entity text in a smarter way when extracting entities, or use commas as features without unconditionally removing them in the tokenizer.
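For instance, with (position, length) on every token, the exact source text of an entity can be sliced back out (a hedged sketch; entity_source_text is a hypothetical helper):

```python
from collections import namedtuple

TextToken = namedtuple('TextToken', ['chars', 'position', 'length'])

def entity_source_text(source, entity_tokens):
    # slice the original text between the first token's start and the
    # last token's end, preserving punctuation and whitespace exactly
    first, last = entity_tokens[0], entity_tokens[-1]
    return source[first.position:last.position + last.length]

text = "John, CEO"
tokens = [TextToken('John', 0, 4), TextToken('CEO', 6, 3)]
print(entity_source_text(text, tokens))  # -> 'John, CEO'
```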
For lossless html detokenization we should
Also, we have to separate tokenization from the START/END tag cleanup process, as we want tokens from the clean and the tagged tree at the same time.