
Letter-precise html tokenization #49


Merged

merged 31 commits into scrapinghub:master on Oct 2, 2017

Conversation

whalebot-helmsman
Contributor

For lossless html detokenization we should:

  • store the positions of our text tokens in the source html file
  • pass this information along to the html tokens
  • use it during detokenization

We also have to separate tokenization from the START/END tag cleanup step, because we want tokens from the clean tree and from the tagged tree at the same time. A rough sketch of the idea is shown below.
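A minimal sketch of the idea (the field names follow the TextToken used later in this PR; the detokenize helper itself is hypothetical and only illustrates the intent):

from collections import namedtuple

# A text token that remembers where it came from:
# chars    - the token text
# position - start offset of the token in the parent text
# length   - number of characters the token covers
TextToken = namedtuple('TextToken', ['chars', 'position', 'length'])

def detokenize(source_text, tokens):
    # Hypothetical helper: rebuild the original text from positioned tokens,
    # preserving the whitespace between them exactly as in source_text.
    pieces = []
    last_end = 0
    for t in tokens:
        pieces.append(source_text[last_end:t.position])              # inter-token gap
        pieces.append(source_text[t.position:t.position + t.length])
        last_end = t.position + t.length
    pieces.append(source_text[last_end:])
    return ''.join(pieces)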

@codecov

codecov bot commented Sep 29, 2017

Codecov Report

Merging #49 into master will increase coverage by 0.05%.
The diff coverage is 89.36%.

@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
+ Coverage   80.86%   80.92%   +0.05%     
==========================================
  Files          37       39       +2     
  Lines        1866     1950      +84     
==========================================
+ Hits         1509     1578      +69     
- Misses        357      372      +15

if is_tail:
    source = elem.tail

modded = ''
Member

Could you please accumulate the pieces in a list here and join them at the end? The current code avoids O(N^2) behaviour only in CPython.
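A small sketch of the suggested pattern (the names are illustrative):

pieces = []
for piece in ('text', ' ', 'around', ' ', 'tags'):
    pieces.append(piece)        # O(1) amortized per append
modded = ''.join(pieces)        # a single O(N) join instead of repeated +=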

@@ -41,6 +45,8 @@ class HtmlToken(_HtmlToken):
    * :attr:`elem` is the current html block (as lxml's Element) - most
      likely you want :attr:`parent` instead of it
    * :attr:`is_tail` flag indicates that token belongs to element tail
    * :attr:`position` is position of token start in parent text
    * :attr:`length` is length of token in parent text
Member

Let's clarify whether we're talking about byte positions or unicode (character) positions.
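A quick illustration of the difference (not from the PR):

s = u'café tokens'
print(s.index(u'tokens'))                   # 5 - unicode (character) offset
print(s.encode('utf-8').index(b'tokens'))   # 6 - byte offset; 'é' is two bytes in utf-8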



class DefaultTokenizer(WordTokenizer):
    def tokenize(self, text):
        tokens = super(DefaultTokenizer, self).tokenize(text)
        # remove standalone commas and semicolons
        # as they broke tag sets, e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
        # as they broke tag sets
        # , e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
Member

A nitpick: the comma should be on the line above :)

            break
        i += shift

    def tokenize(self, text):
        return [t for t in self._tokenize(text) if t]
        return [t for t in self._tokenize(text) if t.chars]
Member

Does it make https://github.com/scrapinghub/webstruct/pull/36/files obsolete?

Also, it seems we can easily make this backwards compatible by using another name for the new method and leaving the existing tokenize method as-is.

Contributor Author

Yes, it does the same thing the span-based #36 wanted to do. But that pull request also contains some regexp modifications.

Do you want me to move the new code into a separate method, e.g. span_tokenize, and keep the old tokenize method, which would call span_tokenize internally and return only the text tokens? This is possible, but there are no consumers of the old method.

Member

I think we can keep the regexp modifications out of this PR; #36 was not merged because quality decreased a bit after the tokenization changes.

As for consumers, it's true there are no consumers in webstruct, but we never know, as it is open source :) Also, the tokenizer is meant to be compatible with nltk's tokenizer; it is not good design to return a different data type from an overridden method.
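A sketch of the backwards-compatible layout being discussed (illustrative, not the PR's actual code; _tokenize is the existing internal generator shown in the diff above):

class WordTokenizer(object):
    def span_tokenize(self, text):
        # New behaviour: TextToken objects carrying chars/position/length.
        return [t for t in self._tokenize(text) if t.chars]

    def tokenize(self, text):
        # Old, nltk-style behaviour kept for compatibility: plain token strings.
        return [t.chars for t in self.span_tokenize(text)]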

        return [t[0] for t in tokens], [t[1] for t in tokens]

    @classmethod
    def from_indicies(Cls, indicies, input_tokens):
Member

  • a typo: it should be from_indices;
  • the first argument should be named cls according to PEP 8 (see the sketch below).
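The fix would look roughly like this (the body is a guess based on the doctests below, which map (index, tag) pairs back onto the original tokens):

class IobEncoder(object):
    @classmethod
    def from_indices(cls, indices, input_tokens):
        # Renamed from `from_indicies`; first argument is `cls` per PEP 8.
        for idx, tag in indices:
            yield input_tokens[idx], tag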

@@ -11,23 +11,31 @@ class IobEncoder(object):

>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> iob_encoder.encode(input_tokens)
>>> [p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)]
[('John', 'B-PER'), ('said', 'O')]

Get the result in another format using ``encode_split`` method::
Member

The encode_split method is removed.

>>> tokens, tags
(['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

Note that IobEncoder is stateful. This means you can encode incomplete
stream and continue the encoding later::

>>> iob_encoder = IobEncoder()
>>> iob_encoder.encode(["__START_PER__", "John"])
>>> input_tokens_partial = ["__START_PER__", "John"]
>>> tokens = iob_encoder.encode(input_tokens_partial)
Member

The .encode method no longer returns tokens, so it would be better to rename the variable.

@@ -36,7 +44,7 @@ class IobEncoder(object):

Group results to entities::

>>> iob_encoder.group(iob_encoder.encode(input_tokens))
>>> iob_encoder.group([p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)])
Member

IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens) is repeated in all the examples. I wonder if we should keep the encode method backwards compatible and introduce another method which works with indices; the encode implementation would then just wrap that method with IobEncoder.from_indicies. A sketch of this split is shown below.
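A sketch of that split (the name encode_indices is hypothetical; from_indicies is the PR's current spelling):

class IobEncoder(object):
    # ... existing methods from the PR, including from_indicies ...

    def encode_indices(self, input_tokens):
        # Hypothetical new name for the index-based encoding that `encode`
        # performs after this PR; returns (token_index, tag) pairs.
        raise NotImplementedError

    def encode(self, input_tokens):
        # Backwards-compatible wrapper: (token, tag) pairs, as before the PR.
        indices = self.encode_indices(input_tokens)
        return list(IobEncoder.from_indicies(indices, input_tokens))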

Contributor Author

This pattern is repeated in the test cases only; we can define a helper function in the test code (see the sketch below).
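For instance, a tiny test helper (the helper name and import path are hypothetical):

from webstruct.sequence_encoding import IobEncoder  # assumed module path

def iob_pairs(encoder, input_tokens):
    # Run the index-based encode and map the result back to (token, tag) pairs.
    return list(IobEncoder.from_indicies(encoder.encode(input_tokens),
                                         input_tokens))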

@kmike
Member

kmike commented Sep 29, 2017

Keeping positions of text and html tokens is a great addition 👍

Vostretsov Nikita added 2 commits September 29, 2017 10:58
@@ -12,9 +12,9 @@ class WordTokenizer(object):

>>> from nltk.tokenize.treebank import TreebankWordTokenizer # doctest: +SKIP
>>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> TreebankWordTokenizer().tokenize(s) # doctest: +SKIP
>>> TreebankWordTokenizer().span_tokenize(s) # doctest: +SKIP
Member

span_tokenize returns a different kind of output; this could be a good time to either fix these tests or remove them.

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
>>> WordTokenizer().tokenize(s)
>>> WordTokenizer().span_tokenize(s)
Member

In nltk, span_tokenize returns (start, end) tuples; it would be better to either implement a compatible API or use a different name.
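For comparison, an nltk-compatible span_tokenize would yield (start, end) offsets into the text; a sketch in terms of this PR's TextToken fields (illustrative only):

def span_tokenize(self, text):
    # nltk's contract: (start, end) character offsets, not token objects.
    for token in self._tokenize(text):
        if token.chars:
            yield token.position, token.position + token.length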

        text = text or ''
        input_tokens = [t for t in self.text_tokenize_func(text)]
        input_tokens = self._limit_tags(input_tokens)
        input_tokens = [TextToken(chars=six.text_type(t.chars),
Member

unicode doesn't look right here; if t.chars is unicode (str in Python 3) then the conversion is not needed; if t.chars is bytes, then the conversion should use a proper encoding, not sys.getdefaultencoding(), which is often ascii.

Contributor Author

This conversion is the same as it was before

Contributor Author

One of the tests expects unicode.

Contributor Author

All the real encoding/decoding is handled by lxml. lxml uses utf-8 as its internal representation. I think we can add a test with real unicode and safely remove this conversion.

Member

Ah, I finally recalled why we have such code! In Python 2.x lxml returns bytes for ASCII-only data and unicode for non-ASCII data; this code ensures everything is unicode. It is only active for ASCII-only bytes in Python 2.x, and a no-op in all other cases, so it works as intended. Sorry for the false alarm.


>>> WordTokenizer().tokenize("Saudi Arabia-") # doctest: +SKIP
['Saudi', 'Arabia', '-']
>>> WordTokenizer().segment_words("Phone:855-349-1914")
Member

@kmike Sep 29, 2017

I don't think we should enable these tests; they are like @xfail, i.e. the result is not what we want.

Contributor Author

So should we remove them altogether?

Member

I think they are nice test cases, so I'd prefer to keep them in some form, maybe converting them to pytest.mark.xfail tests.
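The pytest.mark.xfail form could look roughly like this (the import path, the use of segment_words, and the exact assertion are assumptions, not taken from this PR):

import pytest

@pytest.mark.xfail(reason="desired segmentation is not produced yet")
def test_phone_number_is_segmented():
    from webstruct.text_tokenizers import WordTokenizer  # assumed module path
    tokens = [t.chars for t in WordTokenizer().segment_words("Phone:855-349-1914")]
    assert tokens != ["Phone:855-349-1914"]   # the number should be split out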

@kmike
Member

kmike commented Sep 29, 2017

It looks good, apart from a minor comment about the tests.
Do you know how much speed is affected? Tokenization used to be one of the bottlenecks.

@whalebot-helmsman
Contributor Author

I don't know how speed is affected. Do we have some kind of benchmark?

@kmike
Member

kmike commented Sep 29, 2017

There is no benchmark, but you can load all the trees in our dataset and check how long it takes to tokenize them all before and after the change (see #13 or #15). A rough sketch is below.
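Something along these lines (the corpus glob is a placeholder; load_trees, WebAnnotatorLoader and HtmlTokenizer are webstruct's documented loading/tokenization API):

import time
import webstruct

# Placeholder pattern: point it at the annotated pages in the dataset.
trees = list(webstruct.load_trees('train/*.html', webstruct.WebAnnotatorLoader()))

tokenizer = webstruct.HtmlTokenizer()
start = time.time()
for tree in trees:
    tokenizer.tokenize_single(tree)   # returns (html_tokens, tags)
print(time.time() - start)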

@whalebot-helmsman
Contributor Author

whalebot-helmsman commented Sep 29, 2017

On the proposed benchmark (loading ~300 html pages, 3 times) the new version is about 10% slower than the old one:

[nikita@dsc-dev-01:~/dev/webstruct]$ runinenv.sh ~/ves/webstruct python -m webstruct.html_tokenizer_benchmark
Executing python -m webstruct.html_tokenizer_benchmark in /home/nikita/ves/webstruct
22.220800561946817

[nikita@dsc-dev-01:~/dev/webstruct]$ runinenv.sh ~/ves/webstruct python -m webstruct.html_tokenizer_benchmark
Executing python -m webstruct.html_tokenizer_benchmark in /home/nikita/ves/webstruct
24.135991828050464

    def test_phone(self):
        return self.do_tokenize(
            "Phone:855-349-1914",
            [TextToken(chars='Phone:855-349-1914', position=0, length=18)]
Member

This is not the output we're expecting; the phone number should be separated (like it was in the old doctests).

@kmike
Member

kmike commented Oct 2, 2017

Thanks @whalebot-helmsman! I think the slowdown is tolerable, and having exact positions lays a foundation for other features, e.g. we could recover entity text in a smarter way when extracting entities, or use commas as features without unconditionally removing them in the tokenizer.

@kmike kmike merged commit f9190c3 into scrapinghub:master Oct 2, 2017