
Letter-precise html tokenization #49


Merged

merged 31 commits into scrapinghub:master on Oct 2, 2017

Conversation

whalebot-helmsman
Contributor

For lossless html detokenization we should:

  • store the positions of our text tokens in the source html file
  • pass this information along to the html tokens
  • use it during detokenization

We also have to separate tokenization from the START/END tag cleanup step, because we want tokens from the clean tree and from the tagged tree at the same time. A rough sketch of the idea is shown below.
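A minimal sketch of the idea (the field names follow the TextToken used later in this PR; the detokenize helper itself is hypothetical and only illustrates the intent):

from collections import namedtuple

# A text token that remembers where it came from:
# chars    - the token text
# position - start offset of the token in the parent text
# length   - number of characters the token covers
TextToken = namedtuple('TextToken', ['chars', 'position', 'length'])

def detokenize(source_text, tokens):
    # Hypothetical helper: rebuild the original text from positioned tokens,
    # preserving the whitespace between them exactly as in source_text.
    pieces = []
    last_end = 0
    for t in tokens:
        pieces.append(source_text[last_end:t.position])              # inter-token gap
        pieces.append(source_text[t.position:t.position + t.length])
        last_end = t.position + t.length
    pieces.append(source_text[last_end:])
    return ''.join(pieces)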

@codecov

codecov bot commented Sep 29, 2017

Codecov Report

Merging #49 into master will increase coverage by 0.05%.
The diff coverage is 89.36%.

@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
+ Coverage   80.86%   80.92%   +0.05%     
==========================================
  Files          37       39       +2     
  Lines        1866     1950      +84     
==========================================
+ Hits         1509     1578      +69     
- Misses        357      372      +15

if is_tail:
    source = elem.tail

modded = ''
Member

Could you please accumulate the pieces in a list here and join them at the end? The current code avoids O(N^2) behaviour only in CPython.
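A small sketch of the suggested pattern (the names are illustrative):

pieces = []
for piece in ('text', ' ', 'around', ' ', 'tags'):
    pieces.append(piece)        # O(1) amortized per append
modded = ''.join(pieces)        # a single O(N) join instead of repeated +=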

@@ -41,6 +45,8 @@ class HtmlToken(_HtmlToken):
    * :attr:`elem` is the current html block (as lxml's Element) - most
      likely you want :attr:`parent` instead of it
    * :attr:`is_tail` flag indicates that token belongs to element tail
    * :attr:`position` is position of token start in parent text
    * :attr:`length` is length of token in parent text
Member

Let's clarify whether we're talking about byte positions or unicode (character) positions.
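A quick illustration of the difference (not from the PR):

s = u'café tokens'
print(s.index(u'tokens'))                   # 5 - unicode (character) offset
print(s.encode('utf-8').index(b'tokens'))   # 6 - byte offset; 'é' is two bytes in utf-8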



class DefaultTokenizer(WordTokenizer):
    def tokenize(self, text):
        tokens = super(DefaultTokenizer, self).tokenize(text)
        # remove standalone commas and semicolons
        # as they broke tag sets, e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
        # as they broke tag sets
        # , e.g. PERSON->FUNCTION in case "PERSON, FUNCTION"
Member

A nitpick: the comma should be on the line above :)

            break
        i += shift

    def tokenize(self, text):
        return [t for t in self._tokenize(text) if t]
        return [t for t in self._tokenize(text) if t.chars]
Member

Does it make https://github.com/scrapinghub/webstruct/pull/36/files obsolete?

Also, it seems we can easily make this backwards compatible by using another name for the new method and leaving the existing tokenize method as-is.

Contributor Author

Yes, it does the same thing the span-based #36 wanted to do. But that pull request also contains some regexp modifications.

Do you want me to move the new code into a separate method, e.g. span_tokenize, and keep the old tokenize method, which would call span_tokenize internally and return only the text tokens? This is possible, but there are no consumers of the old method.

Member

I think we can keep the regexp modifications out of this PR; #36 was not merged because quality decreased a bit after the tokenization changes.

As for consumers, it's true there are no consumers in webstruct, but we never know, as it is open source :) Also, the tokenizer is meant to be compatible with nltk's tokenizer; it is not good design to return a different data type from an overridden method.
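A sketch of the backwards-compatible layout being discussed (illustrative, not the PR's actual code; _tokenize is the existing internal generator shown in the diff above):

class WordTokenizer(object):
    def span_tokenize(self, text):
        # New behaviour: TextToken objects carrying chars/position/length.
        return [t for t in self._tokenize(text) if t.chars]

    def tokenize(self, text):
        # Old, nltk-style behaviour kept for compatibility: plain token strings.
        return [t.chars for t in self.span_tokenize(text)]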

        return [t[0] for t in tokens], [t[1] for t in tokens]

    @classmethod
    def from_indicies(Cls, indicies, input_tokens):
Member

  • a typo: it should be from_indices;
  • the first argument should be named cls according to PEP 8 (see the sketch below).
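The fix would look roughly like this (the body is a guess based on the doctests below, which map (index, tag) pairs back onto the original tokens):

class IobEncoder(object):
    @classmethod
    def from_indices(cls, indices, input_tokens):
        # Renamed from `from_indicies`; first argument is `cls` per PEP 8.
        for idx, tag in indices:
            yield input_tokens[idx], tag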

@@ -11,23 +11,31 @@ class IobEncoder(object):

>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> iob_encoder.encode(input_tokens)
>>> [p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)]
[('John', 'B-PER'), ('said', 'O')]

Get the result in another format using ``encode_split`` method::
Member

The encode_split method is removed.

>>> tokens, tags
(['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

Note that IobEncoder is stateful. This means you can encode incomplete
stream and continue the encoding later::

>>> iob_encoder = IobEncoder()
>>> iob_encoder.encode(["__START_PER__", "John"])
>>> input_tokens_partial = ["__START_PER__", "John"]
>>> tokens = iob_encoder.encode(input_tokens_partial)
Member

The .encode method no longer returns tokens, so it would be better to rename the variable.

@@ -36,7 +44,7 @@ class IobEncoder(object):

Group results to entities::

>>> iob_encoder.group(iob_encoder.encode(input_tokens))
>>> iob_encoder.group([p for p in IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens)])
Member

IobEncoder.from_indicies(iob_encoder.encode(input_tokens), input_tokens) is repeated in all the examples. I wonder if we should keep the encode method backwards compatible and introduce another method which works with indices; the encode implementation would then just wrap that method with IobEncoder.from_indicies. A sketch of this split is shown below.
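A sketch of that split (the name encode_indices is hypothetical; from_indicies is the PR's current spelling):

class IobEncoder(object):
    # ... existing methods from the PR, including from_indicies ...

    def encode_indices(self, input_tokens):
        # Hypothetical new name for the index-based encoding that `encode`
        # performs after this PR; returns (token_index, tag) pairs.
        raise NotImplementedError

    def encode(self, input_tokens):
        # Backwards-compatible wrapper: (token, tag) pairs, as before the PR.
        indices = self.encode_indices(input_tokens)
        return list(IobEncoder.from_indicies(indices, input_tokens))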

Contributor Author

This pattern is repeated in the test cases only; we can define a helper function in the test code (see the sketch below).
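For instance, a tiny test helper (the helper name and import path are hypothetical):

from webstruct.sequence_encoding import IobEncoder  # assumed module path

def iob_pairs(encoder, input_tokens):
    # Run the index-based encode and map the result back to (token, tag) pairs.
    return list(IobEncoder.from_indicies(encoder.encode(input_tokens),
                                         input_tokens))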

@kmike
Member

kmike commented Sep 29, 2017

Keeping positions of text and html tokens is a great addition 👍

Vostretsov Nikita added 2 commits September 29, 2017 10:58
@@ -12,9 +12,9 @@ class WordTokenizer(object):

>>> from nltk.tokenize.treebank import TreebankWordTokenizer # doctest: +SKIP
>>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> TreebankWordTokenizer().tokenize(s) # doctest: +SKIP
>>> TreebankWordTokenizer().span_tokenize(s) # doctest: +SKIP
Member

span_tokenize returns a different kind of output; this could be a good time to either fix these tests or remove them.

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
>>> WordTokenizer().tokenize(s)
>>> WordTokenizer().span_tokenize(s)
Member

In nltk, span_tokenize returns (start, end) tuples; it would be better to either implement a compatible API or use a different name.
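For comparison, an nltk-compatible span_tokenize would yield (start, end) offsets into the text; a sketch in terms of this PR's TextToken fields (illustrative only):

def span_tokenize(self, text):
    # nltk's contract: (start, end) character offsets, not token objects.
    for token in self._tokenize(text):
        if token.chars:
            yield token.position, token.position + token.length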

        text = text or ''
        input_tokens = [t for t in self.text_tokenize_func(text)]
        input_tokens = self._limit_tags(input_tokens)
        input_tokens = [TextToken(chars=six.text_type(t.chars),
Member

unicode doesn't look right here; if t.chars is unicode (str in Python 3) then the conversion is not needed; if t.chars is bytes, then the conversion should use a proper encoding, not sys.getdefaultencoding(), which is often ascii.

Contributor Author

This conversion is the same as it was before

Contributor Author

One of the tests expects unicode.

Contributor Author

All the real encoding/decoding is handled by lxml. lxml uses utf-8 as its internal representation. I think we can add a test with real unicode and safely remove this conversion.

Member

Ah, I finally recalled why we have such code! In Python 2.x lxml returns bytes for ASCII-only data and unicode for non-ASCII data; this code ensures everything is unicode. It is only active for ASCII-only bytes in Python 2.x, and a no-op in all other cases, so it works as intended. Sorry for the false alarm.


>>> WordTokenizer().tokenize("Saudi Arabia-") # doctest: +SKIP
['Saudi', 'Arabia', '-']
>>> WordTokenizer().segment_words("Phone:855-349-1914")
Member

@kmike Sep 29, 2017

I don't think we should enable these tests; they are like @xfail, i.e. the result is not what we want.

Contributor Author

So should we remove them altogether?

Member

I think they are nice test cases, so I'd prefer to keep them in some form, maybe converting them to pytest.mark.xfail tests.
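The pytest.mark.xfail form could look roughly like this (the import path, the use of segment_words, and the exact assertion are assumptions, not taken from this PR):

import pytest

@pytest.mark.xfail(reason="desired segmentation is not produced yet")
def test_phone_number_is_segmented():
    from webstruct.text_tokenizers import WordTokenizer  # assumed module path
    tokens = [t.chars for t in WordTokenizer().segment_words("Phone:855-349-1914")]
    assert tokens != ["Phone:855-349-1914"]   # the number should be split out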

@kmike
Member

kmike commented Sep 29, 2017

It looks good, apart from a minor comment about the tests.
Do you know how much speed is affected? Tokenization used to be one of the bottlenecks.

@whalebot-helmsman
Contributor Author

I don't know how speed is affected. Do we have some kind of benchmark?

@kmike
Member

kmike commented Sep 29, 2017

There is no benchmark, but you can load all the trees in our dataset and check how long it takes to tokenize them all before and after the change (see #13 or #15). A rough sketch is below.
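Something along these lines (the corpus glob is a placeholder; load_trees, WebAnnotatorLoader and HtmlTokenizer are webstruct's documented loading/tokenization API):

import time
import webstruct

# Placeholder pattern: point it at the annotated pages in the dataset.
trees = list(webstruct.load_trees('train/*.html', webstruct.WebAnnotatorLoader()))

tokenizer = webstruct.HtmlTokenizer()
start = time.time()
for tree in trees:
    tokenizer.tokenize_single(tree)   # returns (html_tokens, tags)
print(time.time() - start)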

@whalebot-helmsman
Contributor Author

whalebot-helmsman commented Sep 29, 2017

On the proposed benchmark (loading ~300 html pages, 3 times) the new version is about 10% slower than the old one:

[nikita@dsc-dev-01:~/dev/webstruct]$ runinenv.sh ~/ves/webstruct python -m webstruct.html_tokenizer_benchmark
Executing python -m webstruct.html_tokenizer_benchmark in /home/nikita/ves/webstruct
22.220800561946817

[nikita@dsc-dev-01:~/dev/webstruct]$ runinenv.sh ~/ves/webstruct python -m webstruct.html_tokenizer_benchmark
Executing python -m webstruct.html_tokenizer_benchmark in /home/nikita/ves/webstruct
24.135991828050464

    def test_phone(self):
        return self.do_tokenize(
            "Phone:855-349-1914",
            [TextToken(chars='Phone:855-349-1914', position=0, length=18)]
Member

This is not the output we're expecting; the phone number should be separated (like it was in the old doctests).

@kmike
Member

kmike commented Oct 2, 2017

Thanks @whalebot-helmsman! I think the slowdown is tolerable, and having exact positions lays a foundation for other features, e.g. we could recover entity text in a smarter way when extracting entities, or use commas as features without unconditionally removing them in the tokenizer.

@kmike kmike merged commit f9190c3 into scrapinghub:master Oct 2, 2017