Philipp Koehn edited this page Sep 29, 2015 · 38 revisions

Welcome to the mtma_bitext wiki!

Linguee Recall Study

Initially, I selected 68 French words that occur 10 times in news2014.fr (I could have selected many more), and crawled them with Linguee. This gave me 736 URL pairs, along with aligned sentence fragments. The goal of the study is to find out how many of these we can recover with our baseline pipeline, so we get a sense of where we lose the most, and hence where better methods could yield the largest gain.

  • ssh syn
  • cd /home/pkoehn/statmt/project/crawl/linguee
  • ./crawl-lingue.perl LANGUAGE < LANGUAGE-unknown-freq10.word-list
  • ./get-urls-from-linguee.perl LANGUAGE > LANGUAGE-unknown-freq10.info

Runs:

  • 68 French words, 736 URL pairs
  • 660 French words, 6424 URL pairs, 2171 unique web domains

PDF files

245 of the 736 URLs point to pdf files. These are not in CommonCrawl.

  • Total number of URLs: 736 (646 unique)
  • Loss: 33%
  • Remaining: 491 (459 unique)
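The PDF filtering and loss accounting above can be sketched as follows (the pair format here is an assumption; the real lists come from the Linguee crawl):

```python
# Sketch of the PDF filtering step: drop URL pairs where either side points
# to a PDF, and report the loss.

def filter_pdf_pairs(url_pairs):
    """Keep only (en, fr) pairs where neither URL ends in .pdf."""
    kept = [(en, fr) for en, fr in url_pairs
            if not en.lower().endswith(".pdf")
            and not fr.lower().endswith(".pdf")]
    loss = 1.0 - len(kept) / len(url_pairs) if url_pairs else 0.0
    return kept, loss

pairs = [("http://a.com/doc.pdf", "http://a.com/fr/doc.pdf"),
         ("http://b.com/page", "http://b.com/fr/page")]
kept, loss = filter_pdf_pairs(pairs)
print(len(kept), round(loss, 2))  # 1 0.5
```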

Can we get all the Linguee data from CommonCrawl?

This analysis is based on the first run with 68 French words, resulting in 736 URL pairs.

Do we have the URLs in CommonCrawl?

Use the fancy querying interface http://statmt.org:8030/query_domain?domain=URL

  • Starting point: 456 URL pairs (note: not all 736 pairs have been processed yet)
  • Found English URLs: 36 URLs (92% loss)
  • Found French URLs: 23 URLs (95% loss)
  • Found both URLs: 13 URLs (97% loss)
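A minimal sketch of querying that interface per URL. Assumption: the response format of the statmt.org service is not documented here, so `found_in_commoncrawl` simply treats any non-empty response body as a hit:

```python
# Sketch of checking a URL against the CommonCrawl index via the
# query interface mentioned above.

from urllib.parse import quote
from urllib.request import urlopen

def build_query(url):
    """Build the query URL for the statmt.org interface."""
    return "http://statmt.org:8030/query_domain?domain=" + quote(url, safe="")

def found_in_commoncrawl(url):
    # Network call; only run this where statmt.org:8030 is reachable.
    with urlopen(build_query(url)) as response:
        return bool(response.read().strip())

print(build_query("example.com/fr/page"))
# http://statmt.org:8030/query_domain?domain=example.com%2Ffr%2Fpage
```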

So, that's not so good.

This is currently the main bottleneck. We probably have to crawl ourselves, and use CommonCrawl only as means to find promising sites.

Do we have any pages from the same web domain in CommonCrawl?

  • Starting point: 491 French URLs (246 unique web domains)
  • Loss: 15%
  • Found domain in CommonCrawl: 417 URLs (184 unique web domains)

Do we select the web domains in the baseline pipeline related to the URLs?

The baseline pipeline here is URL matching.

  • Starting point: 490 (458 unique, 165 web domains)
  • Loss: 67%
  • Remaining: 164 URLs (50? unique web domains)

Component Testing: Document Alignment and Sentence Alignment in Bitextor

For now, the starting point is the set of URLs from Linguee. We first check whether these are still alive by crawling them. For the ones for which we downloaded HTML documents, we check whether the French page contains the matched word - a basic but reliable sanity check, since the French words are rare. We lose some URL pairs because the pages are Latin-1 encoded and grepping for the matched word (which is UTF-8) fails.

  • download-urls.perl LANGUAGE
  • check-downloaded-html-for-key-word.perl LANGUAGE (creates LANGUAGE-*.crawl-check)
  • URL pairs: 6424 (4917 unique)
  • URL pairs that are not PDF: 4473 (3573 unique, 1496 unique web domains)
  • Crawled with non-empty response: 2974 URL pairs (2325 unique, 879 unique web domains)
  • Successfully crawled: 1761 URL pairs (1375 unique, 448 unique web domains)
  • Loss: 61% (62%, 71%)
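The Latin-1 loss mentioned above could be avoided with an encoding-tolerant check; a sketch (decode as UTF-8, fall back to Latin-1):

```python
# The keyword sanity check loses pairs when a page is Latin-1 encoded,
# because searching the raw bytes for a UTF-8 word fails. Sketch of an
# encoding-tolerant variant.

def contains_keyword(raw_bytes, keyword):
    """Return True if the decoded page text contains the (unicode) keyword."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_bytes.decode("latin-1")
    return keyword in text

page = "le propriétaire".encode("latin-1")   # a Latin-1 encoded page
print(contains_keyword(page, "propriétaire"))  # True
```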

For these 1761 URLs, we completely crawl the 448 web domains (using Bitextor / httrack).

Of these 448 unique domains:

  • we excluded some since they are way too big: Canadian Parliament (265 URLs) and Europarl/europa.eu (253 URLs).
  • we excluded some because the French and the English have different domains

This leaves 389 domains.

Are aligned web pages under the same web domain?

Most, yes. Over 90%. And the rest are similar.

Document alignment

The task of document alignment is to find the URL pair (that we know from Linguee) in the full crawl of the web site.

  • crawl-single-domain-linguee-matches.perl LANGUAGE
  • check-linguee-matches-in-site-downloads.perl LANGUAGE > result-check-linguee-matches-in-site-downloads-LANGUAGE
  • stage 1 Starting point: 1761 URL pairs (1375 unique, 448 unique web domains)
  • stage 2 Different web domains for source and target removed: 1591 URL pairs (1245 unique, 395 unique web domains)
  • stage 3 Big web domains removed: 881 URL pairs (741 unique, 389 unique web domains)
  • stage 4 Crawling completed: 872 URL pairs --- temporary loss
  • grep -v ^CH result-check-linguee-matches-in-site-downloads-LANGUAGE | wc
  • Both URLs found in domain: 490 URL pairs
  • run-bitextor-on-all-domains.perl LANGUAGE
  • Bitextor finished document alignment: 446 URL pairs --- temporary loss

Task definition:

  • Given the site crawls in /home/pkoehn/statmt/project/crawl/data/site-crawls (Valhalla)
  • Align the web pages for each site
  • Answer key: grep -v ^C /home/pkoehn/statmt/project/crawl/data/result-check-linguee-matches-in-site-downloads-LANGUAGE

Bitextor performance:

  • evaluate-bitextor-document-align.perl
  • Correct: 192 (43%)
  • Wrongly aligned: 52 (12%)
  • Not aligned: 202 (45%)
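The scoring into correct / wrongly aligned / not aligned can be sketched as follows (the dict-based pair representation is illustrative, not the format of the actual answer key):

```python
# Sketch of the document-alignment scoring in the spirit of
# evaluate-bitextor-document-align.perl: compare predicted (en -> fr)
# page pairs against the Linguee answer key.

def score_alignment(predicted, gold):
    """predicted/gold: dicts mapping en_url -> fr_url."""
    correct = sum(1 for en, fr in gold.items() if predicted.get(en) == fr)
    wrong = sum(1 for en, fr in gold.items()
                if en in predicted and predicted[en] != fr)
    missing = sum(1 for en in gold if en not in predicted)
    return correct, wrong, missing

gold = {"a.html": "a_fr.html", "b.html": "b_fr.html", "c.html": "c_fr.html"}
pred = {"a.html": "a_fr.html", "b.html": "x_fr.html"}
print(score_alignment(pred, gold))  # (1, 1, 1)
```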

Bitextor misses obvious URL patterns:

The 'not aligned' cases may be due to boilerplate removal, which sometimes removes all content from a page; the resulting empty pages are then collapsed by de-duplication.

Error sources

Only documents that are detected to be authored in different languages may be aligned.

Setup: Extract text from both sites (with only unicode normalization/sanitization as preprocessing), classify text spans, remove spans that are not en/fr, and take the most common language as the 'document language'. If both pages of a pair are classified as the same language, that's a loss; otherwise a win.

  • One document classified as EN the other as FR: 474 (96.7%)
  • Both in the same language: 16 (3.3%)
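The majority vote over span labels can be sketched as follows; the span classifier itself is a stand-in here (the real setup classifies spans of unicode-normalized text with a proper language identifier):

```python
# Sketch of the per-span language classification described above: drop
# spans not in {en, fr}, then take the majority label as the document
# language.

from collections import Counter

def document_language(span_labels, allowed=("en", "fr")):
    """span_labels: predicted language per text span, e.g. ['en','fr','en']."""
    votes = Counter(label for label in span_labels if label in allowed)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

print(document_language(["en", "en", "de", "fr"]))  # en
```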

Note: Bitextor might be better or worse due to tika/boilerpipe and document level language classification as opposed to classifying spans. We can get that information from the .lett files.

URL Matching

Using exactly the same URL stripping as in the "Dirt Cheap" paper, we match 162 (33%) pairs; after additionally removing "//" from paths, 174 (35.5%). The latter correctly matches cases like bla.com/index and bla.com/fr/index.

Attention: This matching uses the original URLs, which are kept in a comment at the end of the downloaded HTML files.
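A sketch of such language-marker stripping. The exact marker list and rules of the "Dirt Cheap" paper are not reproduced here; this is an illustrative variant that also collapses "//":

```python
import re

# Illustrative list of language markers to strip from URLs (assumption;
# the "Dirt Cheap" paper uses its own, larger list).
LANG_MARKERS = r"(fr|en|french|english|fr-fr|en-us)"

def strip_url(url):
    """Strip language markers from path and query, then collapse "//"."""
    url = url.lower()
    url = re.sub(r"/" + LANG_MARKERS + r"(?=/|$)", "/", url)
    url = re.sub(r"(lang|language)=" + LANG_MARKERS, "", url)
    url = re.sub(r"(?<!:)//", "/", url)   # keep the "//" of "http://"
    return url

print(strip_url("bla.com/fr/index") == strip_url("bla.com/index"))  # True
```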

Sentence Alignment

Given document pairs from Linguee, can we extract the same sentence pair?

The starting point is the URLs for which we crawled valid HTML pages from the web. On these we run the Bitextor sentence alignment pipeline.

/home/pkoehn/statmt/project/crawl/linguee/evaluate-bitextor-sentence-aligment.perl

Evaluation gives full credit for cases where sentence fragments are partial matches, e.g.,

  • Linguee: The big man has a funny nose.
  • Bitextor: The big man has a funny nose. Really.
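This lenient credit amounts to a containment check after whitespace normalization; a sketch:

```python
# Sketch of the lenient matching used in the evaluation: full credit when
# the Linguee fragment is contained in the Bitextor-aligned sentence (or
# vice versa), after whitespace normalization.

def fragments_match(linguee, bitextor):
    a = " ".join(linguee.split())
    b = " ".join(bitextor.split())
    return a in b or b in a

print(fragments_match("The big man has a funny nose.",
                      "The big man has a funny nose. Really."))  # True
```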

Bitextor performance:

  • Correctly aligned: 136/212 (64%)

Distance Scoring with Bigrams (a la Google):

Pipeline:

  1. Collect all files from a crawl directory and extract those that are HTML files. Similar to Bitextor's webdir2ett:

    find mammusique.com -type f -exec file -N --mime-type --mime-encoding {} + | /bin/grep -E "(text|html|xml)" > mammusique.com.files

  2. Determine main language for each document by parsing HTML, UTF-8 normalization/sanitization, extraction of spans in different languages and then picking the most common one if it is in the list of languages we're looking for.

    python /home/buck/net/build/mtma_bitext/baseline/checklang.py -annotate mammusique.com.files mammusique.com.languages

This produces a file of the format `filename<TAB>lang`. This file is used to determine the two sides of the bipartite matching graph.

  3. Extract, for each English and each French file, the English text using the pipeline from step 2, and for each French file, the French text that is to be translated. This step already performs sentence splitting and, for English, normalization and tokenization using the moses scripts. Since this is more efficient to do later for the French part, tokenization and normalization are deactivated during French text extraction.

    cat mammusique.com.files | python ~/net/build/mtma_bitext/baseline/extract_foreign_text.py -o mammusique.com.keep -prefix=/fs/gna0/buck/cc/linguee/site-crawls/mammusique.com/ -lang en
    cat mammusique.com.files | python /home/buck/net/build/mtma_bitext/baseline/extract_foreign_text.py -o mammusique.com.translate -prefix=/fs/gna0/buck/cc/linguee/site-crawls/mammusique.com/ -lang fr -tokenizer="" -normalizer=""

The idea here is that even the French pages will contain some English that we want to use in matching. It is most likely boilerplate, but may help when comparing pages from very different parts of a website.

The file format is:

`filename<TAB>sentence`
  4. Copy the DOMAIN.translate file to the CLSP cluster and translate with moses:

    cd /home/cbuck/b07/en-fr
    cut -f 2 mammusique.com.translate \
      | /home/pkoehn/moses/scripts/tokenizer/normalize-punctuation.perl fr \
      | /home/pkoehn/moses/scripts/tokenizer/tokenizer.perl -l fr \
      | /home/pkoehn/moses/scripts/recaser/truecase.perl --model /home/pkoehn/experiment/crawltest-fr-en/truecaser/truecase-model.3.fr \
      | /home/pkoehn/moses/bin/moses.2015-03-23 -f moses.tuned.ini.7 -threads 30 \
      | /home/pkoehn/moses/scripts/recaser/detruecase.perl > mammusique.com.translated

Note that we don't detokenize.

  5. Copy the file DOMAIN.translated back to Edinburgh, add the first column (the filenames) again, and extract n-grams:

    paste <(cut -f 1 mammusique.com.translate) mammusique.com.translated | python /home/buck/net/build/DataCollection/baseline/ngrams.py -n 4 > mammusique.com.tngrams

These are the 'translated n-grams', i.e. those generated by translation. This example uses 4-grams, but we should use bigrams as in the Google paper.
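The n-gram extraction can be sketched as follows (a toy stand-in for ngrams.py; input lines are `filename<TAB>sentence`):

```python
# Sketch of the n-gram extraction step: for each input line, emit the
# filename together with each n-gram of the whitespace-tokenized sentence.
# Bigrams (n=2) are what the Google paper uses.

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def file_ngrams(lines, n=2):
    out = []
    for line in lines:
        filename, sentence = line.split("\t", 1)
        out.extend((filename, ng) for ng in ngrams(sentence.split(), n))
    return out

print(file_ngrams(["f1.html\ta b c"]))
# [('f1.html', ('a', 'b')), ('f1.html', ('b', 'c'))]
```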

  6. Extract n-grams for the English segments as well (make sure the data is tokenized in the same way as the translated data):

    cat mammusique.com.keep | python /home/buck/net/build/DataCollection/baseline/ngrams.py -n 4 | sort > mammusique.com.engrams

  7. Compute the idf-weighted cosine distance between all source and target documents (we skip the 5-gram based matching step for now):

    python /home/buck/net/build/DataCollection/baseline/score_ngrams.py mammusique.com.engrams mammusique.com.tngrams mammusique.com.languages -outfile mammusique.com.matches

Output format is:

 `source_file<TAB>target_file<TAB>cosine_distance`
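The idf-weighted cosine scoring can be sketched as follows (a toy stand-in for score_ngrams.py; documents are represented as lists of n-gram strings):

```python
# Sketch of idf-weighted cosine similarity between the English n-grams of
# a candidate document and the translated n-grams of a French document.

import math
from collections import Counter

def idf_weights(docs):
    """docs: list of n-gram collections; idf(g) = log(N / df(g))."""
    n_docs = len(docs)
    df = Counter(g for doc in docs for g in set(doc))
    return {g: math.log(n_docs / df[g]) for g in df}

def cosine(doc_a, doc_b, idf):
    a = Counter(doc_a)
    b = Counter(doc_b)
    dot = sum(a[g] * b[g] * idf.get(g, 0.0) ** 2 for g in a if g in b)
    norm_a = math.sqrt(sum((a[g] * idf.get(g, 0.0)) ** 2 for g in a))
    norm_b = math.sqrt(sum((b[g] * idf.get(g, 0.0)) ** 2 for g in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

idf = idf_weights([["a b", "b c"], ["x y"]])
print(round(cosine(["a b", "b c"], ["a b", "b c"], idf), 2))  # 1.0
```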