~mj/research/rerank/README
(c) Mark Johnson, 24th April 2005
Mark_Johnson@Brown.edu
This directory contains programs for reranking the parses produced by
Michael Collins' generative parser. This is research software: it is
under constant development and change, and is not as user-friendly
or as well documented as might be desired.
These programs are (c) Mark Johnson, but may be used freely by anyone
for any purpose. If you use my code, I request acknowledgement both
in any programs that contain my code, and in any research papers or
articles that contain results obtained using this code.
Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:
The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
INSTALLATION
============
This software is built with g++ and gcc 4.0; it may also compile with
earlier versions of g++ and gcc.
A single "make" command in the top-level directory should run the feature
extractor on the trees, run the training program to estimate the model,
and run the evaluation program to evaluate the model's predictions.
DIRECTORY STRUCTURE
===================
ec/ contains Eugene Charniak's original n-best data
collins/ contains Michael Collins' original n-best data
trees/ contains the trees extracted from Eugene's and Michael's data
(see the README file in this directory for more information).
notes/ contains notes on various parts of this process.
The top-level Makefile defines variables FEATDIR and MODELDIR, which
name the directories into which the extracted features and the
estimated model are written. See below for more information.
programs/ contains the code for doing all of this stuff
programs/prepare-data contains programs for converting Charniak's and
Collins' parser output into the format my programs expect (e.g.,
into treebank format trees)
programs/features contains programs for extracting features from
trees. These are the programs to modify if you want to experiment
with new features.
programs/wlle contains the feature weight estimation programs. These
are the programs to modify if you want to experiment with new
estimation (i.e., ``learning'') procedures.
programs/features-best contains the version of programs/features that
is currently producing the best results. Right now this is a very old
version of the program which is not fully compatible with all of the
estimation procedures.
In programs/features/
spfeatures.h extracts features from trees (define new features here)
extract-spfeatures.cc actually does feature extraction
sp-data.h new program for reading the tree data files
oracle-score.cc program for computing oracle parsing performance
heads.cc/h head-finding code
read-tree.l reads treebank trees
dp-data.h old program for reading the tree data files
features.h old set of features
In programs/wlle/
lm.cc program for estimating linear models, with lots of options for
using different learning procedures (see lm.cc for details)
cvlm.cc program for estimating linear models that cross-validates regularizer
weights using heldout data. It is not clear whether it produces better
results than lm.cc
lmdata.c/h reads feature-count data file and computes basic statistics
data.c/h reads feature-count data file and computes basic statistics (old version
of lmdata.h)
avper.cc averaged perceptron (old)
cvwlle.cc MaxEnt estimator which tunes regularizer weights on heldout data (old)
lle.cc MaxEnt estimator (old)
lnne.cc Neural net model (perhaps buggy)
oracle.cc computes oracle performance
powell.h Powell's minimization algorithm
tao-optimizer.h Interface to PETSc/TAO
wavper.cc averaged perceptron which tunes feature scale factors on heldout data (old)
wlle.cc Regularized Maxent estimator (old)
Note: if you change programs in programs/features and/or
programs/wlle, you should rerun make in those directories. A
top-level make does not always correctly rebuild programs in these
embedded directories.
FEATURE AND MODEL FILES
=======================
Because it's useful to be able to experiment with several different
kinds of features and models at the same time, features and models are
stored in directories with user-specified names given in the top-level
Makefile. $(FEATDIR) is the directory where the feature counts are
stored, and $(FEATDIR)/$(MODELDIR) is where estimated models are stored.
Specifically, the feature extractor program constructs the following
files.
$(FEATDIR)/train.bz2 contains the feature counts from the training data
$(FEATDIR)/dev.bz2 contains the feature counts from the development data
$(FEATDIR)/train-feat.bz2 contains information about each feature.
The feature weight estimator programs read these files, and write
files
$(FEATDIR)/$(MODELDIR)/train-weights the weights assigned to features
$(FEATDIR)/$(MODELDIR)/train-weights.eval the evaluation of this model
$(FEATDIR)/$(MODELDIR)/dev-parses.bz2 sample parses, with errors indicated
Data format for input to feature weight estimators:
--------------------------------------------------
All of the feature weight estimators take feature-counts, rather than
parse trees, as input data. The trees must be converted to
feature-counts by programs such as extract-features.
<Data> --> [S=<NS>] <Sentence>*
<Sentence> --> [G=<G>] N=<N> <Parse>*
<Parse> --> [P=<P>] [W=<W>] <FC>*,
<FC> --> <F>[=<C>]
NS is the number of sentences.
Each <Sentence> consists of N <Parse>s. <G> is the number of edges
in the gold standard parse.
A <Parse> consists of an optional <P> score (the number of edges in
the parse), an optional <W> score (the number of edges in common between
the gold standard and this parse), and zero or more <FC> feature-count
pairs.
A <FC> consists of a feature (a non-negative integer) and an optional
count (a positive integer).
The default for all numbers except <W> is 1. The default for <W> is 0.
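The grammar and defaults above can be sketched as a small reader, written
here in Python for illustration; the real readers are lmdata.c/h (and the
older data.c/h), and the function name and output layout here are
assumptions, not the actual API.

```python
def parse_feature_data(text):
    """Sketch of a reader for the feature-count format:
       <Data>     --> [S=<NS>] <Sentence>*
       <Sentence> --> [G=<G>] N=<N> <Parse>*
       <Parse>    --> [P=<P>] [W=<W>] <FC>*,
       <FC>       --> <F>[=<C>]
    Defaults: every number except <W> defaults to 1; <W> defaults to 0."""
    toks = text.replace(',', ' , ').split()
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def opt(prefix, default):
        # Consume an optional "<prefix>=<n>" token, else return the default.
        t = peek()
        if t is not None and t.startswith(prefix + '='):
            return int(take().split('=', 1)[1])
        return default

    sentences = []
    opt('S', 1)                            # NS: number of sentences
    while peek() is not None:
        gold = opt('G', 1)                 # G: edges in the gold-standard parse
        n = int(take().split('=', 1)[1])   # mandatory N=<N>
        parses = []
        for _ in range(n):
            p = opt('P', 1)                # P: edges in this parse
            w = opt('W', 0)                # W: edges shared with the gold parse
            fc = {}
            while peek() not in (None, ','):
                f, _, c = take().partition('=')
                fc[int(f)] = int(c) if c else 1   # count defaults to 1
            if peek() == ',':
                take()                     # ',' terminates each parse
            parses.append({'P': p, 'W': w, 'features': fc})
        sentences.append({'G': gold, 'parses': parses})
    return sentences
```

For example, "S=1 G=10 N=2 P=9 W=8 3 7=2 , P=10 W=10 3 5 ," describes one
sentence with two parses, the second of which matches the gold parse exactly.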
--------------------------------------------------------------
Michael Collins' training set starts at tree number 3914 in
wsj_0200.mrg and ends at tree 40025 in wsj_1999.mrg.
The training set contains parses for 35540 of the 36112 trees in
this part of the corpus (this is 98.4% of the trees).
On the trees he parsed, oracle precision = 618325/646298 = 0.956718,
recall = 618325/646573 = 0.956311, f-score = 0.956515.
His devset starts at tree number 40026 in wsj_1999.mrg and ends at
tree 43745 in wsj_2172.mrg.
The devset contains parses for 3676 of the 3720 trees in this part of
the corpus (this is 98.8% of the trees).
On the trees he parsed, oracle precision = 63832/66901 = 0.954126,
recall = 63832/66889 = 0.954297, f-score = 0.954212.
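The precision/recall/f-score arithmetic behind the oracle numbers above can
be sketched as follows (the real computation lives in oracle-score.cc and
oracle.cc; the function name here is an assumption). "matched" is the number
of edges the best available parse shares with the gold standard, "test" the
number of edges that parse proposes, and "gold" the number of edges in the
gold-standard parse.

```python
def prf(matched, test, gold):
    # Precision: fraction of proposed edges that are correct.
    precision = matched / test
    # Recall: fraction of gold-standard edges that were recovered.
    recall = matched / gold
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Collins' training-set oracle counts from above:
p, r, f = prf(618325, 646298, 646573)   # ~0.9567, ~0.9563, ~0.9565
```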
---------------------------------------------------------------
Here are the results of a run on phobos, logged in make-ec50-spbest-f24-12-08.out:
phobos [31] % cat make-ec50-spbest-f24-12-08.out
mkdir -p trees
mkdir -p trees/ec-50-best
zcat ec/50-best/23.txt.gz | programs/prepare-data/prepare-ec-data 0 2416 | bzip2 -cz > trees/ec-50-best/f23.bz2
# The PTB corpus contained 2416 sentences.
# There were 120497 parses for 2416 different sentences, averaging 49.8746 parses per sentence.
mkdir -p trees
mkdir -p trees/ec-50-best
zcat ec/50-best/24.txt.gz | programs/prepare-data/prepare-ec-data 0 1345 | bzip2 -cz > trees/ec-50-best/f24.bz2
# The PTB corpus contained 1345 sentences.
# There were 67108 parses for 1345 different sentences, averaging 49.8944 parses per sentence.
mkdir -p ec50-spbest-ic-s5
programs/features-best/extract-spfeatures -i -c -s 5 trees/ec-50-best/traindev.bz2 ec50-spbest-ic-s5/traindev.gz \
trees/ec-50-best/f23.bz2 ec50-spbest-ic-s5/f23.gz \
trees/ec-50-best/f24.bz2 ec50-spbest-ic-s5/f24.gz \
| gzip > ec50-spbest-ic-s5/traindev-feat.gz
# debug_level (-d) = 0, absolute_counts (-a) = 0, collect_correct (-c) = 1, collect_gold (-g) = 0, collect_incorrect (-i) = 1, mincount (-s) = 5, lowercase_flag (-l) = 0
# maxid = 1148697, usage utime 7079.15s, vsize 740.129 Mb.
# reading from trees/ec-50-best/traindev.bz2, writing to ec50-spbest-ic-s5/traindev.gz, usage utime 30305.1s, vsize 740.137 Mb.
# reading from trees/ec-50-best/f23.bz2, writing to ec50-spbest-ic-s5/f23.gz, usage utime 31672.8s, vsize 740.137 Mb.
# reading from trees/ec-50-best/f24.bz2, writing to ec50-spbest-ic-s5/f24.gz, usage utime 32485.2s, vsize 740.137 Mb.
make -C programs wlle/cvlm
make[1]: Entering directory `/home/mj/research/rerank/programs'
make -C wlle cvlm
make[2]: Entering directory `/home/mj/research/rerank/programs/wlle'
make[2]: `cvlm' is up to date.
make[2]: Leaving directory `/home/mj/research/rerank/programs/wlle'
make[1]: Leaving directory `/home/mj/research/rerank/programs'
mkdir -p ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1
zcat ec50-spbest-ic-s5/traindev.gz | programs/wlle/cvlm -l 1 -c0 20 -Pyx_factor 1 -debug 10 -e ec50-spbest-ic-s5/f24.gz \
-o ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights | tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-run.out
# ltype = 1 regularization c0 = 20, power p = 2, scale s = 1; random init r = 0, constrained var optimization cv = 0, feature_count_base fcb = 0, Pyx_factor = 1, Px_propto_g = 0
# round ccs L R Q neglogP f-score
1 (20) 95506.6 9311.15 104818 5035.49 0.901791
2 (54.3656) 106060 6756.63 112817 5024.12 0.902479
3 (274.175) 117316 3880.61 121196 5126.63 0.899998
4 (100.863) 111118 5460.38 116578 5053.22 0.902099
5 (37.1055) 102383 7680.16 110063 5017.73 0.902632
6 (29.3032) 99890.6 8288.1 108179 5019.88 0.902379
7 (41.2915) 103459 7410.74 110869 5018.44 0.902711
8 (42.7844) 103806 7324.66 111131 5018.83 0.902322
9 (40.5854) 103287 7453.81 110741 5018.22 0.902811
10 (39.2193) 102944 7540.35 110485 5017.97 0.902754
11 (39.8982) 103117 7496.17 110614 5018.06 0.902754
# Regularizer class weights = (40.5854)
rm -f ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz
gzip ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights
make -C programs eval-weights/eval-weights
make[1]: Entering directory `/home/mj/research/rerank/programs'
make -C eval-weights eval-weights
make[2]: Entering directory `/home/mj/research/rerank/programs/eval-weights'
make[2]: `eval-weights' is up to date.
make[2]: Leaving directory `/home/mj/research/rerank/programs/eval-weights'
make[1]: Leaving directory `/home/mj/research/rerank/programs'
zcat ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz \
| programs/eval-weights/eval-weights ec50-spbest-ic-s5/traindev-feat.gz ec50-spbest-ic-s5/f23.gz \
| tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz.eval
# All 1148697 features: f-score = 0.910213, -log P = 5407.56, 1148697 nonzero, mean w = 0.00016022, sd w = 0.000163536
-0.0166332 2165.17 1 1 0.43162 inf LogProb
-0.000257641 10.9306 38544 38544 0.000296886 0.000101007 NGramTree:2:3:1:0
8.56742e-06 21.5827 39172 39172 0.000161906 0.000222667 NGramTree:2:2:1:0
-0.000168956 21.9148 97727 97727 0.000187439 0.000120165 NGramTree:3:2:1:0
-0.000847289 9.61796 28090 28090 0.000105056 0.000209352 NGramTree:2:0:1:0
-0.000346323 15.8889 88376 88376 0.000143591 0.000127728 NGramTree:3:0:1:0
-0.000339995 17.3621 26655 26655 8.5943e-05 0.000166026 HeadTree:0:0:0:0
-0.000212279 19.7646 45516 45516 0.000299874 0.000125844 HeadTree:0:1:0:0
-0.0038706 205.632 2 2 -0.586417 5.83308e-05 RightBranch
-3.16409e-05 19.2862 10703 10703 7.5864e-05 0.000277447 Rule:0:0:0:0:0:0:0:0
0.000741502 25.936 62550 62550 0.00011313 0.000146618 Rule:1:0:0:0:0:0:0:0
0.000302594 9.23864 18092 18092 4.98028e-05 0.000174426 Rule:0:1:0:0:0:0:0:0
0.000196315 1.36308 30662 30662 4.89151e-05 0.000147053 Rule:0:2:0:0:0:0:0:0
-0.000121179 4.8134 14232 14232 5.79989e-05 0.000216266 Rule:0:0:1:0:0:0:0:0
0.000212739 13.1389 73754 73754 0.00010138 0.00012467 Rule:1:0:1:0:0:0:0:0
0.000923552 2.0974 23480 23480 3.03548e-05 0.000164768 Rule:0:1:1:0:0:0:0:0
-0.000964881 4.09122 38182 38182 5.28909e-05 0.000165231 Rule:0:2:1:0:0:0:0:0
-0.000886551 9.21124 28941 28941 0.000174083 0.000140683 Heads:2:0:1:1
0.000569429 7.70173 41615 41615 0.000212104 0.000163996 Heads:3:0:1:0
0.000336315 4.3478 43832 43832 0.000249311 0.000156749 Heads:3:0:1:1
-6.17926e-05 9.8394 28209 28209 0.000291801 0.000189673 Heads:2:1:0:0
9.88537e-05 10.5884 39347 39347 0.000303031 0.000155084 Heads:2:1:0:1
-0.000115436 5.909 26655 26655 0.000300776 0.000152144 Heads:3:1:0:0
-0.000313323 5.8565 14980 14980 0.000228911 0.000124417 Word:2
-0.000610164 22.3727 34117 34117 0.000158534 0.000171613 Word:3
-0.00100201 30.3573 1049 1049 -0.00111895 0.00217447 Heavy
-0.000428516 23.7398 8146 8146 2.88531e-05 0.000569008 Neighbours:1:1
0.000294831 8.13289 30099 30099 5.31378e-05 0.000190449 Neighbours:2:1
-8.2435e-05 3.88512 2299 2299 -0.000139295 0.000383703 SynSemHeads:1
-0.000933757 30.5972 5795 5795 6.83546e-05 0.000943204 NGram:2:0:0:0
-0.000212052 16.9636 15858 15858 7.72061e-05 0.000352206 NGram:2:1:0:0
0.00034092 0.377593 32914 32914 6.72999e-05 0.000259938 NGram:2:2:0:0
-0.000136199 0.0070037 10 10 0.0772046 0.0207822 CoPar
3.95421e-05 5.2718 22 22 0.0108412 0.0115617 CoLenPar
-0.000171115 0.828648 300 300 8.22696e-05 0.000500958 SubjVerbAgr
1.67135e-05 6.59811 35426 35426 0.000194276 0.000111449 WProj:101
-5.77172e-05 5.67156 38921 38921 0.000164758 0.000110965 WProj:001
-0.000206155 3.48344 16489 16489 0.000191317 0.000113536 WProj:110
-0.000105115 2.65961 15964 15964 0.000137918 0.000110958 WProj:010
0.00018989 5.86012 25181 25181 0.000215002 0.000121698 WProj:111
-0.00027011 5.51061 26790 26790 0.000175404 0.000122975 WProj:011
zcat ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz \
| programs/eval-weights/eval-weights ec50-spbest-ic-s5/traindev-feat.gz ec50-spbest-ic-s5/f24.gz \
| tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz.eval
# All 1148697 features: f-score = 0.902754, -log P = 3257.89, 1148697 nonzero, mean w = 0.00016022, sd w = 0.000163536
-0.0196865 1066.83 1 1 0.43162 inf LogProb
-0.000484599 5.01254 38544 38544 0.000296886 0.000101007 NGramTree:2:3:1:0
-0.000405227 7.29741 39172 39172 0.000161906 0.000222667 NGramTree:2:2:1:0
0.000494782 4.05913 97727 97727 0.000187439 0.000120165 NGramTree:3:2:1:0
-0.000628317 5.95734 28090 28090 0.000105056 0.000209352 NGramTree:2:0:1:0
-0.000949176 1.51015 88376 88376 0.000143591 0.000127728 NGramTree:3:0:1:0
0.000392231 9.13379 26655 26655 8.5943e-05 0.000166026 HeadTree:0:0:0:0
8.60788e-05 7.80756 45516 45516 0.000299874 0.000125844 HeadTree:0:1:0:0
-0.00472015 100.028 2 2 -0.586417 5.83308e-05 RightBranch
-0.000207022 12.359 10703 10703 7.5864e-05 0.000277447 Rule:0:0:0:0:0:0:0:0
-0.000731421 19.1023 62550 62550 0.00011313 0.000146618 Rule:1:0:0:0:0:0:0:0
0.000283509 7.18216 18092 18092 4.98028e-05 0.000174426 Rule:0:1:0:0:0:0:0:0
0.000218823 -2.49168 30662 30662 4.89151e-05 0.000147053 Rule:0:2:0:0:0:0:0:0
-0.000948468 -0.173411 14232 14232 5.79989e-05 0.000216266 Rule:0:0:1:0:0:0:0:0
-0.000514399 8.51554 73754 73754 0.00010138 0.00012467 Rule:1:0:1:0:0:0:0:0
-0.000151691 2.59514 23480 23480 3.03548e-05 0.000164768 Rule:0:1:1:0:0:0:0:0
-0.000215256 -5.06602 38182 38182 5.28909e-05 0.000165231 Rule:0:2:1:0:0:0:0:0
0.000564362 3.61758 28941 28941 0.000174083 0.000140683 Heads:2:0:1:1
-0.000643008 3.50995 41615 41615 0.000212104 0.000163996 Heads:3:0:1:0
-2.53376e-05 0.471316 43832 43832 0.000249311 0.000156749 Heads:3:0:1:1
0.000188525 4.67735 28209 28209 0.000291801 0.000189673 Heads:2:1:0:0
1.38927e-05 3.26096 39347 39347 0.000303031 0.000155084 Heads:2:1:0:1
-0.000266945 4.28265 26655 26655 0.000300776 0.000152144 Heads:3:1:0:0
-0.000303777 0.383031 14980 14980 0.000228911 0.000124417 Word:2
0.000215015 3.70997 34117 34117 0.000158534 0.000171613 Word:3
-0.00257331 11.434 1049 1049 -0.00111895 0.00217447 Heavy
-0.002183 21.7296 8146 8146 2.88531e-05 0.000569008 Neighbours:1:1
-0.000143029 4.62153 30099 30099 5.31378e-05 0.000190449 Neighbours:2:1
-0.000168376 2.09884 2299 2299 -0.000139295 0.000383703 SynSemHeads:1
-0.00160916 30.1561 5795 5795 6.83546e-05 0.000943204 NGram:2:0:0:0
0.000399834 8.73101 15858 15858 7.72061e-05 0.000352206 NGram:2:1:0:0
0.000256753 -2.89172 32914 32914 6.72999e-05 0.000259938 NGram:2:2:0:0
-0.000327696 1.09118 10 10 0.0772046 0.0207822 CoPar
1.08069e-06 2.49936 22 22 0.0108412 0.0115617 CoLenPar
3.9228e-05 1.09613 300 300 8.22696e-05 0.000500958 SubjVerbAgr
-1.77063e-05 0.674698 35426 35426 0.000194276 0.000111449 WProj:101
0.000245469 1.38037 38921 38921 0.000164758 0.000110965 WProj:001
3.9228e-05 2.3027 16489 16489 0.000191317 0.000113536 WProj:110
-8.2274e-05 3.47672 15964 15964 0.000137918 0.000110958 WProj:010
1.00781e-05 1.16501 25181 25181 0.000215002 0.000121698 WProj:111
6.07482e-05 1.80004 26790 26790 0.000175404 0.000122975 WProj:011