~mj/research/rerank/README
(c) Mark Johnson, 24th April 2005
Mark_Johnson@Brown.edu
This directory contains programs for reranking the parses produced by
Michael Collins' generative parser. This is research software: it is
under constant development and change, and is not as user-friendly
or as well documented as might be desired.
These programs are (c) Mark Johnson, but may be used freely by anyone
for any purpose. If you use my code, I request acknowledgement both
in any programs that contain my code, and in any research papers or
articles that contain results obtained using this code.
Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:
The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
INSTALLATION
============
This software is built with g++ and gcc 4.0; it may also compile with
earlier versions of g++ and gcc.
A single "make" command in the top-level directory should run the feature
extractor on the trees, run the training program to estimate the model,
and run the evaluation program to evaluate the model's predictions.
DIRECTORY STRUCTURE
===================
ec/ contains Eugene Charniak's original n-best data
collins/ contains Michael Collins' original n-best data
trees/ contains the trees extracted from Eugene's and Michael's data
(see the README file in this directory for more information).
notes/ contains notes on various parts of this process.
The top-level Makefile defines variables FEATDIR and MODELDIR, which
name the directories into which the extracted features and the
estimated model are written. See below for more information.
programs/ contains the code for doing all of this stuff
programs/prepare-data contains programs for converting Charniak's and
Collins' parser output into the format my programs expect (e.g.,
into treebank format trees)
programs/features contains programs for extracting features from
trees. These are the programs to modify if you want to experiment
with new features.
programs/wlle contains the feature weight estimation programs. These
are the programs to modify if you want to experiment with new
estimation (i.e., ``learning'') procedures.
programs/features-best contains the version of programs/features that
is currently producing the best results. Right now this is a very old
version of the program which is not fully compatible with all of the
estimation procedures.
In programs/features/
spfeatures.h extracts features from trees (define new features here)
extract-spfeatures.cc actually does feature extraction
sp-data.h new program for reading the tree data files
oracle-score.cc program for computing oracle parsing performance
heads.cc/h head-finding code
read-tree.l reads treebank trees
dp-data.h old program for reading the tree data files
features.h old set of features
In programs/wlle/
lm.cc program for estimating linear models, with lots of options for
using different learning procedures (see lm.cc for details)
cvlm.cc program for estimating linear models that cross-validates regularizer
weights using heldout data. It is not clear whether it produces better
results than lm.cc
lmdata.c/h reads feature-count data file and computes basic statistics
data.c/h reads feature-count data file and computes basic statistics (old version
of lmdata.h)
avper.cc averaged perceptron (old)
cvwlle.cc MaxEnt estimator which tunes regularizer weights on heldout data (old)
lle.cc MaxEnt estimator (old)
lnne.cc Neural net model (perhaps buggy)
oracle.cc computes oracle performance
powell.h Powell's minimization algorithm
tao-optimizer.h Interface to PETSc/TAO
wavper.cc averaged perceptron which tunes feature scale factors on heldout data (old)
wlle.cc Regularized Maxent estimator (old)
Note: if you change programs in programs/features and/or
programs/wlle, you should rerun make in those directories. A
top-level make does not always correctly rebuild programs in these
embedded directories.
FEATURE AND MODEL FILES
=======================
Because it's useful to be able to experiment with several different
kinds of features and models at the same time, features and models are
stored in directories with user-specified names given in the top-level
Makefile. $(FEATDIR) is the directory where the feature counts are
stored, and $(FEATDIR)/$(MODELDIR) is where estimated models are stored.
Specifically, the feature extractor program constructs the following
files.
$(FEATDIR)/train.bz2 contains the feature counts from the training data
$(FEATDIR)/dev.bz2 contains the feature counts from the development data
$(FEATDIR)/train-feat.bz2 contains information about each feature.
The feature weight estimator programs read these files, and write
files
$(FEATDIR)/$(MODELDIR)/train-weights the weights assigned to features
$(FEATDIR)/$(MODELDIR)/train-weights.eval the evaluation of this model
$(FEATDIR)/$(MODELDIR)/dev-parses.bz2 sample parses, with errors indicated
Data format for input to feature weight estimators:
--------------------------------------------------
All of the feature weight estimators take feature-counts, rather than
parse trees, as input data. The trees must be converted to
feature-counts by programs such as extract-features.
<Data> --> [S=<NS>] <Sentence>*
<Sentence> --> [G=<G>] N=<N> <Parse>*
<Parse> --> [P=<P>] [W=<W>] <FC>*,
<FC> --> <F>[=<C>]
NS is the number of sentences.
Each <Sentence> consists of N <Parse>s. <G> is the number of edges
in the gold standard parse.
A <Parse> consists of an optional <P> score (the number of edges in
the parse), an optional <W> score (the number of edges in common between
the gold standard and this parse), and zero or more <FC> feature-count
pairs.
A <FC> consists of a feature (a non-negative integer) and an optional
count (a positive integer).
The default for all numbers except <W> is 1. The default for <W> is 0.
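The grammar and defaults above can be sketched as a small reader, written
here in Python for illustration; the real readers are lmdata.c/h (and the
older data.c/h), and the function name and output layout here are
assumptions, not the actual API.

```python
def parse_feature_data(text):
    """Sketch of a reader for the feature-count format:
       <Data>     --> [S=<NS>] <Sentence>*
       <Sentence> --> [G=<G>] N=<N> <Parse>*
       <Parse>    --> [P=<P>] [W=<W>] <FC>*,
       <FC>       --> <F>[=<C>]
    Defaults: every number except <W> defaults to 1; <W> defaults to 0."""
    toks = text.replace(',', ' , ').split()
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def opt(prefix, default):
        # Consume an optional "<prefix>=<n>" token, else return the default.
        t = peek()
        if t is not None and t.startswith(prefix + '='):
            return int(take().split('=', 1)[1])
        return default

    sentences = []
    opt('S', 1)                            # NS: number of sentences
    while peek() is not None:
        gold = opt('G', 1)                 # G: edges in the gold-standard parse
        n = int(take().split('=', 1)[1])   # mandatory N=<N>
        parses = []
        for _ in range(n):
            p = opt('P', 1)                # P: edges in this parse
            w = opt('W', 0)                # W: edges shared with the gold parse
            fc = {}
            while peek() not in (None, ','):
                f, _, c = take().partition('=')
                fc[int(f)] = int(c) if c else 1   # count defaults to 1
            if peek() == ',':
                take()                     # ',' terminates each parse
            parses.append({'P': p, 'W': w, 'features': fc})
        sentences.append({'G': gold, 'parses': parses})
    return sentences
```

For example, "S=1 G=10 N=2 P=9 W=8 3 7=2 , P=10 W=10 3 5 ," describes one
sentence with two parses, the second of which matches the gold parse exactly.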
--------------------------------------------------------------
Michael Collins' training set starts at tree number 3914 in
wsj_0200.mrg and ends at tree 40025 in wsj_1999.mrg.
The training set contains parses for 35540 of the 36112 trees in
this part of the corpus (this is 98.4% of the trees).
On the trees he parsed, oracle precision = 618325/646298 = 0.956718,
recall = 618325/646573 = 0.956311, f-score = 0.956515.
His devset starts at tree number 40026 in wsj_1999.mrg and ends at
tree 43745 in wsj_2172.mrg.
The devset contains parses for 3676 of the 3720 trees in this part of
the corpus (this is 98.8% of the trees).
On the trees he parsed, oracle precision = 63832/66901 = 0.954126,
recall = 63832/66889 = 0.954297, f-score = 0.954212.
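The precision/recall/f-score arithmetic behind the oracle numbers above can
be sketched as follows (the real computation lives in oracle-score.cc and
oracle.cc; the function name here is an assumption). "matched" is the number
of edges the best available parse shares with the gold standard, "test" the
number of edges that parse proposes, and "gold" the number of edges in the
gold-standard parse.

```python
def prf(matched, test, gold):
    # Precision: fraction of proposed edges that are correct.
    precision = matched / test
    # Recall: fraction of gold-standard edges that were recovered.
    recall = matched / gold
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Collins' training-set oracle counts from above:
p, r, f = prf(618325, 646298, 646573)   # ~0.9567, ~0.9563, ~0.9565
```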
---------------------------------------------------------------
Here are the results of a run on phobos, logged in make-ec50-spbest-f24-12-08.out:
phobos [31] % cat make-ec50-spbest-f24-12-08.out
mkdir -p trees
mkdir -p trees/ec-50-best
zcat ec/50-best/23.txt.gz | programs/prepare-data/prepare-ec-data 0 2416 | bzip2 -cz > trees/ec-50-best/f23.bz2
# The PTB corpus contained 2416 sentences.
# There were 120497 parses for 2416 different sentences, averaging 49.8746 parses per sentence.
mkdir -p trees
mkdir -p trees/ec-50-best
zcat ec/50-best/24.txt.gz | programs/prepare-data/prepare-ec-data 0 1345 | bzip2 -cz > trees/ec-50-best/f24.bz2
# The PTB corpus contained 1345 sentences.
# There were 67108 parses for 1345 different sentences, averaging 49.8944 parses per sentence.
mkdir -p ec50-spbest-ic-s5
programs/features-best/extract-spfeatures -i -c -s 5 trees/ec-50-best/traindev.bz2 ec50-spbest-ic-s5/traindev.gz \
trees/ec-50-best/f23.bz2 ec50-spbest-ic-s5/f23.gz \
trees/ec-50-best/f24.bz2 ec50-spbest-ic-s5/f24.gz \
| gzip > ec50-spbest-ic-s5/traindev-feat.gz
# debug_level (-d) = 0, absolute_counts (-a) = 0, collect_correct (-c) = 1, collect_gold (-g) = 0, collect_incorrect (-i) = 1, mincount (-s) = 5, lowercase_flag (-l) = 0
# maxid = 1148697, usage utime 7079.15s, vsize 740.129 Mb.
# reading from trees/ec-50-best/traindev.bz2, writing to ec50-spbest-ic-s5/traindev.gz, usage utime 30305.1s, vsize 740.137 Mb.
# reading from trees/ec-50-best/f23.bz2, writing to ec50-spbest-ic-s5/f23.gz, usage utime 31672.8s, vsize 740.137 Mb.
# reading from trees/ec-50-best/f24.bz2, writing to ec50-spbest-ic-s5/f24.gz, usage utime 32485.2s, vsize 740.137 Mb.
make -C programs wlle/cvlm
make[1]: Entering directory `/home/mj/research/rerank/programs'
make -C wlle cvlm
make[2]: Entering directory `/home/mj/research/rerank/programs/wlle'
make[2]: `cvlm' is up to date.
make[2]: Leaving directory `/home/mj/research/rerank/programs/wlle'
make[1]: Leaving directory `/home/mj/research/rerank/programs'
mkdir -p ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1
zcat ec50-spbest-ic-s5/traindev.gz | programs/wlle/cvlm -l 1 -c0 20 -Pyx_factor 1 -debug 10 -e ec50-spbest-ic-s5/f24.gz \
-o ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights | tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-run.out
# ltype = 1 regularization c0 = 20, power p = 2, scale s = 1; random init r = 0, constrained var optimization cv = 0, feature_count_base fcb = 0, Pyx_factor = 1, Px_propto_g = 0
# round ccs L R Q neglogP f-score
1 (20) 95506.6 9311.15 104818 5035.49 0.901791
2 (54.3656) 106060 6756.63 112817 5024.12 0.902479
3 (274.175) 117316 3880.61 121196 5126.63 0.899998
4 (100.863) 111118 5460.38 116578 5053.22 0.902099
5 (37.1055) 102383 7680.16 110063 5017.73 0.902632
6 (29.3032) 99890.6 8288.1 108179 5019.88 0.902379
7 (41.2915) 103459 7410.74 110869 5018.44 0.902711
8 (42.7844) 103806 7324.66 111131 5018.83 0.902322
9 (40.5854) 103287 7453.81 110741 5018.22 0.902811
10 (39.2193) 102944 7540.35 110485 5017.97 0.902754
11 (39.8982) 103117 7496.17 110614 5018.06 0.902754
# Regularizer class weights = (40.5854)
rm -f ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz
gzip ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights
make -C programs eval-weights/eval-weights
make[1]: Entering directory `/home/mj/research/rerank/programs'
make -C eval-weights eval-weights
make[2]: Entering directory `/home/mj/research/rerank/programs/eval-weights'
make[2]: `eval-weights' is up to date.
make[2]: Leaving directory `/home/mj/research/rerank/programs/eval-weights'
make[1]: Leaving directory `/home/mj/research/rerank/programs'
zcat ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz \
| programs/eval-weights/eval-weights ec50-spbest-ic-s5/traindev-feat.gz ec50-spbest-ic-s5/f23.gz \
| tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz.eval
# All 1148697 features: f-score = 0.910213, -log P = 5407.56, 1148697 nonzero, mean w = 0.00016022, sd w = 0.000163536
-0.0166332 2165.17 1 1 0.43162 inf LogProb
-0.000257641 10.9306 38544 38544 0.000296886 0.000101007 NGramTree:2:3:1:0
8.56742e-06 21.5827 39172 39172 0.000161906 0.000222667 NGramTree:2:2:1:0
-0.000168956 21.9148 97727 97727 0.000187439 0.000120165 NGramTree:3:2:1:0
-0.000847289 9.61796 28090 28090 0.000105056 0.000209352 NGramTree:2:0:1:0
-0.000346323 15.8889 88376 88376 0.000143591 0.000127728 NGramTree:3:0:1:0
-0.000339995 17.3621 26655 26655 8.5943e-05 0.000166026 HeadTree:0:0:0:0
-0.000212279 19.7646 45516 45516 0.000299874 0.000125844 HeadTree:0:1:0:0
-0.0038706 205.632 2 2 -0.586417 5.83308e-05 RightBranch
-3.16409e-05 19.2862 10703 10703 7.5864e-05 0.000277447 Rule:0:0:0:0:0:0:0:0
0.000741502 25.936 62550 62550 0.00011313 0.000146618 Rule:1:0:0:0:0:0:0:0
0.000302594 9.23864 18092 18092 4.98028e-05 0.000174426 Rule:0:1:0:0:0:0:0:0
0.000196315 1.36308 30662 30662 4.89151e-05 0.000147053 Rule:0:2:0:0:0:0:0:0
-0.000121179 4.8134 14232 14232 5.79989e-05 0.000216266 Rule:0:0:1:0:0:0:0:0
0.000212739 13.1389 73754 73754 0.00010138 0.00012467 Rule:1:0:1:0:0:0:0:0
0.000923552 2.0974 23480 23480 3.03548e-05 0.000164768 Rule:0:1:1:0:0:0:0:0
-0.000964881 4.09122 38182 38182 5.28909e-05 0.000165231 Rule:0:2:1:0:0:0:0:0
-0.000886551 9.21124 28941 28941 0.000174083 0.000140683 Heads:2:0:1:1
0.000569429 7.70173 41615 41615 0.000212104 0.000163996 Heads:3:0:1:0
0.000336315 4.3478 43832 43832 0.000249311 0.000156749 Heads:3:0:1:1
-6.17926e-05 9.8394 28209 28209 0.000291801 0.000189673 Heads:2:1:0:0
9.88537e-05 10.5884 39347 39347 0.000303031 0.000155084 Heads:2:1:0:1
-0.000115436 5.909 26655 26655 0.000300776 0.000152144 Heads:3:1:0:0
-0.000313323 5.8565 14980 14980 0.000228911 0.000124417 Word:2
-0.000610164 22.3727 34117 34117 0.000158534 0.000171613 Word:3
-0.00100201 30.3573 1049 1049 -0.00111895 0.00217447 Heavy
-0.000428516 23.7398 8146 8146 2.88531e-05 0.000569008 Neighbours:1:1
0.000294831 8.13289 30099 30099 5.31378e-05 0.000190449 Neighbours:2:1
-8.2435e-05 3.88512 2299 2299 -0.000139295 0.000383703 SynSemHeads:1
-0.000933757 30.5972 5795 5795 6.83546e-05 0.000943204 NGram:2:0:0:0
-0.000212052 16.9636 15858 15858 7.72061e-05 0.000352206 NGram:2:1:0:0
0.00034092 0.377593 32914 32914 6.72999e-05 0.000259938 NGram:2:2:0:0
-0.000136199 0.0070037 10 10 0.0772046 0.0207822 CoPar
3.95421e-05 5.2718 22 22 0.0108412 0.0115617 CoLenPar
-0.000171115 0.828648 300 300 8.22696e-05 0.000500958 SubjVerbAgr
1.67135e-05 6.59811 35426 35426 0.000194276 0.000111449 WProj:101
-5.77172e-05 5.67156 38921 38921 0.000164758 0.000110965 WProj:001
-0.000206155 3.48344 16489 16489 0.000191317 0.000113536 WProj:110
-0.000105115 2.65961 15964 15964 0.000137918 0.000110958 WProj:010
0.00018989 5.86012 25181 25181 0.000215002 0.000121698 WProj:111
-0.00027011 5.51061 26790 26790 0.000175404 0.000122975 WProj:011
zcat ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz \
| programs/eval-weights/eval-weights ec50-spbest-ic-s5/traindev-feat.gz ec50-spbest-ic-s5/f24.gz \
| tee ec50-spbest-ic-s5/cvlm-l1-c20-ns-1-Pyx1/traindev-weights.gz.eval
# All 1148697 features: f-score = 0.902754, -log P = 3257.89, 1148697 nonzero, mean w = 0.00016022, sd w = 0.000163536
-0.0196865 1066.83 1 1 0.43162 inf LogProb
-0.000484599 5.01254 38544 38544 0.000296886 0.000101007 NGramTree:2:3:1:0
-0.000405227 7.29741 39172 39172 0.000161906 0.000222667 NGramTree:2:2:1:0
0.000494782 4.05913 97727 97727 0.000187439 0.000120165 NGramTree:3:2:1:0
-0.000628317 5.95734 28090 28090 0.000105056 0.000209352 NGramTree:2:0:1:0
-0.000949176 1.51015 88376 88376 0.000143591 0.000127728 NGramTree:3:0:1:0
0.000392231 9.13379 26655 26655 8.5943e-05 0.000166026 HeadTree:0:0:0:0
8.60788e-05 7.80756 45516 45516 0.000299874 0.000125844 HeadTree:0:1:0:0
-0.00472015 100.028 2 2 -0.586417 5.83308e-05 RightBranch
-0.000207022 12.359 10703 10703 7.5864e-05 0.000277447 Rule:0:0:0:0:0:0:0:0
-0.000731421 19.1023 62550 62550 0.00011313 0.000146618 Rule:1:0:0:0:0:0:0:0
0.000283509 7.18216 18092 18092 4.98028e-05 0.000174426 Rule:0:1:0:0:0:0:0:0
0.000218823 -2.49168 30662 30662 4.89151e-05 0.000147053 Rule:0:2:0:0:0:0:0:0
-0.000948468 -0.173411 14232 14232 5.79989e-05 0.000216266 Rule:0:0:1:0:0:0:0:0
-0.000514399 8.51554 73754 73754 0.00010138 0.00012467 Rule:1:0:1:0:0:0:0:0
-0.000151691 2.59514 23480 23480 3.03548e-05 0.000164768 Rule:0:1:1:0:0:0:0:0
-0.000215256 -5.06602 38182 38182 5.28909e-05 0.000165231 Rule:0:2:1:0:0:0:0:0
0.000564362 3.61758 28941 28941 0.000174083 0.000140683 Heads:2:0:1:1
-0.000643008 3.50995 41615 41615 0.000212104 0.000163996 Heads:3:0:1:0
-2.53376e-05 0.471316 43832 43832 0.000249311 0.000156749 Heads:3:0:1:1
0.000188525 4.67735 28209 28209 0.000291801 0.000189673 Heads:2:1:0:0
1.38927e-05 3.26096 39347 39347 0.000303031 0.000155084 Heads:2:1:0:1
-0.000266945 4.28265 26655 26655 0.000300776 0.000152144 Heads:3:1:0:0
-0.000303777 0.383031 14980 14980 0.000228911 0.000124417 Word:2
0.000215015 3.70997 34117 34117 0.000158534 0.000171613 Word:3
-0.00257331 11.434 1049 1049 -0.00111895 0.00217447 Heavy
-0.002183 21.7296 8146 8146 2.88531e-05 0.000569008 Neighbours:1:1
-0.000143029 4.62153 30099 30099 5.31378e-05 0.000190449 Neighbours:2:1
-0.000168376 2.09884 2299 2299 -0.000139295 0.000383703 SynSemHeads:1
-0.00160916 30.1561 5795 5795 6.83546e-05 0.000943204 NGram:2:0:0:0
0.000399834 8.73101 15858 15858 7.72061e-05 0.000352206 NGram:2:1:0:0
0.000256753 -2.89172 32914 32914 6.72999e-05 0.000259938 NGram:2:2:0:0
-0.000327696 1.09118 10 10 0.0772046 0.0207822 CoPar
1.08069e-06 2.49936 22 22 0.0108412 0.0115617 CoLenPar
3.9228e-05 1.09613 300 300 8.22696e-05 0.000500958 SubjVerbAgr
-1.77063e-05 0.674698 35426 35426 0.000194276 0.000111449 WProj:101
0.000245469 1.38037 38921 38921 0.000164758 0.000110965 WProj:001
3.9228e-05 2.3027 16489 16489 0.000191317 0.000113536 WProj:110
-8.2274e-05 3.47672 15964 15964 0.000137918 0.000110958 WProj:010
1.00781e-05 1.16501 25181 25181 0.000215002 0.000121698 WProj:111
6.07482e-05 1.80004 26790 26790 0.000175404 0.000122975 WProj:011