
Converting from Penn Treebank to Basic Stanford Dependencies

Maria (Masha) Alexeeva edited this page May 6, 2020 · 3 revisions


Two corpora, GENIA and the Wall Street Journal, are used for training, and both are available in Penn Treebank format. To convert these files to basic Stanford dependencies (.conllx format), use the Stanford parser. After unzipping the parser, run the following command from its root directory, where treebank (the argument to the -treeFile option) is the file you want to convert:

java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile treebank > treebank.conllx
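A quick way to sanity-check the result: CoNLL-X output has one token per line with exactly ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences. A minimal sketch of such a check, shown on a two-token inline sample (point the awk line at your treebank.conllx instead; the sample file name and token values are made up for illustration):

```shell
# Build a two-token stand-in for treebank.conllx, then count non-blank
# lines that do not have exactly 10 tab-separated columns; a well-formed
# CoNLL-X file yields zero such lines.
printf '1\tCells\t_\tNNS\tNNS\t_\t2\tnsubj\t_\t_\n' > sample.conllx
printf '2\tdie\t_\tVBP\tVBP\t_\t0\troot\t_\t_\n' >> sample.conllx
bad=$(awk -F'\t' 'NF && NF != 10' sample.conllx | wc -l | tr -d ' ')
echo "malformed lines: ${bad}"   # prints: malformed lines: 0
```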

Note: CoreNLP 3.5.2 and above use Universal Dependencies, so use version 3.5.1, which can be found here.

Note: if you run version 3.5.1 as is, you may run into the following warning:

UniversalPOSMapper: Warning - could not load Tsurgeon file from edu/stanford/nlp/models/upos/ENUniversalPOS.tsurgeon.

The conversion will go through even without this file, but the output will be missing universal POS tags. One way to solve this is to add a jar containing the missing file.

You can follow these steps:

  1. Get the missing file from here. Provided the file is still in the same location, you can fetch it by running:
curl -O https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon
  2. From the parser's root directory, create the directory where the missing file should go, based on the path in the warning (e.g., edu/stanford/nlp/models/upos/). You can do this by running:
mkdir -p edu/stanford/nlp/models/upos/
  3. Move the missing file into the newly created directory:
mv ENUniversalPOS.tsurgeon edu/stanford/nlp/models/upos/
  4. Create the jar:
jar cf stanford-parser-3.5.1-missing-file.jar edu/stanford/nlp/models/upos/ENUniversalPOS.tsurgeon

The GENIA corpus is available in Penn Treebank format here. Its [train|dev|test|future_use].trees files can be converted directly with the command above. Our copy of the Wall Street Journal treebank is split into batches under a directory called wsj (note: use the combined format of these files). After moving or copying this directory into the Stanford parser root directory, the following bash script converts all batches into a separate directory called wsj-conllx (note that this can take over half an hour to finish):

base=wsj
outputDir="wsj-conllx"
mkdir -p "${outputDir}"
for dir in "${base}"/*
do
	resultDir="${outputDir}/$(basename "${dir}")"
	mkdir -p "${resultDir}"
	for f in "${dir}"/*.mrg
	do
		outputFile="${resultDir}/$(basename "${f}" .mrg).conllx"
		java -cp "*" -mx1g edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile "${f}" > "${outputFile}"
		echo "Processed ${f}, output in ${outputFile}"
	done
done
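After the batch run finishes, it is worth checking for empty .conllx files, which usually mean the converter failed on that input. A sketch of the check, demonstrated on a throwaway demo-conllx layout so it does not touch your real output (the directory and file names here are made up; run the find line against wsj-conllx after the real run):

```shell
# Mock layout standing in for wsj-conllx: one good file, one empty (failed) one.
mkdir -p demo-conllx/00
printf '1\tHello\t_\tUH\tUH\t_\t0\troot\t_\t_\n' > demo-conllx/00/wsj_0001.conllx
: > demo-conllx/00/wsj_0002.conllx
# List empty outputs; any hits on real data should be inspected or re-converted.
find demo-conllx -name '*.conllx' -empty   # prints: demo-conllx/00/wsj_0002.conllx
```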

For more information, consult the Stanford Dependencies page, in particular the section "SD for English".