Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. The model takes images of single words or text lines (multiple words) as input and outputs the recognized text. 3/4 of the words from the validation-set are correctly recognized, and the character error rate is around 10%.
- Download one of the pretrained models
- Model trained on word images: only handles single words per image, but gives better results on the IAM word dataset
- Model trained on text line images: can handle multiple words in one image
- Put the contents of the downloaded zip-file into the
model
directory of the repository - Go to the
src
directory - Run inference code:
- Execute
python main.py
to run the model on an image of a word - Execute
python main.py --img_file ../data/line.png
to run the model on an image of a text line
- Execute
The input images, and the expected outputs are shown below when the text line model is used.
> python main.py
Init with stored values from ../model/snapshot-13
Recognized: "word"
Probability: 0.9806370139122009
> python main.py --img_file ../data/line.png
Init with stored values from ../model/snapshot-13
Recognized: "or work on line level"
Probability: 0.6674373149871826
--mode
: select between "train", "validate" and "infer". Defaults to "infer".--decoder
: select from CTC decoders "bestpath", "beamsearch" and "wordbeamsearch". Defaults to "bestpath". For option "wordbeamsearch" see details below.--batch_size
: batch size.--data_dir
: directory containing IAM dataset (with subdirectoriesimg
andgt
).--fast
: use LMDB to load images faster.--line_mode
: train reading text lines instead of single words.--img_file
: image that is used for inference.--dump
: dumps the output of the NN to CSV file(s) saved in thedump
folder. Can be used as input for the [CTCDecoder]
The word beam search decoder can be used instead of the two decoders shipped with TF. Words are constrained to those contained in a dictionary, but arbitrary non-word character strings (numbers, punctuation marks) can still be recognized. The following illustration shows a sample for which word beam search is able to recognize the correct text, while the other decoders fail.
Follow these instructions to get the IAM dataset:
- Register for free at this website
- Download
words/words.tgz
- Download
ascii/words.txt
- Create a directory for the dataset on your disk, and create two subdirectories:
img
andgt
- Put
words.txt
into thegt
directory - Put the content (directories
a01
,a02
, ...) ofwords.tgz
into theimg
directory
- Delete files from
model
directory if you want to train from scratch - Go to the
src
directory and executepython main.py --mode train --data_dir path/to/IAM
- The IAM dataset is split into 95% training data and 5% validation data
- If the option
--line_mode
is specified, the model is trained on text line images created by combining multiple word images into one - Training stops after a fixed number of epochs without improvement
The pretrained word model was trained with this command on a GTX 1050 Ti:
python main.py --mode train --fast --data_dir path/to/iam --batch_size 500 --early_stopping 15
And the line model with:
python main.py --mode train --fast --data_dir path/to/iam --batch_size 250 --early_stopping 10
Loading and decoding the png image files from the disk is the bottleneck even when using only a small GPU. The database LMDB is used to speed up image loading:
- Go to the
src
directory and runcreate_lmdb.py --data_dir path/to/iam
with the IAM data directory specified - A subfolder
lmdb
is created in the IAM data directory containing the LMDB files - When training the model, add the command line option
--fast
The dataset should be located on an SSD drive.
Using the --fast
option and a GTX 1050 Ti training on single words takes around 3h with a batch size of 500.
Training on text lines takes a bit longer.
The model is a stripped-down version of the HTR system I implemented for my thesis. What remains is the bare minimum to recognize text with an acceptable accuracy. It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. For more details see this Medium article.