auto-tesseract

What

This is a Python 3 service that uses Linux's inotify API to detect new PDFs in --in_dir and automatically rewrites them as searchable PDFs in --out_dir. This is particularly useful if, for example, you have a scanner that outputs PDFs onto a fileserver (as we do :-)).

This program uses Tesseract OCR to produce the text, and ImageMagick to convert the original PDF to TIFF format for Tesseract to read. In both cases, the program relies on the command-line tools from these packages, so both need to be installed.
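The two-step conversion described above can be sketched roughly as follows. This is an illustration only, not the code in main.py: the helper names (convert_command, tesseract_command, ocr_pdf), the 300 DPI default, and the choice of flags are assumptions. It assumes ImageMagick's `convert` binary (ImageMagick 7 installs it as `magick`) and Tesseract's `pdf` output config are available.

```python
import subprocess
from pathlib import Path
from typing import List


def convert_command(pdf: Path, tiff: Path, dpi: int = 300) -> List[str]:
    """Build an ImageMagick command that rasterizes a PDF into a TIFF.

    The 300 DPI default is an assumption; pick whatever suits your scans.
    """
    return ["convert", "-density", str(dpi), str(pdf), "-depth", "8", str(tiff)]


def tesseract_command(tiff: Path, out_pdf: Path) -> List[str]:
    """Build a Tesseract command that writes a searchable PDF.

    Tesseract takes an output *base* name and appends ".pdf" itself,
    so the suffix is stripped from the requested output path.
    """
    return ["tesseract", str(tiff), str(out_pdf.with_suffix("")), "pdf"]


def ocr_pdf(pdf: Path, out_pdf: Path, tiff: Path) -> None:
    """Run both steps in order; raises CalledProcessError if either fails."""
    subprocess.run(convert_command(pdf, tiff), check=True)
    subprocess.run(tesseract_command(tiff, out_pdf), check=True)
```

Building the argument lists separately from running them keeps the commands easy to inspect and test without either tool installed.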

This package also depends on Linux's inotify API, so it requires Linux.
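For readers unfamiliar with inotify, here is a minimal sketch of how raw events read from an inotify file descriptor could be decoded and filtered. It is not taken from main.py, which may well use a wrapper library instead. The function names are hypothetical; the record layout and the two mask constants follow the Linux inotify headers. IN_CLOSE_WRITE corresponds to a file that has finished being copied in, and IN_MOVED_TO to a file moved into the watched directory.

```python
import struct
from typing import Iterator, Tuple

# Event mask bits from <sys/inotify.h>.
IN_CLOSE_WRITE = 0x00000008  # a file opened for writing was closed
IN_MOVED_TO = 0x00000080     # a file was moved into the watched directory

# struct inotify_event header: int wd; uint32 mask; uint32 cookie; uint32 len
_HEADER = struct.Struct("iIII")


def parse_events(buf: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (mask, filename) for each event record in a raw read() buffer."""
    offset = 0
    while offset + _HEADER.size <= len(buf):
        _wd, mask, _cookie, name_len = _HEADER.unpack_from(buf, offset)
        offset += _HEADER.size
        # The name field is NUL-padded up to name_len bytes.
        raw_name = buf[offset:offset + name_len].rstrip(b"\0")
        offset += name_len
        yield mask, raw_name.decode("utf-8", "replace")


def should_ocr(mask: int, name: str) -> bool:
    """True if the event marks a fully written or moved-in PDF (hypothetical filter)."""
    finished = bool(mask & (IN_CLOSE_WRITE | IN_MOVED_TO))
    return finished and name.lower().endswith(".pdf")
```

Waiting for IN_CLOSE_WRITE rather than reacting to file creation avoids running OCR on a PDF that is still being written.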

Usage

This program has two main arguments: --in_dir and --out_dir. Any PDF copied or moved into --in_dir will automatically trigger the process, and a searchable version of the PDF will appear shortly in --out_dir. The original is untouched.

% ./main.py --in_dir=input --out_dir=output
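As an illustration of the command line above, the two flags could be parsed along these lines. This is a sketch using the standard library's argparse; the actual main.py may use a different flag library, and parse_args is a hypothetical name.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse --in_dir and --out_dir; both are treated as required here."""
    parser = argparse.ArgumentParser(
        description="Watch a directory and write searchable copies of new PDFs."
    )
    parser.add_argument("--in_dir", required=True,
                        help="directory watched for incoming PDFs")
    parser.add_argument("--out_dir", required=True,
                        help="directory that receives the searchable PDFs")
    return parser.parse_args(argv)
```

Calling parse_args(["--in_dir=input", "--out_dir=output"]) yields a namespace whose in_dir and out_dir attributes hold the two paths as strings.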

This program also backfills missing work: if the input directory contains files that do not exist in the output directory, auto-tesseract OCRs them immediately on startup.
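The backfill check can be pictured as a simple set difference between the two directories. The sketch below is an assumption about how it might work, not the code in main.py: missing_outputs is a hypothetical helper, and it assumes each output file keeps the same filename as its input.

```python
from pathlib import Path
from typing import List


def missing_outputs(in_dir: Path, out_dir: Path) -> List[Path]:
    """Return input PDFs that have no same-named file in out_dir, sorted by path."""
    done = {p.name for p in out_dir.glob("*.pdf")}
    return sorted(p for p in in_dir.glob("*.pdf") if p.name not in done)
```

Each path returned would then be fed through the same OCR step used for newly arrived files.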

Build

This program is built with Bazel.

To build the main package:

% bazel build :main

To build a Debian package:

% bazel build :main-deb

To run all the tests:

% bazel test ...

seanrees/auto-tesseract
