A CLI tool for extracting text from various document formats with configurable OCR capabilities.
- Multi-format Support: PDF, Word, HTML, E-books, Images, and text files
- Configurable OCR: llm-caller (Call LLM models) or surya_ocr (local OCR engine)
- Smart Content Strategy: Choose between text-first or image-first processing for PDFs
- Interactive Tool Selection: Auto-detects available tools and prompts for selection
- Automatic Tool Detection: No configuration files needed - tools are detected when required
- Cross-platform: macOS, Linux, Windows with automatic tool detection
- 🚀 Quick Start - Installation and basic usage
- 🔧 Development - Architecture and development guide
- 🏷️ Versioning - Version management and build system
# macOS
brew install ghostscript pandoc calibre && pip install surya-ocr
# Ubuntu/Linux
sudo apt-get install ghostscript pandoc calibre && pip install surya-ocr
# Windows (with Chocolatey)
choco install ghostscript pandoc calibre && pip install surya-ocr
Download pre-built binaries from Releases or build from source with go build
.
# Extract with interactive selection (prompts for OCR tool and content type)
doc-to-text document.pdf
# Use specific OCR tool
doc-to-text document.pdf --ocr surya_ocr
doc-to-text document.pdf --ocr llm-caller --llm-template qwen-vl-ocr
# Specify content processing strategy for PDFs
doc-to-text document.pdf --content-type text # Try Calibre first, OCR fallback
doc-to-text document.pdf --content-type image # Direct OCR processing
# Custom output
doc-to-text document.pdf -o output.txt
# Display and set language
doc-to-text language
# Switch language
DOC_TEXT_LANG=zh doc-to-text -V
# Version info
doc-to-text --version # Quick version
doc-to-text version # Detailed build info
The tool automatically detects required tools when needed:
- OCR Tools:
llm-caller
,surya_ocr
- Document Processing:
ebook-convert
(Calibre),pandoc
,gs
(Ghostscript) - Detection Strategy: Command lookup → Common paths → Clear error messages
# Temporarily override settings
DOC_TEXT_OCR_STRATEGY=surya_ocr doc-to-text document.pdf
DOC_TEXT_CONTENT_TYPE=text doc-to-text document.pdf
DOC_TEXT_MAX_CONCURRENCY=8 doc-to-text document.pdf
Setting | Description | Default |
---|---|---|
ocr_strategy |
OCR tool selection | interactive |
content_type |
PDF processing strategy | image |
max_concurrency |
Concurrent processes | 4 |
verbose |
Enable progress output | false |
Type | Extensions | Method |
---|---|---|
PDFs | .pdf |
OCR or Calibre (based on content-type) |
Images | .jpg , .png , .gif , .bmp , .tiff |
OCR |
Documents | .doc , .docx , .rtf , .odt , .ppt , .xls |
Pandoc |
Web | .html , .mhtml |
Built-in parser |
E-books | .epub , .mobi |
Calibre |
Text | .txt , .md , .json , .csv , .xml , .py , .js |
Direct reading |
- Local and multilingual (100+ languages)
- Installation:
pip install surya-ocr
- Best for: Standard documents, batch processing
llm-caller (Configurable AI)
- AI-powered with template-based approach
- Requires:
--llm-template
parameter - Best for: Complex layouts, handwritten text, specific models
- Default mode: Automatically prompts for tool selection
- Smart detection: Shows only available engines
- Auto-selection: For text content-type, automatically selects best tool without prompts
The --content-type
parameter determines PDF processing strategy:
text
: Tries Calibre first (fast for text-based PDFs), then OCR if failedimage
: Uses OCR directly (default, best for scanned documents)
Text is extracted to organized directories:
- Input:
/path/to/document.pdf
- Output:
/path/to/{md5_hash}/text.txt
- Pages:
/path/to/{md5_hash}/pages/
(for PDFs)
Large document processing can be interrupted and resumed. The tool automatically:
- Detects completed pages and skips them
- Continues from the last processed page
- Maintains processing state in intermediate directories
OCR tool not found: Tools are automatically detected. Ensure they are installed and available in your PATH
Permission errors: Ensure tools are executable and paths are accessible
Poor OCR quality: Try different OCR engines or ensure good source quality (300 DPI recommended)