[ PROMPT_NODE_25661 ]
PDF Processing Pro
[ SKILL_DOCUMENTATION ]
# PDF Processing Pro
Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.
## Quick start
### Extract text from PDF
```python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
text = pdf.pages[0].extract_text()
print(text)
```
### Analyze PDF form (using included script)
```bash
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
```
### Fill PDF form with validation
```bash
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
```
### Extract tables from PDF
```bash
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
```
## Features
### ✅ Production-ready scripts
All scripts include:
- **Error handling**: Graceful failures with detailed error messages
- **Validation**: Input validation and type checking
- **Logging**: Configurable logging with timestamps
- **Type hints**: Full type annotations for IDE support
- **CLI interface**: `--help` flag for all scripts
- **Exit codes**: Proper exit codes for automation
### ✅ Comprehensive workflows
- **PDF Forms**: Complete form processing pipeline
- **Table Extraction**: Advanced table detection and extraction
- **OCR Processing**: Scanned PDF text extraction
- **Batch Operations**: Process multiple PDFs efficiently
- **Validation**: Pre and post-processing validation
## Advanced topics
### PDF Form Processing
For complete form workflows including:
- Field analysis and detection
- Dynamic form filling
- Validation rules
- Multi-page forms
- Checkbox and radio button handling
See [FORMS.md](FORMS.md)
### Table Extraction
For complex table extraction:
- Multi-page tables
- Merged cells
- Nested tables
- Custom table detection
- Export to CSV/Excel
See [TABLES.md](TABLES.md)
### OCR Processing
For scanned PDFs and image-based documents:
- Tesseract integration
- Language support
- Image preprocessing
- Confidence scoring
- Batch OCR
See [OCR.md](OCR.md)
## Included scripts
### Form processing
**analyze_form.py** - Extract form field information
```bash
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
```
**fill_form.py** - Fill PDF forms with data
```bash
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
```
**validate_form.py** - Validate form data before filling
```bash
python scripts/validate_form.py data.json schema.json
```
### Table extraction
**extract_tables.py** - Extract tables to CSV/Excel
```bash
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
```
### Text extraction
**extract_text.py** - Extract text with formatting preservation
```bash
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
```
### Utilities
**merge_pdfs.py** - Merge multiple PDFs
```bash
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
```
**split_pdf.py** - Split PDF into individual pages
```bash
python scripts/split_pdf.py input.pdf --output-dir pages/
```
**validate_pdf.py** - Validate PDF integrity
```bash
python scripts/validate_pdf.py input.pdf
```
## Common workflows
### Workflow 1: Process form submissions
```bash
# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json
# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
# 4. Validate output
python scripts/validate_pdf.py completed.pdf
```
### Workflow 2: Extract data from reports
```bash
# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
```
### Workflow 3: Batch processing
```python
import glob
from pathlib import Path
import subprocess
# Process all PDFs in directory
for pdf_file in glob.glob("invoices/*.pdf"):
output_file = Path("processed") / Path(pdf_file).name
result = subprocess.run([
"python", "scripts/extract_text.py",
pdf_file,
"--output", str(output_file)
], capture_output=True)
if result.returncode == 0:
print(f"✓ Processed: {pdf_file}")
else:
print(f"✗ Failed: {pdf_file} - {result.stderr}")
```
## Error handling
All scripts follow consistent error patterns:
```python
# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error
# Example usage in automation
result = subprocess.run(["python", "scripts/fill_form.py", ...])
if result.returncode == 0:
print("Success")
elif result.returncode == 4:
print("Validation failed - check input data")
else:
print(f"Error occurred: {result.returncode}")
```
## Dependencies
All scripts require:
```bash
pip install pdfplumber pypdf pillow pytesseract pandas
```
Optional for OCR:
```bash
# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
```
## Performance tips
- **Use batch processing** for multiple PDFs
- **Enable multiprocessing** with `--parallel` flag (where supported)
- **Cache extracted data** to avoid re-processing
- **Validate inputs early** to fail fast
- **Use streaming** for large PDFs (>50MB)
## Best practices
1. **Always validate inputs** before processing
2. **Use try-except** in custom scripts
3. **Log all operations** for debugging
4. **Test with sample PDFs** before production
5. **Set timeouts** for long-running operations
6. **Check exit codes** in automation
7. **Backup originals** before modification
## Troubleshooting
### Common issues
**"Module not found" errors**:
```bash
pip install -r requirements.txt
```
**Tesseract not found**:
```bash
# Install tesseract system package (see Dependencies)
```
**Memory errors with large PDFs**:
```python
# Process page by page instead of loading entire PDF
with pdfplumber.open("large.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
# Process page immediately
```
**Permission errors**:
```bash
chmod +x scripts/*.py
```
## Getting help
All scripts support `--help`:
```bash
python scripts/analyze_form.py --help
python scripts/extract_tables.py --help
```
For detailed documentation on specific topics, see:
- [FORMS.md](FORMS.md) - Complete form processing guide
- [TABLES.md](TABLES.md) - Advanced table extraction
- [OCR.md](OCR.md) - Scanned PDF processing
Source: claude-code-templates (MIT). See About Us for full credits.