Read PDF

Extract text content from PDF files using pdftotext, with Python (pypdf) as a fallback.

Process

Check PDF accessibility - Verify that the file exists and is readable
Choose extraction tool - pdftotext for speed, Python (pypdf) as a fallback
Test extraction - Try a small section first
Handle large files - Extract page ranges to limit output
Clean up text - Remove extra whitespace and formatting artifacts
Verify quality - Check that the extracted text is non-empty and legible
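
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function names (is_readable_pdf, pdftotext_command, clean_text, looks_extracted, extract), the default page range, and the 20-character quality threshold are all assumptions made for the example. It assumes pdftotext is on the PATH and uses its -f and -l options to limit the page range.

```python
import re
import shutil
import subprocess
from pathlib import Path
from typing import List


def is_readable_pdf(path: Path) -> bool:
    """Step 1: the file can be opened and begins with the usual PDF header."""
    try:
        with path.open("rb") as fh:
            # Most PDFs start with "%PDF-"; this is a quick heuristic.
            return fh.read(5) == b"%PDF-"
    except OSError:
        return False


def pdftotext_command(path: Path, first: int, last: int) -> List[str]:
    """Steps 2-4: pdftotext call limited to pages first..last (1-based), output to stdout."""
    return ["pdftotext", "-f", str(first), "-l", str(last), str(path), "-"]


def clean_text(text: str) -> str:
    """Step 5: collapse runs of spaces/tabs and squeeze 3+ newlines into one blank line."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def looks_extracted(text: str, min_chars: int = 20) -> bool:
    """Step 6: crude quality check; the threshold is an arbitrary assumption."""
    return sum(ch.isalnum() for ch in text) >= min_chars


def extract(path: Path, first: int = 1, last: int = 3) -> str:
    """Run the whole process on a small page range."""
    if not is_readable_pdf(path):
        raise FileNotFoundError(f"not a readable PDF: {path}")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; use the Python fallback")
    result = subprocess.run(
        pdftotext_command(path, first, last),
        capture_output=True, text=True, check=True,
    )
    text = clean_text(result.stdout)
    if not looks_extracted(text):
        raise ValueError("little or no text extracted; the PDF may be scanned")
    return text
```

Calling extract(Path("/path/to/file.pdf")) would return cleaned text for the first three pages, or raise if the file is unreadable, the tool is missing, or almost nothing was extracted.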

Examples

# pdftotext (preferred)
pdftotext /path/to/file.pdf - | head -200

Python fallback

python3 - << 'PY'
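from pathlib import Path

try:
    import pypdf
except ImportError:
    raise SystemExit("pypdf not installed; try: pip install pypdf")

path = Path("/path/to/file.pdf")
reader = pypdf.PdfReader(str(path))
# extract_text() may return nothing for image-only pages, hence the `or ""`
text = "\n".join(page.extract_text() or "" for page in reader.pages)
# Print only the first 4000 characters to keep the output manageable
print(text[:4000])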
PY

Tips

Large PDFs: extract a page range to limit output, e.g. pdftotext -f 1 -l 5 /path/to/file.pdf - for pages 1 to 5.
For tables, consider extracting to CSV with tabula, if it is installed.
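
For the Python fallback, a page range can be handled along the following lines. This is a sketch under stated assumptions: page_slice and extract_pages are hypothetical helper names, and the range is taken as 1-based and inclusive to match pdftotext's -f and -l options. pypdf is imported lazily so that the index helper can be used without it.

```python
from pathlib import Path


def page_slice(first: int, last: int, total: int) -> range:
    """Convert a 1-based inclusive page range to 0-based indices, clamped to the document."""
    start = max(first, 1) - 1
    stop = min(last, total)
    # An out-of-bounds range yields an empty range rather than an error.
    return range(start, max(stop, start))


def extract_pages(path: Path, first: int, last: int) -> str:
    """Extract text from pages first..last only (hypothetical helper)."""
    import pypdf  # imported here so page_slice works without pypdf installed

    reader = pypdf.PdfReader(str(path))
    pages = reader.pages
    indices = page_slice(first, last, len(pages))
    return "\n".join(pages[i].extract_text() or "" for i in indices)
```

For example, extract_pages(Path("/path/to/file.pdf"), 1, 5) would read at most the first five pages, which keeps both run time and output size bounded on large files.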