Read PDF

Extract text content from PDF files using pdftotext, with a Python (pypdf) fallback.

Process

  1. Check PDF accessibility - Verify that the file exists and is readable
  2. Choose extraction tool - Use pdftotext for speed; fall back to Python (pypdf) if it is unavailable
  3. Test extraction - Try a small section first
  4. Handle large files - Extract page ranges to limit output
  5. Clean up text - Remove extra whitespace and formatting artifacts
  6. Verify quality - Check that the extracted text is non-empty and readable
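
Steps 1 and 2 can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the helper names check_pdf and choose_tool are hypothetical, and treating the %PDF- header as part of "readable" is an added assumption.

```python
import importlib.util
import os
import shutil
from pathlib import Path


def check_pdf(path):
    """Return True if path is an existing, readable file that starts with %PDF-."""
    p = Path(path)
    if not p.is_file() or not os.access(p, os.R_OK):
        return False
    with p.open("rb") as fh:
        # Assumption: a well-formed PDF starts with the magic bytes %PDF-.
        return fh.read(5) == b"%PDF-"


def choose_tool():
    """Prefer the pdftotext CLI, then pypdf; return None if neither is available."""
    if shutil.which("pdftotext"):
        return "pdftotext"
    if importlib.util.find_spec("pypdf") is not None:
        return "pypdf"
    return None
```

Calling check_pdf("/path/to/file.pdf") before choose_tool() lets a script fail early with a clear message instead of a tool-specific error.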

Examples

# pdftotext (preferred)
pdftotext /path/to/file.pdf - | head -200
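
For large files (step 4), pdftotext can restrict extraction to a page range with its -f (first page) and -l (last page) options. The sketch below wraps that in Python. The helper names pdftotext_cmd and extract_pages are hypothetical, and the code assumes pdftotext is on PATH.

```python
import subprocess


def pdftotext_cmd(path, first_page, last_page):
    """Build a pdftotext command that writes pages first_page..last_page to stdout."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", str(first_page), "-l", str(last_page), str(path), "-"]


def extract_pages(path, first_page, last_page):
    """Run pdftotext on a page range and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout
```

For example, extract_pages("/path/to/file.pdf", 1, 5) would return the text of the first five pages, which is usually enough for a quick test before extracting the whole file.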

# Python fallback

python3 - << 'PY'
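from pathlib import Path

try:
    import pypdf
except ImportError:
    raise SystemExit("pypdf not installed (try: pip install pypdf)")

path = Path("/path/to/file.pdf")
reader = pypdf.PdfReader(str(path))
# Fall back to "" for pages that yield no text (e.g. image-only pages).
text = "\n".join(page.extract_text() or "" for page in reader.pages)
# Print only the first 4000 characters to keep the output small.
print(text[:4000])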
PY
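
Steps 5 and 6 (clean up and verify) might look like the sketch below. The helper names clean_text and looks_extracted are hypothetical, and the thresholds (50 visible characters, 50% alphanumeric) are illustrative assumptions, not values from this document.

```python
import re


def clean_text(text):
    """Collapse runs of spaces/tabs and limit consecutive blank lines to one."""
    # pdftotext separates pages with form feeds; treat them as line breaks.
    text = text.replace("\r\n", "\n").replace("\f", "\n")
    text = re.sub(r"[ \t]+", " ", text)     # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


def looks_extracted(text, min_chars=50, min_alnum_ratio=0.5):
    """Heuristic check: enough visible characters, and most of them alphanumeric."""
    visible = [c for c in text if not c.isspace()]
    if len(visible) < min_chars:
        return False  # likely a scanned or image-only PDF
    alnum = sum(c.isalnum() for c in visible)
    return alnum / len(visible) >= min_alnum_ratio
```

If looks_extracted returns False on the cleaned text, the PDF is probably scanned or uses an unusual font encoding, and a different tool may be needed.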

Tips