Add a Compress operation to the free monad that finds shorter prompts while preserving output behavior.
Approach
Iterative compression: shorten the prompt, verify that the outputs are 'close enough', repeat. Not rate-distortion optimal, but practically useful for cost reduction.
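A minimal sketch of that loop, polymorphic in the prompt type and assuming the caller supplies a shrink heuristic and a fidelity check (both names here are hypothetical):

```haskell
-- Hypothetical sketch of the compress-and-verify loop.
-- `shrinkOnce` proposes a shorter candidate (or gives up);
-- `fidelity` runs both prompts and scores how close the outputs are
-- (1.0 = behaviourally identical).
compressLoop
  :: Double                   -- maximum acceptable distortion, e.g. 0.05
  -> (p -> Maybe p)           -- propose a shorter candidate prompt
  -> (p -> p -> IO Double)    -- fidelity of candidate vs. original
  -> p
  -> IO p
compressLoop maxDistortion shrinkOnce fidelity = go
  where
    go prompt = case shrinkOnce prompt of
      Nothing        -> pure prompt         -- nothing left to try
      Just candidate -> do
        score <- fidelity prompt candidate
        if 1 - score <= maxDistortion
          then go candidate                 -- accept and keep shrinking
          else pure prompt                  -- too lossy; stop here
```

A real version would likely backtrack and try a different shrink when a candidate fails, rather than stopping at the first rejection.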
Prior Art
- 'Fundamental Limits of Prompt Compression' (NeurIPS 2024) - theoretical framework
- LLMLingua, AutoCompressor - existing compression methods
Implementation
1. Add a Compress constructor to OpF with a distortion-bound parameter (see the sketch after this list)
2. Implement compression strategies:
- Remove redundant instructions
- Summarize examples (few-shot -> distilled)
- Identify sufficient statistics (what info actually matters)
3. Verification: run the original and compressed prompts on the same inputs and compare outputs
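For step 1, the constructor might look roughly like this, assuming the `free` package and a `Prompt` type from the existing DSL; everything beyond OpF and Compress is a placeholder name, and existing constructors are elided:

```haskell
{-# LANGUAGE DeriveFunctor #-}

import Control.Monad.Free (Free, liftF)
import Data.Text (Text)

type Prompt = Text  -- stand-in for the DSL's real prompt representation

-- Only the new constructor is shown. The Double is the distortion bound:
-- the maximum acceptable loss of output fidelity (0 = must match exactly).
data OpF next
  = Compress Prompt Double (Prompt -> next)
  deriving Functor

type Op = Free OpF

-- Smart constructor: ask the interpreter for a compressed prompt whose
-- behaviour stays within the distortion bound of the original's.
compress :: Prompt -> Double -> Op Prompt
compress p bound = liftF (Compress p bound id)
```

An interpreter for Compress would then run one of the strategies from step 2 and the verification from step 3 before handing back the compressed prompt.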
Metrics & Verification
- Compression ratio (tokens saved; sketched below)
- Output fidelity (embedding similarity, exact match on structured outputs)
- Cost savings in practice
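The first two metrics are cheap to compute. A sketch, assuming the backend's token counter is passed in as a function and structured outputs can be compared with Eq:

```haskell
-- Compression ratio: fraction of tokens saved, given some token counter.
compressionRatio :: (prompt -> Int) -> prompt -> prompt -> Double
compressionRatio countTokens original compressed =
  1 - fromIntegral (countTokens compressed)
    / fromIntegral (max 1 (countTokens original))

-- Exact-match fidelity over a batch of structured outputs.
exactMatch :: Eq a => [a] -> [a] -> Double
exactMatch xs ys =
  fromIntegral (length (filter id (zipWith (==) xs ys)))
    / fromIntegral (max 1 (length xs))
```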
Test Cases
- Verbose system prompts -> minimal equivalent (sketched below)
- Many-shot examples -> fewer examples with same behavior
- Long context -> compressed context
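Each of these can be phrased as "tokens go down, fidelity stays within the bound". A hypothetical HSpec-style shape for the first case, reusing compressLoop from the earlier sketch; verboseSystemPrompt, shrinkOnce, runFidelityCheck, and countTokens are all assumed helpers:

```haskell
import Test.Hspec

spec :: Spec
spec = describe "Compress" $
  it "shrinks a verbose system prompt without changing behaviour" $ do
    -- compressLoop enforces the 5% distortion bound internally.
    compressed <- compressLoop 0.05 shrinkOnce runFidelityCheck verboseSystemPrompt
    countTokens compressed `shouldSatisfy` (< countTokens verboseSystemPrompt)
```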
Notes
Need to define 'close enough' carefully. For tool-calling agents, maybe the exact same tool sequence; for free-form generation, maybe embedding similarity above a threshold?
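One way to keep both options open is to make 'close enough' a per-task policy rather than a single rule. A hypothetical sketch; Output, the String-based tool calls, and the cosine helper are all placeholders:

```haskell
-- Placeholder record for one run's result; a real run would carry the
-- model response, parsed tool calls, and an output embedding.
data Output = Output
  { toolCalls :: [String]
  , embedding :: [Double]
  }

-- 'Close enough' as a per-task policy.
data Fidelity
  = ExactToolSequence      -- tool-calling agents: same calls, same order
  | EmbeddingAbove Double  -- free-form generation: similarity threshold

closeEnough :: Fidelity -> Output -> Output -> Bool
closeEnough ExactToolSequence  a b = toolCalls a == toolCalls b
closeEnough (EmbeddingAbove t) a b = cosine (embedding a) (embedding b) >= t

-- Cosine similarity between two embedding vectors.
cosine :: [Double] -> [Double] -> Double
cosine u v = dot u v / (norm u * norm v)
  where
    dot a b = sum (zipWith (*) a b)
    norm w  = sqrt (dot w w)
```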