Research: rate-distortion theory for prompt compression

t-399 · WorkTask · Created 1 month ago · Updated 4 weeks ago

Description


Develop theoretical bounds and algorithms for optimal prompt compression.

Core Question

Given a prompt P and a distortion bound ε, what is the shortest prompt P' such that the model's outputs under P' stay within distortion ε of its outputs under P?
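One way to state this formally (notation is an assumption, not from the task: p(·|P) is the model's output distribution given P, |P'| the token length of P', D the chosen divergence):

\min_{P'} \; |P'| \quad \text{s.t.} \quad D\bigl(p(\cdot \mid P) \,\big\|\, p(\cdot \mid P')\bigr) \le \varepsilon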

Prior Art

  • 'Fundamental Limits of Prompt Compression' (NeurIPS 2024) - theoretical framework for black-box LLMs
  • Rate-distortion theory (Shannon) - information-theoretic foundation
  • Sufficient statistics - minimal representations preserving inference

Key Concepts

1. Rate: bits/tokens needed to represent the prompt
2. Distortion: divergence between outputs (KL, embedding distance, behavioral; a KL sketch follows below)
3. Rate-distortion curve: the fundamental tradeoff between the two
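A minimal Haskell sketch of the first distortion option, assuming output distributions arrive as token-to-probability maps (the smoothing floor is an assumption to keep KL finite):

import qualified Data.Map.Strict as Map

-- KL divergence D(p || q) between two output distributions,
-- restricted to the support of p. The tiny floor on q avoids log 0
-- when the compressed prompt drops a token the original produces.
klDivergence :: Map.Map String Double -> Map.Map String Double -> Double
klDivergence p q = sum
  [ px * log (px / qx)
  | (tok, px) <- Map.toList p
  , px > 0
  , let qx = max 1e-12 (Map.findWithDefault 0 tok q)
  ]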

Research Questions

1. What's the rate-distortion function for LLM prompts?
2. Can we compute sufficient statistics for a prompt? (formalized below)
3. Is there a polynomial-time algorithm for near-optimal compression?
4. How does compression interact with factorization?
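Question 2 made precise (T and the notation are assumptions): a statistic T is sufficient for P with respect to the output Y when

p(Y \mid P) = p\bigl(Y \mid T(P)\bigr) \quad \text{for all } Y

i.e. conditioning on T(P) alone loses nothing about the output distribution; the shortest such T(P) is the compression floor for that prompt.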

Approach

1. Study the NeurIPS paper deeply
2. Formalize distortion metrics for agent behavior, not just text similarity (see the sketch after this list)
3. Derive bounds for specific prompt classes
4. Develop practical algorithms that approach the bounds
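For step 2, one candidate metric, sketched under assumptions (the trace type and the max-length normalization are mine, not the task's): compare the action traces an agent emits under the original and compressed prompts by normalized edit distance, so 0 means identical behavior and 1 means entirely different.

-- Normalized Levenshtein distance between two agent action traces.
behavioralDistortion :: Eq a => [a] -> [a] -> Double
behavioralDistortion xs ys
  | null xs && null ys = 0
  | otherwise =
      fromIntegral (editDistance xs ys)
        / fromIntegral (max (length xs) (length ys))

-- Standard one-row Levenshtein dynamic program.
editDistance :: Eq a => [a] -> [a] -> Int
editDistance xs ys = last (foldl step [0 .. length xs] ys)
  where
    step row y = scanl calc (head row + 1) (zip3 xs row (tail row))
      where
        calc acc (x, dDiag, dUp) =
          minimum [acc + 1, dUp + 1, dDiag + if x == y then 0 else 1]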

Validation

  • Prove theoretical bounds
  • Empirically measure achieved rates vs bounds
  • Benchmark against heuristic compression

Notes

Heavy theory; may need collaboration with information-theory researchers. But success here would enable provably optimal prompt caching/compression.

Timeline (2)

💬 [human] · 4 weeks ago

Connection to Prompt IR (from t-477 design session)

The Prompt IR design includes explicit support for rate-distortion optimization:

Per-section compression metadata:

data Section = Section
  { ...
  , secTokens :: Int           -- Current token count (rate)
  , secMinTokens :: Maybe Int  -- Minimum viable tokens (compression floor)
  , secPriority :: Priority    -- For compression ordering
  , secRelevance :: Maybe Float -- Task-specific relevance (distortion proxy)
  ...
  }
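Priority and the priorityWeight/recencyWeight helpers used by the ranking code below aren't spelled out in the excerpt; a hypothetical filling-in for concreteness (the levels, weights, and age-in-turns recency model are invented):

data Priority = Critical | High | Normal | Low
  deriving (Eq, Ord, Show)

-- Invented weights: Critical sections are effectively never compressed.
priorityWeight :: Priority -> Float
priorityWeight Critical = 1000
priorityWeight High     = 10
priorityWeight Normal   = 1
priorityWeight Low      = 0.1

-- Invented recency decay over a section's age in conversation turns.
recencyWeight :: Int -> Float
recencyWeight age = 1 / (1 + fromIntegral age)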

Compression operation:

compress :: TokenBudget -> PromptIR -> IO PromptIR
compress budget ir = do
  let currentTokens = pmTotalTokens (pirMeta ir)
      targetTokens = floor (fromIntegral (tbTotal budget) * (1 - tbReserveRatio budget))
  
  if currentTokens <= targetTokens
    then pure ir
    else do
      -- Sort sections by compression priority
      let ranked = rankForCompression (pirSections ir)
      -- Drop/summarize lowest-value sections until under budget
      compressed <- compressSections targetTokens ranked
      pure ir { pirSections = compressed }

-- Compression score: higher = keep, lower = compress first.
-- Sorting ascending puts the compress-first candidates at the head.
rankForCompression :: [Section] -> [(Section, Float)]
rankForCompression = sortOn snd . map (\s -> (s, compressionScore s))  -- sortOn: Data.List
  where
    compressionScore s =
      priorityWeight (secPriority s)
        * fromMaybe 0.5 (secRelevance s)   -- fromMaybe: Data.Maybe
        * recencyWeight (secRecency s)
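compressSections is called above but not shown; a minimal drop-only sketch (a fuller version would first summarize a section down to its secMinTokens floor and drop it only as a last resort):

compressSections :: Int -> [(Section, Float)] -> IO [Section]
compressSections target ranked = pure (go total ranked)
  where
    total = sum (map (secTokens . fst) ranked)
    -- The ranked list arrives lowest-score first, so drop from the front.
    go t ((s, _) : rest)
      | t > target = go (t - secTokens s) rest
    go _ rest = map fst rest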

Research hooks:

  • secMinTokens = the sufficient-statistic floor (minimum representation preserving inference)
  • secRelevance = proxy for a section's contribution to distortion
  • pmCompressionRatio tracks original/compressed size for rate measurement
  • Can empirically measure the rate-distortion curve by varying targetTokens (harness sketched below)

Key insight: The IR makes compression decisions explicit and traceable. We can log which sections were compressed/dropped and correlate with behavioral changes to build empirical rate-distortion curves.
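A sketch of that measurement loop, assuming a TokenBudget built from exactly the two fields the compress code uses and a caller-supplied distortion probe (both assumptions):

import Control.Monad (forM)

-- Sweep budgets, compress, and record (achieved rate, distortion) pairs
-- to trace an empirical rate-distortion curve.
rateDistortionCurve
  :: PromptIR                             -- original prompt
  -> [Int]                                -- token budgets to sweep
  -> (PromptIR -> PromptIR -> IO Double)  -- behavioral distortion probe
  -> IO [(Int, Double)]
rateDistortionCurve ir budgets distortion =
  forM budgets $ \b -> do
    ir' <- compress (TokenBudget { tbTotal = b, tbReserveRatio = 0 }) ir
    d   <- distortion ir ir'
    pure (pmTotalTokens (pirMeta ir'), d)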