Research: static prompt analysis using open model internals

t-397 · WorkTask
Created 1 month ago · Updated 4 weeks ago

Description


Explore static analysis of prompts by examining open-source model internals (embeddings, attention patterns, activations).

Motivation

If we can analyze what a prompt 'will do' without running full inference, we can:

  • Reject unsafe prompts before execution
  • Predict resource usage
  • Identify prompt equivalences

Prior Art

  • 'Bayesian Geometry of Transformer Attention' (arXiv 2512.22471) - attention realizes Bayesian inference geometrically
  • ProSA - prompt sensitivity analysis (but requires running the model)
  • Mechanistic interpretability work (Anthropic, EleutherAI)

Research Questions

1. Can we predict posterior entropy from prompt embeddings alone?
2. Can we detect 'will this prompt use tools' from early-layer activations?
3. Can we identify equivalent prompts via embedding geometry?
4. What's the manifold structure of prompt space?

Approach

1. Use open models (LLaMA, Mistral) whose internals we can access
2. Build a dataset: prompts paired with their behavioral outcomes
3. Train a lightweight probe/classifier on internal representations
4. See what's predictable without a full forward pass
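Step 3 above can be sketched concretely. The probe below is plain logistic regression trained by gradient descent over toy activation vectors; every name here (Activation, trainProbe, the "uses tools" label) is hypothetical, standing in for whatever classifier and labels the real experiments would use.

```haskell
module Main where

type Activation = [Double]  -- one layer's activation vector for a prompt
type Label      = Double    -- 1.0 = "uses tools", 0.0 = "does not"

dot :: [Double] -> [Double] -> Double
dot xs ys = sum (zipWith (*) xs ys)

sigmoid :: Double -> Double
sigmoid z = 1 / (1 + exp (negate z))

-- Probability the probe assigns to the positive class.
predict :: [Double] -> Activation -> Double
predict w x = sigmoid (dot w x)

-- One epoch of stochastic gradient descent over the dataset.
step :: Double -> [(Activation, Label)] -> [Double] -> [Double]
step lr examples w0 = foldl update w0 examples
  where
    update w (x, y) =
      let err = predict w x - y  -- gradient of the log-loss wrt the logit
      in zipWith (\wi xi -> wi - lr * err * xi) w x

trainProbe :: Double -> Int -> [(Activation, Label)] -> [Double]
trainProbe lr epochs examples = iterate (step lr examples) w0 !! epochs
  where
    w0  = replicate dim 0
    dim = length (fst (head examples))

main :: IO ()
main = do
  -- Toy dataset: two linearly separable clusters of "activations".
  let pos = [([1.0, 0.9], 1), ([0.9, 1.1], 1)]
      neg = [([-1.0, -0.8], 0), ([-1.1, -0.9], 0)]
      w   = trainProbe 0.5 200 (pos ++ neg)
  print (predict w [1.0, 1.0] > 0.5)    -- expected: True
  print (predict w [-1.0, -1.0] < 0.5)  -- expected: True
```

The point of keeping the probe this small is the research question itself: if a linear map over early-layer activations predicts tool use, the signal is cheaply extractable without a full forward pass.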

Open Models to Consider

  • LLaMA 3 (open weights, good capability)
  • Mistral (efficient, open)
  • Phi (small, fast iteration)

Notes

This is genuine research, not engineering, and it may not succeed. But if it works, it's the foundation for a principled Analyze operation.

Timeline (2)

💬 [human] · 4 weeks ago

Connection to Prompt IR (from t-477 design session)

The Prompt IR design includes hooks for static analysis via embeddings:

Per-section embeddings:

data Section = Section
  { ...
  , secEmbedding :: Maybe (Vector Float)  -- Precomputed embedding
  , secHash :: Maybe Text                 -- Content hash for caching
  ...
  }

Aggregate metadata:

data PromptMeta = PromptMeta
  { ...
  , pmEstimatedEntropy :: Maybe Float  -- Predicted output uncertainty
  , pmCacheHit :: Bool                 -- Was this IR cached?
  }

Analysis operations:

-- Estimate behavioral impact of a section (for pruning decisions)
estimateImpact :: Section -> IO Float
estimateImpact sec = case secEmbedding sec of
  Just emb -> pure (embeddingMagnitude emb)  -- Proxy for information content
  Nothing  -> embeddingMagnitude <$> computeEmbedding (secContent sec)

-- Check if two IRs are behaviorally equivalent (within tolerance)
equivalent :: Float -> PromptIR -> PromptIR -> IO Bool
equivalent tolerance a b = do
  embA <- computeIREmbedding a
  embB <- computeIREmbedding b
  pure (cosineSimilarity embA embB > (1 - tolerance))

-- Predict output entropy from prompt geometry
estimateEntropy :: PromptIR -> IO Float
estimateEntropy ir = do
  -- Hypothesis: prompts with higher embedding variance → higher output entropy
  embeddings <- mapM (computeEmbedding . secContent) (pirSections ir)
  pure (embeddingVariance embeddings)
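The snippets above assume helpers like embeddingMagnitude, cosineSimilarity, and embeddingVariance. A minimal self-contained sketch of those helpers follows, using plain lists in place of Vector Float; this is an illustration of the intended geometry, not the real implementation:

```haskell
module Main where

-- Euclidean norm; the proxy estimateImpact uses for information content.
embeddingMagnitude :: [Float] -> Float
embeddingMagnitude v = sqrt (sum (map (^ 2) v))

-- Cosine of the angle between two embeddings; the metric behind `equivalent`.
cosineSimilarity :: [Float] -> [Float] -> Float
cosineSimilarity a b =
  sum (zipWith (*) a b) / (embeddingMagnitude a * embeddingMagnitude b)

-- Mean squared distance of each embedding from the centroid: the "variance"
-- estimateEntropy uses as its output-uncertainty proxy.
embeddingVariance :: [[Float]] -> Float
embeddingVariance [] = 0
embeddingVariance es = mean (map distSq es)
  where
    dim      = length (head es)
    n        = fromIntegral (length es)
    centroid = [ sum (map (!! i) es) / n | i <- [0 .. dim - 1] ]
    distSq e = sum (map (^ 2) (zipWith (-) e centroid))
    mean xs  = sum xs / fromIntegral (length xs)

main :: IO ()
main = do
  print (cosineSimilarity [1, 0] [1, 0])      -- identical direction: 1.0
  print (embeddingVariance [[1, 1], [1, 1]])  -- identical embeddings: 0.0
```

Note that `equivalent tolerance a b` then reads directly as "the angle between the two IR embeddings is small": cosineSimilarity near 1 means the prompts point the same way in embedding space, whether or not they share surface text.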

Research hooks:

  • secEmbedding enables per-section analysis without full inference
  • equivalent can identify prompt equivalences for caching
  • estimateEntropy could predict "risky" prompts (high uncertainty → run best-of-N)
  • secHash enables content-addressable caching of analysis results
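One way the secHash hook could back content-addressable caching: a pure map from content hash to a previously computed analysis score, so repeated sections never pay for analysis twice. The names here (AnalysisCache, lookupOrAnalyze) are hypothetical and the hash is a stub; a real design would use a cryptographic digest.

```haskell
module Main where

import qualified Data.Map.Strict as Map

type Hash          = String
type AnalysisCache = Map.Map Hash Float

-- Stub content hash (sum of code points); stands in for a real digest.
hashContent :: String -> Hash
hashContent = show . sum . map fromEnum

-- Return the cached score if this content was analyzed before;
-- otherwise run the (expensive) analysis and record the result.
lookupOrAnalyze :: (String -> Float) -> String -> AnalysisCache
                -> (Float, AnalysisCache)
lookupOrAnalyze analyze content cache =
  case Map.lookup h cache of
    Just score -> (score, cache)
    Nothing    -> let score = analyze content
                  in (score, Map.insert h score cache)
  where
    h = hashContent content

main :: IO ()
main = do
  let analyze  = fromIntegral . length  -- stand-in for estimateImpact
      (s1, c1) = lookupOrAnalyze analyze "summarize the report" Map.empty
      (s2, c2) = lookupOrAnalyze analyze "summarize the report" c1
  print (s1 == s2)          -- same content, same score
  print (Map.size c2 == 1)  -- second call hit the cache, no new entry
```

Because the cache is keyed by content rather than by section identity, the same boilerplate section shared across many prompts resolves to a single cached analysis.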

Open questions for this research:

1. What embedding model works best? (task-specific vs. general)
2. Is embedding magnitude actually correlated with information content?
3. Can we predict tool usage from section embeddings alone?
4. What's the manifold structure of the section embedding space?