Research: auto-detection for best-of-N parallel execution

t-455 · WorkTask
Created 1 month ago · Updated 4 weeks ago

Description


Research: automatic detection of operations that would benefit from best-of-N parallel execution.

Problem

Some agent operations are inherently uncertain: running them multiple times and picking the best result improves reliability. But agents don't naturally ask for this.

Idea

The runtime could automatically detect "risky" operations and transparently run them N times, returning the best result. The agent's program doesn't change.

Research Questions

1. Detection heuristics - How do we know an operation is risky?

  • Size of code change (>50 lines?)
  • Complexity metrics (cyclomatic, AST depth?)
  • Historical failure rate for similar operations?
  • Semantic analysis of the task description?
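As a starting point, the first three heuristics above could be folded into one weighted score. A minimal sketch; the `OpStats` record, the weights, and the saturation points (50 lines, cyclomatic 10) are illustrative assumptions, not measured values:

```haskell
-- Hypothetical per-operation stats; field names are illustrative only.
data OpStats = OpStats
  { linesChanged    :: Int
  , cyclomatic      :: Int
  , histFailureRate :: Double  -- observed failure rate for similar ops, in [0,1]
  }

-- Weighted combination of the heuristics above, clamped to [0,1].
-- Each signal saturates so one huge value can't dominate alone.
riskScore :: OpStats -> Double
riskScore s = clamp01 $
    0.4 * min 1 (fromIntegral (linesChanged s) / 50)  -- >50 lines saturates
  + 0.3 * min 1 (fromIntegral (cyclomatic s) / 10)
  + 0.3 * histFailureRate s
  where clamp01 = max 0 . min 1
```

The semantic-analysis heuristic is deliberately left out: it needs a model call, so it belongs behind a separate (cached) estimator rather than in a cheap synchronous score.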

2. Selection criteria - How do we pick the "best" result?

  • Compiles successfully
  • Tests pass
  • Smallest diff
  • Human preference model?
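The first three criteria compose naturally as a lexicographic ranking. A sketch, assuming a hypothetical `Candidate` record (none of these field names exist in the codebase yet):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical per-run result; fields are illustrative assumptions.
data Candidate = Candidate
  { candCompiles  :: Bool
  , candTestsPass :: Bool
  , candDiffLines :: Int
  , candOutput    :: String
  }

-- Lexicographic ranking: compiling beats not compiling, passing tests
-- beats failing, and among equals the smallest diff wins.
-- Partial: assumes at least one candidate.
selectBest :: [Candidate] -> Candidate
selectBest = maximumBy (comparing rank)
  where
    rank c = (candCompiles c, candTestsPass c, negate (candDiffLines c))
```

Haskell's tuple `Ord` instance gives the tiering for free; a learned preference model would slot in as a final tie-breaker after the hard checks.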

3. Prior art - Is there established research?

  • Best-of-N sampling in RLHF (used for reward model training)
  • Speculative decoding (different context)
  • Self-consistency in chain-of-thought
  • Tree of Thoughts / Graph of Thoughts

4. Cost model - When is it worth the extra compute?

  • If P(failure) > X and cost(retry) > cost(parallel), do it
  • Need to estimate failure probability
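The break-even point can be made concrete. With per-attempt cost c and failure probability p, sequential retry-until-success costs c/(1-p) in expectation (geometric distribution), while best-of-N spends N·c up front and fails only with probability p^N. A sketch; `worthBestOfN` is a hypothetical helper, not existing code:

```haskell
-- Expected compute of retry-until-success: c / (1 - p) (geometric distribution).
expectedRetryCost :: Double -> Double -> Double
expectedRetryCost p c = c / (1 - p)

-- Best-of-N spends n*c up front; worthwhile when that is no more than the
-- expected sequential retry cost. Compute only -- latency, which parallel
-- execution improves, is not modeled here.
worthBestOfN :: Double -> Double -> Int -> Bool
worthBestOfN p c n = fromIntegral n * c <= expectedRetryCost p c
```

For example, at p = 0.5 sequential retries cost 2c in expectation, so N = 2 breaks even on compute while also cutting tail latency. The hard part remains estimating p per operation.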

Where This Fits

In Op.hs, the infer primitive could be wrapped with a reliability layer that does transparent best-of-N. The Sequential/Parallel interpreter would handle this.

Output

  • Literature review of relevant techniques
  • Proposed detection heuristics
  • Cost/benefit analysis
  • Prototype implementation plan

Timeline (2)

💬 [human] · 4 weeks ago

Connection to Prompt IR (from t-477 design session)

The Prompt IR design includes a hook for risk/uncertainty estimation:

data PromptMeta = PromptMeta
  { ...
  , pmEstimatedEntropy :: Maybe Float  -- Predicted output uncertainty
  ...
  }

How this enables best-of-N detection:

-- Should we run this inference multiple times?
shouldBestOfN :: PromptIR -> Int -> IO (Maybe Int)
shouldBestOfN ir defaultN = do
  entropy <- estimateEntropy ir
  -- High entropy = uncertain output = worth running N times
  if entropy > entropyThreshold
    then pure (Just defaultN)
    else pure Nothing

-- Estimate entropy from prompt structure
estimateEntropy :: PromptIR -> IO Float
estimateEntropy ir = do
  -- Heuristics that suggest high output variance:
  let factors = 
        [ -- Many additive sections = less constrained
          additiveRatio (pirSections ir) * 0.3
        , -- Low relevance scores = context may not help
          (1 - avgRelevance (pirSections ir)) * 0.3
        , -- High token count = more degrees of freedom
          tokenCountFactor (pmTotalTokens (pirMeta ir)) * 0.2
        , -- Few constraints = more possible outputs
          (1 - constraintRatio (pirSections ir)) * 0.2
        ]
  pure (sum factors)

Detection heuristics from the IR:

| Signal | IR Field | Interpretation |
|--------|----------|----------------|
| Many Additive sections | secCompositionMode | Less constrained, higher variance |
| Low relevance scores | secRelevance | Context may not ground the output |
| High token count | pmTotalTokens | More degrees of freedom |
| Few Constraint sections | secCompositionMode | Fewer hard requirements |
| No Hierarchical sections | secCompositionMode | No strong prior/anchor |

Integration point:

-- In the interpreter, wrap infer with best-of-N when estimated entropy is high.
-- Note: the entropy must be computed before it is stored, and shouldBestOfN
-- runs in IO, so its result is bound before the case.
interpret (Free (Op.Infer model contextReq k)) = do
  ir <- hydrateContext contextReq
  entropy <- estimateEntropy ir
  let ir' = ir { pirMeta = (pirMeta ir) { pmEstimatedEntropy = Just entropy } }

  mN <- shouldBestOfN ir' 3
  case mN of
    Just n -> do
      results <- replicateM n (callLLM model (compile ir'))
      best <- selectBest results
      k best
    Nothing -> do
      result <- callLLM model (compile ir')
      k result