Research: auto-detection for best-of-N parallel execution

t-455 · WorkTask
Created 1 month ago · Updated 4 weeks ago

Description


Research: automatic detection of operations that would benefit from best-of-N parallel execution.

Problem

Some agent operations are inherently uncertain: running them multiple times and picking the best result improves reliability. But agents don't naturally ask for this.

Idea

The runtime could automatically detect "risky" operations and transparently run them N times, returning the best result. The agent's program doesn't change.

Research Questions

1. Detection heuristics - How do we know an operation is risky?

  • Size of code change (>50 lines?)
  • Complexity metrics (cyclomatic, AST depth?)
  • Historical failure rate for similar operations?
  • Semantic analysis of the task description?
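As a starting point, the first three heuristics above could be folded into one weighted score. A minimal sketch; the `OpStats` record, the weights, and the saturation points (50 lines, cyclomatic 10) are illustrative assumptions, not measured values:

```haskell
-- Hypothetical per-operation stats; field names are illustrative only.
data OpStats = OpStats
  { linesChanged    :: Int
  , cyclomatic      :: Int
  , histFailureRate :: Double  -- observed failure rate for similar ops, in [0,1]
  }

-- Weighted combination of the heuristics above, clamped to [0,1].
-- Each signal saturates so one huge value can't dominate alone.
riskScore :: OpStats -> Double
riskScore s = clamp01 $
    0.4 * min 1 (fromIntegral (linesChanged s) / 50)  -- >50 lines saturates
  + 0.3 * min 1 (fromIntegral (cyclomatic s) / 10)
  + 0.3 * histFailureRate s
  where clamp01 = max 0 . min 1
```

The semantic-analysis heuristic is deliberately left out: it needs a model call, so it belongs behind a separate (cached) estimator rather than in a cheap synchronous score.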

2. Selection criteria - How do we pick the "best" result?

  • Compiles successfully
  • Tests pass
  • Smallest diff
  • Human preference model?
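The first three criteria compose naturally as a lexicographic ranking. A sketch, assuming a hypothetical `Candidate` record (none of these field names exist in the codebase yet):

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical per-run result; fields are illustrative assumptions.
data Candidate = Candidate
  { candCompiles  :: Bool
  , candTestsPass :: Bool
  , candDiffLines :: Int
  , candOutput    :: String
  }

-- Lexicographic ranking: compiling beats not compiling, passing tests
-- beats failing, and among equals the smallest diff wins.
-- Partial: assumes at least one candidate.
selectBest :: [Candidate] -> Candidate
selectBest = maximumBy (comparing rank)
  where
    rank c = (candCompiles c, candTestsPass c, negate (candDiffLines c))
```

Haskell's tuple `Ord` instance gives the tiering for free; a learned preference model would slot in as a final tie-breaker after the hard checks.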

3. Prior art - Is there established research?

  • Best-of-N sampling in RLHF (used for reward model training)
  • Speculative decoding (different context)
  • Self-consistency in chain-of-thought
  • Tree of Thoughts / Graph of Thoughts

4. Cost model - When is it worth the extra compute?

  • If P(failure) > X and cost(retry) > cost(parallel), do it
  • Need to estimate failure probability
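The break-even point can be made concrete. With per-attempt cost c and failure probability p, sequential retry-until-success costs c/(1-p) in expectation (geometric distribution), while best-of-N spends N·c up front and fails only with probability p^N. A sketch; `worthBestOfN` is a hypothetical helper, not existing code:

```haskell
-- Expected compute of retry-until-success: c / (1 - p) (geometric distribution).
expectedRetryCost :: Double -> Double -> Double
expectedRetryCost p c = c / (1 - p)

-- Best-of-N spends n*c up front; worthwhile when that is no more than the
-- expected sequential retry cost. Compute only -- latency, which parallel
-- execution improves, is not modeled here.
worthBestOfN :: Double -> Double -> Int -> Bool
worthBestOfN p c n = fromIntegral n * c <= expectedRetryCost p c
```

For example, at p = 0.5 sequential retries cost 2c in expectation, so N = 2 breaks even on compute while also cutting tail latency. The hard part remains estimating p per operation.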

Where This Fits

In Op.hs, the infer primitive could be wrapped with a reliability layer that does transparent best-of-N. The Sequential/Parallel interpreter would handle this.

Output

  • Literature review of relevant techniques
  • Proposed detection heuristics
  • Cost/benefit analysis
  • Prototype implementation plan

Timeline (2)

💬 [human] · 4 weeks ago

Connection to Prompt IR (from t-477 design session)

The Prompt IR design includes a hook for risk/uncertainty estimation:

data PromptMeta = PromptMeta
  { ...
  , pmEstimatedEntropy :: Maybe Float  -- Predicted output uncertainty
  ...
  }

How this enables best-of-N detection:

-- Should we run this inference multiple times?
shouldBestOfN :: PromptIR -> Int -> IO (Maybe Int)
shouldBestOfN ir defaultN = do
  entropy <- estimateEntropy ir
  -- High entropy = uncertain output = worth running N times
  if entropy > entropyThreshold
    then pure (Just defaultN)
    else pure Nothing

-- Estimate entropy from prompt structure
estimateEntropy :: PromptIR -> IO Float
estimateEntropy ir = do
  -- Heuristics that suggest high output variance:
  let factors = 
        [ -- Many additive sections = less constrained
          additiveRatio (pirSections ir) * 0.3
        , -- Low relevance scores = context may not help
          (1 - avgRelevance (pirSections ir)) * 0.3
        , -- High token count = more degrees of freedom
          tokenCountFactor (pmTotalTokens (pirMeta ir)) * 0.2
        , -- Few constraints = more possible outputs
          (1 - constraintRatio (pirSections ir)) * 0.2
        ]
  pure (sum factors)

Detection heuristics from the IR:

| Signal | IR Field | Interpretation |
|--------|----------|----------------|
| Many Additive sections | secCompositionMode | Less constrained, higher variance |
| Low relevance scores | secRelevance | Context may not ground the output |
| High token count | pmTotalTokens | More degrees of freedom |
| Few Constraint sections | secCompositionMode | Fewer hard requirements |
| No Hierarchical sections | secCompositionMode | No strong prior/anchor |

Integration point:

-- In the interpreter, wrap infer with best-of-N when estimated entropy is high.
-- Note: the entropy must be computed before it is stored, and shouldBestOfN
-- runs in IO, so its result is bound before the case.
interpret (Free (Op.Infer model contextReq k)) = do
  ir <- hydrateContext contextReq
  entropy <- estimateEntropy ir
  let ir' = ir { pirMeta = (pirMeta ir) { pmEstimatedEntropy = Just entropy } }

  mN <- shouldBestOfN ir' 3
  case mN of
    Just n -> do
      results <- replicateM n (callLLM model (compile ir'))
      best <- selectBest results
      k best
    Nothing -> do
      result <- callLLM model (compile ir')
      k result