Evals Framework

t-250 · WorkTask
Created 2 months ago · Updated 1 month ago

Description

Create Omni/Agent/Eval.hs module for running agent evaluations to prevent regression.

Context

When we change prompts, tools, or models, we need to verify the agent still performs correctly. Evals provide automated, repeatable testing of agent behavior.

Current State

  • No eval framework exists
  • Agent testing is manual
  • Engine.hs has unit tests but no integration/behavioral tests

Requirements

1. Define core types:

data EvalCase = EvalCase
  { evalId :: Text
  , evalName :: Text
  , evalDescription :: Text
  , evalPrompt :: Text                    -- User prompt to send
  , evalTools :: [Engine.Tool]            -- Tools available
  , evalExpectedBehavior :: ExpectedBehavior
  , evalTimeout :: Maybe Int              -- Seconds, default 300
  }

data ExpectedBehavior
  = ContainsText Text                     -- Output contains this text
  | MatchesRegex Text                     -- Output matches regex
  | FileCreated FilePath                  -- This file was created
  | FileContains FilePath Text            -- File contains text
  | ExitSuccess                           -- Agent completed without error
  | CustomCheck (AgentResult -> IO Bool)  -- Custom validation function

data EvalResult = EvalResult
  { evalResultId :: Text
  , evalResultPassed :: Bool
  , evalResultDuration :: Double          -- Seconds
  , evalResultCost :: Double              -- Cents
  , evalResultOutput :: Text              -- Final agent message
  , evalResultError :: Maybe Text         -- Error if failed
  }

2. Implement eval runner:

runEval :: EngineConfig -> EvalCase -> IO EvalResult
runEvalSuite :: EngineConfig -> [EvalCase] -> IO [EvalResult]

-- Pretty print results
printEvalResults :: [EvalResult] -> IO ()
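The core of `runEval` is deciding pass/fail from an `ExpectedBehavior`. A minimal standalone sketch of that check, using a stub `AgentResult` (the real type lives in Engine.hs, so its fields here are assumptions) and repeating `ExpectedBehavior` so the sketch compiles on its own; `MatchesRegex` is left as a placeholder since a real implementation would pull in a regex package such as regex-tdfa:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (doesFileExist)

-- Stand-in for the real AgentResult from Engine.hs (field names are assumptions).
data AgentResult = AgentResult
  { agentOutput :: T.Text   -- final agent message
  , agentErrored :: Bool    -- did the agent fail?
  }

-- Repeated from the type definitions above so this sketch is self-contained.
data ExpectedBehavior
  = ContainsText T.Text
  | MatchesRegex T.Text
  | FileCreated FilePath
  | FileContains FilePath T.Text
  | ExitSuccess
  | CustomCheck (AgentResult -> IO Bool)

-- Decide pass/fail for one eval case.
checkBehavior :: ExpectedBehavior -> AgentResult -> IO Bool
checkBehavior behavior result = case behavior of
  ContainsText t    -> pure (t `T.isInfixOf` agentOutput result)
  MatchesRegex _    -> pure False  -- placeholder: real impl would use regex-tdfa
  FileCreated fp    -> doesFileExist fp
  FileContains fp t -> do
    exists <- doesFileExist fp
    if exists
      then (t `T.isInfixOf`) <$> TIO.readFile fp
      else pure False
  ExitSuccess       -> pure (not (agentErrored result))
  CustomCheck f     -> f result
```

`runEval` would then wrap this with a timer, the timeout, and cost accounting from the engine run.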

3. Create initial eval cases for Jr/coder:

coderEvalSuite :: [EvalCase]
coderEvalSuite =
  [ EvalCase
      { evalId = "create-file"
      , evalName = "Create a simple file"
      , evalDescription = "Agent creates a file with the requested contents"
      , evalPrompt = "Create a file at /tmp/eval-test.txt containing 'hello world'"
      , evalTools = coderTools
      , evalExpectedBehavior = FileContains "/tmp/eval-test.txt" "hello world"
      , evalTimeout = Just 60
      }
  , EvalCase
      { evalId = "edit-file"
      , evalName = "Edit existing file"
      , evalDescription = "Agent edits an existing file in place"
      , evalPrompt = "Change 'hello' to 'goodbye' in /tmp/eval-test.txt"
      , evalTools = coderTools
      , evalExpectedBehavior = FileContains "/tmp/eval-test.txt" "goodbye"
      , evalTimeout = Just 60
      }
  , EvalCase
      { evalId = "search-codebase"
      , evalName = "Search and report"
      , evalDescription = "Agent searches the codebase and answers a counting question"
      , evalPrompt = "How many Haskell files are in Omni/Agent/?"
      , evalTools = coderTools
      , evalExpectedBehavior = ContainsText "8"  -- or whatever the count is
      , evalTimeout = Just 120
      }
  ]

4. Add CLI command:

In Omni/Jr.hs, add:

jr eval [--suite=NAME] [--case=ID]    Run agent evaluations
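One way to handle the `--suite=NAME` / `--case=ID` flags is a small helper over the raw argument list; this is a sketch and a stand-in for whatever option parsing Omni/Jr.hs already uses (the function name is hypothetical):

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- Extract the value of a "--name=value" flag from the argument list,
-- e.g. flagValue "suite" ["eval", "--suite=coder"] == Just "coder".
flagValue :: String -> [String] -> Maybe String
flagValue name = listToMaybe . mapMaybe (stripPrefix ("--" ++ name ++ "="))
```

The eval command would then run the whole suite when `--case` is absent, or filter the suite down to the matching `evalId` when it is given.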

5. Scoring and reporting:

  • Track pass/fail rate
  • Track cost per eval
  • Track duration
  • Output summary table
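The summary line that `printEvalResults` would end with can be sketched as a pure function over results; the `EvalResult` here is a reduced stand-in for the type defined above (Text fields dropped so the sketch stays minimal):

```haskell
import Text.Printf (printf)

-- Reduced stand-in for the EvalResult defined earlier.
data EvalResult = EvalResult
  { evalResultId :: String
  , evalResultPassed :: Bool
  , evalResultDuration :: Double  -- seconds
  , evalResultCost :: Double      -- cents
  }

-- One-line suite summary: pass rate, total cost, total wall time.
summarize :: [EvalResult] -> String
summarize rs = printf "%d/%d passed (%.0f%%), %.1f cents, %.1fs"
    passed total pct cost dur
  where
    total  = length rs
    passed = length (filter evalResultPassed rs)
    pct :: Double
    pct    = if total == 0 then 0
             else 100 * fromIntegral passed / fromIntegral total
    cost   = sum (map evalResultCost rs)
    dur    = sum (map evalResultDuration rs)
```

A per-case table (id, pass/fail, duration, cost) would precede this line in the printed output.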

Files to Create

  • Omni/Agent/Eval.hs - main eval module with types and runner
  • Omni/Agent/Eval/Coder.hs - coder-specific eval cases

Files to Modify

  • Omni/Jr.hs - add 'jr eval' command

Dependencies

  • Depends on t-247 (Provider) for running agents

Testing

  • bild --test Omni/Agent/Eval.hs
  • Run: jr eval --suite=coder
  • Verify pass/fail detection works

Notes

  • Evals should run in isolated temp directories when possible
  • Consider cost limits to prevent runaway evals
  • Start with simple cases, expand over time
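The temp-directory isolation could be a bracket around each eval run; a sketch using only GHC boot libraries (naming the scratch directory after the eval id is an assumption, and a real version might prefer the `temporary` package for race-free unique names):

```haskell
import Control.Exception (bracket)
import System.Directory
  ( createDirectory, getTemporaryDirectory
  , removePathForcibly, withCurrentDirectory )
import System.FilePath ((</>))

-- Run an eval's action inside a fresh scratch directory and clean up
-- afterwards, so cases like "create-file" can't collide across runs.
withIsolatedDir :: String -> (FilePath -> IO a) -> IO a
withIsolatedDir evalId action = do
  tmp <- getTemporaryDirectory
  let dir = tmp </> ("eval-" ++ evalId)
  bracket
    (removePathForcibly dir >> createDirectory dir >> pure dir)  -- fresh dir
    removePathForcibly                                           -- always clean up
    (\d -> withCurrentDirectory d (action d))
```

A cost limit could live alongside this: abort the bracket's action once the engine reports accumulated cost above a per-eval ceiling.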

Timeline (1)

🔄 [human] Open → Done (1 month ago)