Evals Framework

t-250 · WorkTask
Created 2 months ago · Updated 1 month ago

Description

Create Omni/Agent/Eval.hs module for running agent evaluations to prevent regression.

Context

When we change prompts, tools, or models, we need to verify the agent still performs correctly. Evals provide automated, repeatable testing of agent behavior.

Current State

  • No eval framework exists
  • Agent testing is manual
  • Engine.hs has unit tests but no integration/behavioral tests

Requirements

1. Define core types:

data EvalCase = EvalCase
  { evalId :: Text
  , evalName :: Text
  , evalDescription :: Text
  , evalPrompt :: Text                    -- User prompt to send
  , evalTools :: [Engine.Tool]            -- Tools available
  , evalExpectedBehavior :: ExpectedBehavior
  , evalTimeout :: Maybe Int              -- Seconds, default 300
  }

data ExpectedBehavior
  = ContainsText Text                     -- Output contains this text
  | MatchesRegex Text                     -- Output matches regex
  | FileCreated FilePath                  -- This file was created
  | FileContains FilePath Text            -- File contains text
  | ExitSuccess                           -- Agent completed without error
  | CustomCheck (AgentResult -> IO Bool)  -- Custom validation function

data EvalResult = EvalResult
  { evalResultId :: Text
  , evalResultPassed :: Bool
  , evalResultDuration :: Double          -- Seconds
  , evalResultCost :: Double              -- Cents
  , evalResultOutput :: Text              -- Final agent message
  , evalResultError :: Maybe Text         -- Error if failed
  }

2. Implement eval runner:

runEval :: EngineConfig -> EvalCase -> IO EvalResult
runEvalSuite :: EngineConfig -> [EvalCase] -> IO [EvalResult]

-- Pretty print results
printEvalResults :: [EvalResult] -> IO ()
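The core of `runEval` is deciding pass/fail from an `ExpectedBehavior`. A minimal standalone sketch of that check, using a stub `AgentResult` (the real type lives in Engine.hs, so its fields here are assumptions) and repeating `ExpectedBehavior` so the sketch compiles on its own; `MatchesRegex` is left as a placeholder since a real implementation would pull in a regex package such as regex-tdfa:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (doesFileExist)

-- Stand-in for the real AgentResult from Engine.hs (field names are assumptions).
data AgentResult = AgentResult
  { agentOutput :: T.Text   -- final agent message
  , agentErrored :: Bool    -- did the agent fail?
  }

-- Repeated from the type definitions above so this sketch is self-contained.
data ExpectedBehavior
  = ContainsText T.Text
  | MatchesRegex T.Text
  | FileCreated FilePath
  | FileContains FilePath T.Text
  | ExitSuccess
  | CustomCheck (AgentResult -> IO Bool)

-- Decide pass/fail for one eval case.
checkBehavior :: ExpectedBehavior -> AgentResult -> IO Bool
checkBehavior behavior result = case behavior of
  ContainsText t    -> pure (t `T.isInfixOf` agentOutput result)
  MatchesRegex _    -> pure False  -- placeholder: real impl would use regex-tdfa
  FileCreated fp    -> doesFileExist fp
  FileContains fp t -> do
    exists <- doesFileExist fp
    if exists
      then (t `T.isInfixOf`) <$> TIO.readFile fp
      else pure False
  ExitSuccess       -> pure (not (agentErrored result))
  CustomCheck f     -> f result
```

`runEval` would then wrap this with a timer, the timeout, and cost accounting from the engine run.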

3. Create initial eval cases for Jr/coder:

coderEvalSuite :: [EvalCase]
coderEvalSuite =
  [ EvalCase
      { evalId = "create-file"
      , evalName = "Create a simple file"
      , evalDescription = "Agent creates a file with the requested contents"
      , evalPrompt = "Create a file at /tmp/eval-test.txt containing 'hello world'"
      , evalTools = coderTools
      , evalExpectedBehavior = FileContains "/tmp/eval-test.txt" "hello world"
      , evalTimeout = Just 60
      }
  , EvalCase
      { evalId = "edit-file"
      , evalName = "Edit existing file"
      , evalDescription = "Agent edits an existing file in place"
      , evalPrompt = "Change 'hello' to 'goodbye' in /tmp/eval-test.txt"
      , evalTools = coderTools
      , evalExpectedBehavior = FileContains "/tmp/eval-test.txt" "goodbye"
      , evalTimeout = Just 60
      }
  , EvalCase
      { evalId = "search-codebase"
      , evalName = "Search and report"
      , evalDescription = "Agent searches the codebase and answers a counting question"
      , evalPrompt = "How many Haskell files are in Omni/Agent/?"
      , evalTools = coderTools
      , evalExpectedBehavior = ContainsText "8"  -- or whatever the count is
      , evalTimeout = Just 120
      }
  ]

4. Add CLI command:

In Omni/Jr.hs, add:

jr eval [--suite=NAME] [--case=ID]    Run agent evaluations
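One way to handle the `--suite=NAME` / `--case=ID` flags is a small helper over the raw argument list; this is a sketch and a stand-in for whatever option parsing Omni/Jr.hs already uses (the function name is hypothetical):

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe, mapMaybe)

-- Extract the value of a "--name=value" flag from the argument list,
-- e.g. flagValue "suite" ["eval", "--suite=coder"] == Just "coder".
flagValue :: String -> [String] -> Maybe String
flagValue name = listToMaybe . mapMaybe (stripPrefix ("--" ++ name ++ "="))
```

The eval command would then run the whole suite when `--case` is absent, or filter the suite down to the matching `evalId` when it is given.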

5. Scoring and reporting:

  • Track pass/fail rate
  • Track cost per eval
  • Track duration
  • Output summary table
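The summary line that `printEvalResults` would end with can be sketched as a pure function over results; the `EvalResult` here is a reduced stand-in for the type defined above (Text fields dropped so the sketch stays minimal):

```haskell
import Text.Printf (printf)

-- Reduced stand-in for the EvalResult defined earlier.
data EvalResult = EvalResult
  { evalResultId :: String
  , evalResultPassed :: Bool
  , evalResultDuration :: Double  -- seconds
  , evalResultCost :: Double      -- cents
  }

-- One-line suite summary: pass rate, total cost, total wall time.
summarize :: [EvalResult] -> String
summarize rs = printf "%d/%d passed (%.0f%%), %.1f cents, %.1fs"
    passed total pct cost dur
  where
    total  = length rs
    passed = length (filter evalResultPassed rs)
    pct :: Double
    pct    = if total == 0 then 0
             else 100 * fromIntegral passed / fromIntegral total
    cost   = sum (map evalResultCost rs)
    dur    = sum (map evalResultDuration rs)
```

A per-case table (id, pass/fail, duration, cost) would precede this line in the printed output.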

Files to Create

  • Omni/Agent/Eval.hs - main eval module with types and runner
  • Omni/Agent/Eval/Coder.hs - coder-specific eval cases

Files to Modify

  • Omni/Jr.hs - add 'jr eval' command

Dependencies

  • Depends on t-247 (Provider) for running agents

Testing

  • bild --test Omni/Agent/Eval.hs
  • Run: jr eval --suite=coder
  • Verify pass/fail detection works

Notes

  • Evals should run in isolated temp directories when possible
  • Consider cost limits to prevent runaway evals
  • Start with simple cases, expand over time
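The temp-directory isolation could be a bracket around each eval run; a sketch using only GHC boot libraries (naming the scratch directory after the eval id is an assumption, and a real version might prefer the `temporary` package for race-free unique names):

```haskell
import Control.Exception (bracket)
import System.Directory
  ( createDirectory, getTemporaryDirectory
  , removePathForcibly, withCurrentDirectory )
import System.FilePath ((</>))

-- Run an eval's action inside a fresh scratch directory and clean up
-- afterwards, so cases like "create-file" can't collide across runs.
withIsolatedDir :: String -> (FilePath -> IO a) -> IO a
withIsolatedDir evalId action = do
  tmp <- getTemporaryDirectory
  let dir = tmp </> ("eval-" ++ evalId)
  bracket
    (removePathForcibly dir >> createDirectory dir >> pure dir)  -- fresh dir
    removePathForcibly                                           -- always clean up
    (\d -> withCurrentDirectory d (action d))
```

A cost limit could live alongside this: abort the bracket's action once the engine reports accumulated cost above a per-eval ceiling.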

Timeline (1)

🔄 [human] Open → Done (1 month ago)