Run comprehensive benchmarks comparing all execution modes.
After completing the spikes, we need rigorous benchmarks to understand which execution mode wins on which class of task, and at what cost. The modes under comparison:
| Mode | Description |
|------|-------------|
| Single | 1 agent, sequential iterations |
| Parallel | N agents, no communication, CRDT merge |
| Swarm-STM | N agents, STM shared memory |
| Swarm-CRDT | N agents, CRDT eventual consistency |
The task matrix, with the mode we expect each task to favor:

| Task | Nature | Expected Best Mode |
|------|--------|--------------------|
| Batch compute | Embarrassingly parallel | Parallel |
| Optimization | Search with sharing | Swarm-STM |
| Constraint solving | Propagation | Swarm-STM |
| Research synthesis | Fact accumulation | Swarm-CRDT |
| Sequential reasoning | Dependencies | Single |
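`ExecutionMode`, referenced by the config type below, falls straight out of this taxonomy; a minimal sketch (the constructor names are assumptions, not an existing API):

```haskell
-- Assumed sum type for the four execution modes above.
data ExecutionMode
  = Single     -- 1 agent, sequential iterations
  | Parallel   -- N agents, no communication, CRDT merge at the end
  | SwarmSTM   -- N agents, STM shared memory
  | SwarmCRDT  -- N agents, CRDT eventual consistency
  deriving (Show, Eq)
```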
```haskell
data BenchmarkConfig = BenchmarkConfig
  { bcMode   :: ExecutionMode
  , bcAgents :: Int            -- number of agents (1 for Single)
  , bcBudget :: Double         -- cost budget, in cents (matching brCost)
  , bcTask   :: BenchmarkTask
  }

data BenchmarkResult = BenchmarkResult
  { brConfig               :: BenchmarkConfig
  , brSuccess              :: Bool
  , brQuality              :: Double -- task-specific quality metric
  , brCost                 :: Double -- total cost in cents
  , brWallTime             :: Double -- seconds
  , brTokens               :: Int    -- total tokens consumed
  , brIterations           :: Int    -- total agent iterations
  , brCoordinationOverhead :: Double -- swarm modes only (STM retries, wasted iterations)
  }

runBenchmarkSuite :: IO [BenchmarkResult]
```
```haskell
data BenchmarkTask
  = BatchCompute [Text]          -- independent computations
  | Optimization Function Domain -- find maximum
  | Sudoku [[Maybe Int]]         -- solve puzzle
  | Research Topic               -- synthesize facts
  | Sequential [Step]            -- ordered dependencies
```
```haskell
-- Compare modes on the same task
compareModesOnTask :: BenchmarkTask -> IO ComparisonReport

-- Find optimal agent count
findOptimalAgents :: ExecutionMode -> BenchmarkTask -> IO (Int, BenchmarkResult)

-- Cost efficiency frontier: (cost, quality) pairs
costEfficiencyFrontier :: [BenchmarkResult] -> [(Double, Double)]
```
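Of these, `costEfficiencyFrontier` is pure post-processing; a minimal Pareto-filter sketch over the result list:

```haskell
import Data.List (sortOn)

-- Sketch implementation of the signature above: sort points by ascending
-- cost, then keep each point only if it improves on the best quality seen
-- so far, yielding the (cost, quality) Pareto frontier.
costEfficiencyFrontier :: [BenchmarkResult] -> [(Double, Double)]
costEfficiencyFrontier results = go (-1 / 0) (sortOn fst points)
  where
    points = [ (brCost r, brQuality r) | r <- results ]
    go _    []               = []
    go best ((c, q) : rest)
      | q > best  = (c, q) : go q rest
      | otherwise = go best rest
```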
Generate a report covering the following metrics:
| Metric | How Measured |
|--------|--------------|
| Quality | Task-specific (score, solved %, facts found) |
| Cost | Actual API cost from traces |
| Wall time | Start to finish |
| Speedup | Wall-time ratio vs. single agent |
| Efficiency | Quality / Cost |
| Overhead | STM retries, wasted iterations |
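Speedup and efficiency are derived rather than measured directly; a sketch, assuming each result can be paired with the single-agent baseline for the same task:

```haskell
-- Speedup: wall-time ratio vs. the single-agent baseline (per the table).
speedup :: BenchmarkResult -> BenchmarkResult -> Double
speedup baseline r = brWallTime baseline / brWallTime r

-- Efficiency: quality per cent spent.
efficiency :: BenchmarkResult -> Double
efficiency r = brQuality r / brCost r
```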
Based on the spikes, we hypothesize:

1. Batch compute: Parallel > Swarm > Single (no coordination needed)
2. Optimization: Swarm-STM > Parallel > Single (sharing helps)
3. Sudoku: Swarm-STM >> others (constraint propagation is crucial)
4. Research: Swarm-CRDT ≈ Parallel > Single (eventual consistency is acceptable)
5. Sequential: Single > others (parallelism adds overhead)
Deliverable: a plain-text summary report in roughly this shape (numbers illustrative):

```text
BENCHMARK RESULTS
=================

Task: Optimization
------------------
Mode        Agents  Quality  Cost    Time  Speedup
Single      1       0.82     $0.003  45s   1.0x
Parallel    5       0.85     $0.015  12s   3.8x
Swarm-STM   5       0.94     $0.018  10s   4.5x  ← BEST
Swarm-CRDT  5       0.87     $0.015  11s   4.1x

Recommendation: Swarm-STM with 5 agents for optimization tasks

Task: Batch Compute
-------------------
Mode        Agents  Quality  Cost    Time  Speedup
Single      1       1.00     $0.010  60s   1.0x
Parallel    5       1.00     $0.010  12s   5.0x  ← BEST
Swarm-STM   5       1.00     $0.012  13s   4.6x
...
```
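Rendering the per-mode rows above is mechanical; a sketch with `Text.Printf` (the column widths are assumptions, and `modeName` is passed in as a pretty-printed `ExecutionMode`):

```haskell
import Text.Printf (printf)

-- Render one result row in the report layout above. The speedup is
-- passed in, computed against the single-agent baseline.
formatRow :: String -> BenchmarkResult -> Double -> String
formatRow modeName r speedupX =
  printf "%-11s %-7d %-8.2f $%-6.3f %4.0fs  %.1fx"
         modeName
         (bcAgents (brConfig r))
         (brQuality r)
         (brCost r)
         (brWallTime r)
         speedupX
```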