Run comprehensive benchmarks comparing all execution modes.
After completing the spikes, we need rigorous benchmarks to understand which execution mode wins on which class of task, and at what cost. The modes under comparison:
| Mode | Description |
|------|-------------|
| Single | 1 agent, sequential iterations |
| Parallel | N agents, no communication, CRDT merge |
| Swarm-STM | N agents, STM shared memory |
| Swarm-CRDT | N agents, CRDT eventual consistency |
The task matrix, with the mode we expect each task to favor:

| Task | Nature | Expected Best Mode |
|------|--------|--------------------|
| Batch compute | Embarrassingly parallel | Parallel |
| Optimization | Search with sharing | Swarm-STM |
| Constraint solving | Propagation | Swarm-STM |
| Research synthesis | Fact accumulation | Swarm-CRDT |
| Sequential reasoning | Dependencies | Single |
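`ExecutionMode`, referenced by the config type below, falls straight out of this taxonomy; a minimal sketch (the constructor names are assumptions, not an existing API):

```haskell
-- Assumed sum type for the four execution modes above.
data ExecutionMode
  = Single     -- 1 agent, sequential iterations
  | Parallel   -- N agents, no communication, CRDT merge at the end
  | SwarmSTM   -- N agents, STM shared memory
  | SwarmCRDT  -- N agents, CRDT eventual consistency
  deriving (Show, Eq)
```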
```haskell
data BenchmarkConfig = BenchmarkConfig
  { bcMode   :: ExecutionMode
  , bcAgents :: Int            -- number of agents (1 for Single)
  , bcBudget :: Double         -- cost budget, in cents (matching brCost)
  , bcTask   :: BenchmarkTask
  }

data BenchmarkResult = BenchmarkResult
  { brConfig               :: BenchmarkConfig
  , brSuccess              :: Bool
  , brQuality              :: Double -- task-specific quality metric
  , brCost                 :: Double -- total cost in cents
  , brWallTime             :: Double -- seconds
  , brTokens               :: Int    -- total tokens consumed
  , brIterations           :: Int    -- total agent iterations
  , brCoordinationOverhead :: Double -- swarm modes only (STM retries, wasted iterations)
  }

runBenchmarkSuite :: IO [BenchmarkResult]
```
```haskell
data BenchmarkTask
  = BatchCompute [Text]          -- independent computations
  | Optimization Function Domain -- find maximum
  | Sudoku [[Maybe Int]]         -- solve puzzle
  | Research Topic               -- synthesize facts
  | Sequential [Step]            -- ordered dependencies
```
```haskell
-- Compare modes on the same task
compareModesOnTask :: BenchmarkTask -> IO ComparisonReport

-- Find optimal agent count
findOptimalAgents :: ExecutionMode -> BenchmarkTask -> IO (Int, BenchmarkResult)

-- Cost efficiency frontier: (cost, quality) pairs
costEfficiencyFrontier :: [BenchmarkResult] -> [(Double, Double)]
```
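Of these, `costEfficiencyFrontier` is pure post-processing; a minimal Pareto-filter sketch over the result list:

```haskell
import Data.List (sortOn)

-- Sketch implementation of the signature above: sort points by ascending
-- cost, then keep each point only if it improves on the best quality seen
-- so far, yielding the (cost, quality) Pareto frontier.
costEfficiencyFrontier :: [BenchmarkResult] -> [(Double, Double)]
costEfficiencyFrontier results = go (-1 / 0) (sortOn fst points)
  where
    points = [ (brCost r, brQuality r) | r <- results ]
    go _    []               = []
    go best ((c, q) : rest)
      | q > best  = (c, q) : go q rest
      | otherwise = go best rest
```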
Generate a report covering the following metrics:
| Metric | How Measured |
|--------|--------------|
| Quality | Task-specific (score, solved %, facts found) |
| Cost | Actual API cost from traces |
| Wall time | Start to finish |
| Speedup | Wall-time ratio vs. single agent |
| Efficiency | Quality / Cost |
| Overhead | STM retries, wasted iterations |
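Speedup and efficiency are derived rather than measured directly; a sketch, assuming each result can be paired with the single-agent baseline for the same task:

```haskell
-- Speedup: wall-time ratio vs. the single-agent baseline (per the table).
speedup :: BenchmarkResult -> BenchmarkResult -> Double
speedup baseline r = brWallTime baseline / brWallTime r

-- Efficiency: quality per cent spent.
efficiency :: BenchmarkResult -> Double
efficiency r = brQuality r / brCost r
```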
Based on the spikes, we hypothesize:

1. Batch compute: Parallel > Swarm > Single (no coordination needed)
2. Optimization: Swarm-STM > Parallel > Single (sharing helps)
3. Sudoku: Swarm-STM >> others (constraint propagation is crucial)
4. Research: Swarm-CRDT ≈ Parallel > Single (eventual consistency is acceptable)
5. Sequential: Single > others (parallelism adds overhead)
Deliverable: a plain-text summary report in roughly this shape (numbers illustrative):

```text
BENCHMARK RESULTS
=================

Task: Optimization
------------------
Mode        Agents  Quality  Cost    Time  Speedup
Single      1       0.82     $0.003  45s   1.0x
Parallel    5       0.85     $0.015  12s   3.8x
Swarm-STM   5       0.94     $0.018  10s   4.5x  ← BEST
Swarm-CRDT  5       0.87     $0.015  11s   4.1x

Recommendation: Swarm-STM with 5 agents for optimization tasks

Task: Batch Compute
-------------------
Mode        Agents  Quality  Cost    Time  Speedup
Single      1       1.00     $0.010  60s   1.0x
Parallel    5       1.00     $0.010  12s   5.0x  ← BEST
Swarm-STM   5       1.00     $0.012  13s   4.6x
...
```
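Rendering the per-mode rows above is mechanical; a sketch with `Text.Printf` (the column widths are assumptions, and `modeName` is passed in as a pretty-printed `ExecutionMode`):

```haskell
import Text.Printf (printf)

-- Render one result row in the report layout above. The speedup is
-- passed in, computed against the single-agent baseline.
formatRow :: String -> BenchmarkResult -> Double -> String
formatRow modeName r speedupX =
  printf "%-11s %-7d %-8.2f $%-6.3f %4.0fs  %.1fx"
         modeName
         (bcAgents (brConfig r))
         (brQuality r)
         (brCost r)
         (brWallTime r)
         speedupX
```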