Refactor the agent system to use a free monad over inference primitives, enabling parallel composition, pluggable state strategies, first-class observability, and checkpointing. See Omni/Agent/ARCHITECTURE.md for full design.
Added t-369.17 (Spike: Code-Only Agent Primitive) as an early validation experiment.
Phase 0: Validate Core Hypothesis
This tests whether Think + Execute is a viable primitive before we commit to the full Op architecture. The result determines which path we take:
If the spike succeeds: proceed with the Op architecture, but simplify to Think + Execute as the core primitives.
If the spike fails: proceed with the Op architecture using the current Infer + Tool model, and document why code-only doesn't work.
Phases 1-6: (As before, but may be modified based on spike findings)
Is this:
```haskell
data OpF s next where
  Think   :: Model -> Prompt -> (Text -> next) -> OpF s next
  Execute :: Code -> (Result -> next) -> OpF s next
  Par     :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are just code the model writes
```
Better than this:
```haskell
data OpF s next where
  Infer :: Model -> Prompt -> (Response -> next) -> OpF s next
  Tool  :: Name -> Args -> (Result -> next) -> OpF s next
  Par   :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are predefined capabilities
```
The spike will tell us.
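For concreteness, here is a minimal sketch of how the Think + Execute variant wires into a free monad. It assumes the `free` package; the module name, the placeholder domain types, and the smart constructors are illustrative, not the actual Omni/Agent definitions.

```haskell
{-# LANGUAGE GADTs #-}
module Omni.Agent.Op where

import Control.Monad.Free (Free, liftF)
import Data.Text (Text)

-- Placeholder domain types; the real ones live in the agent codebase.
data Model  = Model Text
data Prompt = Prompt Text
data Code   = Code Text
data Result = Result Text

-- The Think + Execute variant of the instruction functor.
data OpF s next where
  Think   :: Model -> Prompt -> (Text -> next) -> OpF s next
  Execute :: Code -> (Result -> next) -> OpF s next
  Par     :: [Op s a] -> ([a] -> next) -> OpF s next

-- Par hides an existential 'a', so the Functor instance is written by hand.
instance Functor (OpF s) where
  fmap f (Think m p k)  = Think m p (f . k)
  fmap f (Execute c k)  = Execute c (f . k)
  fmap f (Par ops k)    = Par ops (f . k)

-- Programs are the free monad over the instruction functor.
type Op s = Free (OpF s)

-- Smart constructors so agent programs read as plain do-notation.
think :: Model -> Prompt -> Op s Text
think m p = liftF (Think m p id)

execute :: Code -> Op s Result
execute c = liftF (Execute c id)

par :: [Op s a] -> Op s [a]
par ops = liftF (Par ops id)
```

With this wiring, an agent program is ordinary do-notation over Op, and Par is just another instruction that an interpreter can run concurrently, sequentially, or under a scheduler.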
Phases 1-2 (Foundation) and 3-5 (Integration) are DONE:
The spike (t-369.17) validated Think + Execute as a fundamental primitive:
| Task | Description | Status |
|------|-------------|--------|
| t-369.18 | Docker sandbox for code-only | Ready |
| t-369.22 | Correct token tracking | Ready |
| t-369.19 | Parallel code-only spike | Blocked on t-369.18 |
| t-369.20 | Swarm + STM experiment | Blocked on t-369.19, t-369.7 |
| t-369.21 | Comprehensive benchmark | Blocked on t-369.20 |
1. Wire up Docker sandbox (t-369.18) - security critical (see the sketch after this list)
2. Fix token tracking (t-369.22) - accuracy critical
3. Run parallel spike (t-369.19) - validate scaling
4. Run swarm experiment (t-369.20) - the big test
5. Benchmark everything (t-369.21) - data for decisions
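Item 1 is where code-only execution gets its security boundary. As a sketch of what the Execute backend for t-369.18 could look like (the image, resource limits, module name, and error handling are assumptions for illustration, not the agreed sandbox design), each payload runs in a throwaway, network-less container:

```haskell
module Omni.Agent.Sandbox where

import qualified Data.Text as T
import Omni.Agent.Op (Code (..), Result (..))
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Run one Execute payload in a disposable container: no network, capped
-- memory and CPU, removed on exit. Everything here (image, limits, the way
-- errors are folded into Result) is a placeholder for the real design.
runExecuteInDocker :: Code -> IO Result
runExecuteInDocker (Code src) = do
  (exitCode, out, err) <-
    readProcessWithExitCode
      "docker"
      [ "run", "--rm"
      , "--network", "none"   -- no outbound network
      , "--memory", "256m"    -- cap memory
      , "--cpus", "1"         -- cap CPU
      , "python:3.12-slim"
      , "python", "-c", T.unpack src
      ]
      ""                      -- empty stdin
  pure $ case exitCode of
    ExitSuccess   -> Result (T.pack out)
    ExitFailure _ -> Result (T.pack ("error: " <> err))
```

A backend like this would simply be passed to the interpreter as its `Code -> IO Result` argument (see the interpreter sketch after the phase list below), keeping sandboxing out of agent programs entirely.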
See Omni/Agent/DESIGN.md for consolidated learnings and architecture.
Added 3 tasks to properly validate STM coordination on a real-world task (a minimal sketch of the shared-findings pattern follows the task list):
The optimization task (sin*cos*sin) didn't benefit from sharing because:
Using the CUAD dataset (500+ real contracts, 41 clause types, ground-truth annotations):
t-369.23: Setup + Baseline
t-369.24: Swarm Implementation
t-369.25: Analysis
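For the swarm implementation (t-369.24), the coordination primitive under test is STM: workers publish findings to a shared blackboard and read what others have learned between steps. A minimal sketch, assuming the `stm` and `async` packages; Blackboard, Finding, and runSwarm are illustrative names, not the experiment's actual code.

```haskell
module Omni.Agent.Blackboard where

import Control.Concurrent.Async (mapConcurrently)
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVarIO)
import Data.Text (Text)

-- Whatever a worker learns and wants to share (kept abstract here).
type Finding = Text

newtype Blackboard = Blackboard (TVar [Finding])

newBlackboard :: IO Blackboard
newBlackboard = Blackboard <$> newTVarIO []

-- Publish a finding atomically; other workers see it on their next read.
publish :: Blackboard -> Finding -> IO ()
publish (Blackboard tv) f = atomically (modifyTVar' tv (f :))

-- Read everything published so far (e.g. before a worker's next Think).
snapshot :: Blackboard -> IO [Finding]
snapshot (Blackboard tv) = readTVarIO tv

-- One worker per contract, all sharing the same blackboard.
runSwarm :: (Blackboard -> doc -> IO Finding) -> [doc] -> IO [Finding]
runSwarm worker docs = do
  board <- newBlackboard
  mapConcurrently
    (\d -> do
        f <- worker board d
        publish board f
        pure f)
    docs
```

Whether this sharing actually helps relative to the independent baseline is what the analysis task (t-369.25) is meant to settle.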
Swarm validated if:
Task Dependency Graph
The tasks should be completed roughly in this order based on dependencies:
Phase 1: Foundation (No Dependencies)
Phase 2: Core Infrastructure (Depends on t-369.1)
Phase 3: Interpreters (Depends on t-369.1, t-369.2; a minimal interpreter sketch follows this phase list)
Phase 4: Integration (Depends on interpreters)
Phase 5: Examples and Migration (Depends on core)
Phase 6: Cleanup and Docs
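Phase 3 is where the properties claimed at the top (observability, pluggable execution, checkpointing) are actually earned: every primitive flows through one natural transformation. A minimal logging interpreter sketch, assuming the `free` and `async` packages and the hypothetical Omni.Agent.Op module from the earlier sketch; callModel and runSandboxed stand in for the real inference and sandbox backends.

```haskell
{-# LANGUAGE GADTs #-}
module Omni.Agent.Interpret where

import Control.Concurrent.Async (mapConcurrently)
import Control.Monad.Free (foldFree)
import Data.Text (Text)

import Omni.Agent.Op (Code, Model, Op, OpF (..), Prompt, Result)

-- A logging IO interpreter: every primitive passes through this single
-- natural transformation, which is where observability, checkpointing, and
-- sandboxing hooks would live. The two function arguments are the
-- pluggable backends.
runOp :: (Model -> Prompt -> IO Text)  -- inference backend
      -> (Code -> IO Result)           -- execution backend
      -> Op s a
      -> IO a
runOp callModel runSandboxed = foldFree step
  where
    step :: OpF t x -> IO x
    step (Think m p k) = do
      putStrLn "[op] Think"
      k <$> callModel m p
    step (Execute c k) = do
      putStrLn "[op] Execute"
      k <$> runSandboxed c
    step (Par ops k) = do
      putStrLn "[op] Par"
      k <$> mapConcurrently (runOp callModel runSandboxed) ops
```

Swapping the step function (or layering state and logging effects around it) is how pluggable state strategies and checkpointing would slot in, without touching the agent programs themselves.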
Key Design Document
Read Omni/Agent/ARCHITECTURE.md before starting any task. It contains the full design rationale and code examples.