Free Monad Agent Architecture

t-369 · Epic · Omni/Agent.hs
Created 1 month ago · Updated 1 month ago

Execution Summary

26/26 Tasks Completed · $0.00 Total Cost · 0s Total Time

Design

Refactor the agent system to use a free monad over inference primitives, enabling parallel composition, pluggable state strategies, first-class observability, and checkpointing. See Omni/Agent/ARCHITECTURE.md for the full design.
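
A minimal sketch of the shape this takes, to make the design summary concrete. The constructor names, the Text synonyms, and the smart constructors below are illustrative assumptions, not the final Op.hs API; the point is that agent programs are pure values of a free monad, and interpreters decide how to run them.

{-# LANGUAGE GADTs #-}
-- Sketch only: illustrative names, not the final Op.hs API.
import Control.Monad.Free (Free, liftF)
import Data.Text (Text)

type Model = Text
type Prompt = Text
type Response = Text
type Name = Text
type Args = Text
type Result = Text

data OpF s next where
  Infer :: Model -> Prompt -> (Response -> next) -> OpF s next
  Tool  :: Name -> Args -> (Result -> next) -> OpF s next
  Par   :: [Op s a] -> ([a] -> next) -> OpF s next

instance Functor (OpF s) where
  fmap f (Infer m p k) = Infer m p (f . k)
  fmap f (Tool n a k)  = Tool n a (f . k)
  fmap f (Par ps k)    = Par ps (f . k)

-- An agent program is a pure value of this monad; sequential, parallel,
-- tracing, and checkpointing interpreters give it meaning later.
type Op s = Free (OpF s)

infer :: Model -> Prompt -> Op s Response
infer m p = liftF (Infer m p id)

callTool :: Name -> Args -> Op s Result
callTool n a = liftF (Tool n a id)

par :: [Op s a] -> Op s [a]
par ps = liftF (Par ps id)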

Child Tasks

  • t-369.1 - Create Op.hs with OpF GADT and Free Monad [Done]
  • t-369.2 - Create Trace.hs with Event and Trace Types [Done]
  • t-369.3 - Create Sequential Interpreter [Done]
  • t-369.4 - Create Legacy Wrapper for Existing Engine [Done]
  • t-369.5 - Create Parallel Interpreter with Async [Done]
  • t-369.6 - Create CRDT State Module [Done]
  • t-369.7 - Create STM State Module [Done]
  • t-369.8 - Create Event Log State Module [Done]
  • t-369.9 - Implement Checkpoint and Resume [Done]
  • t-369.10 - Rewrite Agent Loop as Native Op [Done]
  • t-369.11 - Create Parallel Research Example Program [Done]
  • t-369.12 - Create Oracle Pattern Example [Done]
  • t-369.13 - Migrate Coder Subagent to Op [Done]
  • t-369.14 - Integrate Op with Agentd Container Runtime [Done]
  • t-369.15 - Deprecate and Remove Actor.hs [Done]
  • t-369.16 - Documentation and Examples [Done]
  • t-369.17 - Spike: Code-Only Agent Primitive [Done]
  • t-369.18 - Wire up Docker sandbox for code-only agents [Done]
  • t-369.19 - Parallel code-only agents spike [Done]
  • t-369.20 - Swarm with STM shared memory experiment [Done]
  • t-369.21 - Comprehensive benchmark: single vs parallel vs swarm [Done]
  • t-369.22 - Add correct token usage tracking [Done]
  • t-369.23 - CUAD Contract Review Benchmark - Setup and Baseline [Done]
  • t-369.24 - CUAD Contract Review - Swarm with STM Implementation [Done]
  • t-369.25 - CUAD Contract Review - Analysis and Conclusions [Done]
  • t-369.26 - Sudoku Swarm: Test STM coordination on constraint propagation [Done]

Timeline (9)

💬 [engineer] · 1 month ago

Task Dependency Graph

The tasks should be completed roughly in this order based on dependencies:

Phase 1: Foundation (No Dependencies)

  • t-369.1 Create Op.hs - The core free monad types

Phase 2: Core Infrastructure (Depends on t-369.1)

  • t-369.2 Create Trace.hs - Event and trace types (depends on t-369.1)
  • t-369.6 Create CRDT State Module (depends on t-369.1)
  • t-369.7 Create STM State Module (depends on t-369.1)
  • t-369.8 Create Event Log State Module (depends on t-369.1)

Phase 3: Interpreters (Depends on t-369.1, t-369.2)

  • t-369.3 Create Sequential Interpreter (depends on t-369.1, t-369.2)
  • t-369.5 Create Parallel Interpreter (depends on t-369.3)

Phase 4: Integration (Depends on interpreters)

  • t-369.4 Create Legacy Wrapper (depends on t-369.3)
  • t-369.9 Implement Checkpoint/Resume (depends on t-369.2, t-369.3)
  • t-369.10 Rewrite Agent Loop as Native Op (depends on t-369.3, t-369.4)

Phase 5: Examples and Migration (Depends on core)

  • t-369.11 Parallel Research Example (depends on t-369.5, t-369.6)
  • t-369.12 Oracle Pattern Example (depends on t-369.5)
  • t-369.13 Migrate Coder Subagent (depends on t-369.10, t-369.9)

Phase 6: Cleanup and Docs

  • t-369.14 Integrate with Agentd (depends on t-369.10, t-369.5)
  • t-369.15 Deprecate Actor.hs (depends on t-369.10)
  • t-369.16 Documentation (depends on t-369.10, t-369.11)

Key Design Document

Read Omni/Agent/ARCHITECTURE.md before starting any task. It contains the full design rationale and code examples.

💬 [engineer] · 1 month ago

Updated Plan: Spike First

Added t-369.17 (Spike: Code-Only Agent Primitive) as an early validation experiment.

Recommended Order Now:

Phase 0: Validate Core Hypothesis

  • t-369.17 Spike: Code-Only Agent (2-3 days, time-boxed)

This tests whether Think + Execute is a viable primitive before we commit to the full Op architecture. The results will inform whether:

  • Tools should be first-class or just sugar over code execution
  • The Op primitives should center on Think/Execute vs Infer/Tool
  • Parallel compute units are the right scaling model

If the spike succeeds: proceed with the Op architecture, but simplify to Think + Execute as the core primitives.

If the spike fails: proceed with the Op architecture using the current Infer + Tool model, and document why code-only doesn't work.

Phases 1-6: as before, possibly modified based on the spike findings.

Key Question the Spike Answers

Is this:

data OpF s next where
  Think   :: Model -> Prompt -> (Text -> next) -> OpF s next
  Execute :: Code -> (Result -> next) -> OpF s next
  Par     :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are just code the model writes

Better than this:

data OpF s next where
  Infer :: Model -> Prompt -> (Response -> next) -> OpF s next
  Tool  :: Name -> Args -> (Result -> next) -> OpF s next
  Par   :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are predefined capabilities

The spike will tell us.
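
For a concrete feel of what the code-only variant buys, here is a sketch of a program written against the Think/Execute GADT above. It assumes the same Op s = Free (OpF s) wrapper and Functor instance as the earlier sketch, plus Code and Result newtypes wrapping Text; the smart constructors and example names are illustrative, not the actual API.

{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: assumes Op s = Free (OpF s) over the Think/Execute GADT,
-- with Model/Prompt as Text and newtype Code/Result wrappers around Text.
import Control.Monad.Free (liftF)
import Data.Text (Text)

think :: Model -> Prompt -> Op s Text
think m p = liftF (Think m p id)

execute :: Code -> Op s Result
execute c = liftF (Execute c id)

par :: [Op s a] -> Op s [a]
par ps = liftF (Par ps id)

-- A "tool call" is just code the model writes, run in a sandbox:
extractDates :: Model -> Text -> Op s Result
extractDates model doc = do
  code <- think model ("Write Python that prints every date found in:\n" <> doc)
  execute (Code code)

-- Parallel fan-out is an ordinary library function, not a framework feature:
extractAll :: Model -> [Text] -> Op s [Result]
extractAll model docs = par (map (extractDates model) docs)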

🔄 [human] Open → Done · 1 month ago
💬 [engineer] · 1 month ago

Progress Update (2026-01-10)

Completed (17/22 tasks)

Phases 1-2 (Foundation, Core Infrastructure) and Phases 3-5 (Interpreters through Examples) are DONE:

  • Op.hs free monad ✓
  • Trace.hs ✓
  • Sequential/Parallel interpreters ✓
  • CRDT/STM/EventLog state modules ✓
  • Checkpoint/Resume ✓
  • Legacy wrapper ✓
  • Example programs ✓
  • Coder migration ✓
  • Agentd integration ✓

Key Finding: Code-Only Works

The spike (t-369.17) validated Think + Execute as a fundamental primitive:

  • 100% success on simple/medium tasks
  • Single iteration (model generates correct code first try)
  • ~$0.00003 per task

Remaining (5 tasks) - Swarm Validation

| Task | Description | Status |
|------|-------------|--------|
| t-369.18 | Docker sandbox for code-only | Ready |
| t-369.22 | Correct token tracking | Ready |
| t-369.19 | Parallel code-only spike | Blocked on t-369.18 |
| t-369.20 | Swarm + STM experiment | Blocked on t-369.19, t-369.7 |
| t-369.21 | Comprehensive benchmark | Blocked on t-369.20 |

Next Steps

1. Wire up Docker sandbox (t-369.18) - security critical
2. Fix token tracking (t-369.22) - accuracy critical
3. Run parallel spike (t-369.19) - validate scaling
4. Run swarm experiment (t-369.20) - the big test
5. Benchmark everything (t-369.21) - data for decisions

Design Doc

See Omni/Agent/DESIGN.md for consolidated learnings and architecture.

💬 [engineer] · 1 month ago

CUAD Contract Review Experiment (2026-01-10)

Added 3 tasks to properly validate STM coordination on a real-world task:

The Problem with Previous Experiment

The optimization task (sin*cos*sin) didn't benefit from sharing because:

  • Solutions can be found independently
  • No constraint propagation
  • No pattern learning across instances

Better Test: Contract Clause Extraction

Using the CUAD dataset (500+ real contracts, 41 clause types, ground-truth annotations):

t-369.23: Setup + Baseline

  • Download CUAD dataset
  • Build single-agent contract reviewer
  • Measure F1 at N=5, 10, 20, 50 contracts
  • Establish where single agent breaks down

t-369.24: Swarm Implementation

  • Reviewer agents (one per contract)
  • Pattern detection agent (finds cross-doc patterns)
  • Anomaly detection agent (flags unusual clauses)
  • Hints shared via STM ("indemnification usually in Section 8")
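
A rough sketch of what the shared hint board could look like with plain STM. The Hint type and function names are illustrative assumptions, not the actual Omni/Agent STM state module; the point is that the pattern-detection agent publishes observations and reviewer agents read them before scanning the next contract.

-- Sketch only: a plausible shape for the shared STM hint board;
-- Hint, postHint, and readHints are illustrative names.
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVar)
import Data.Text (Text)

data Hint = Hint
  { hintClause  :: Text  -- e.g. "Indemnification"
  , hintPattern :: Text  -- e.g. "usually appears in Section 8"
  }

type HintBoard = TVar [Hint]

newHintBoard :: IO HintBoard
newHintBoard = newTVarIO []

-- The pattern-detection agent publishes a cross-document observation.
postHint :: HintBoard -> Hint -> IO ()
postHint board h = atomically (modifyTVar' board (h :))

-- Reviewer agents read the current hints before scanning a contract,
-- so later contracts benefit from earlier discoveries.
readHints :: HintBoard -> IO [Hint]
readHints board = atomically (readTVar board)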

t-369.25: Analysis

  • Compare single vs swarm at each scale
  • Determine if sharing actually helps
  • Draw conclusions for cognitive compute vision

Why This Test is Better

  • Unstructured data (legal text, not clean math)
  • Judgment required ("is this a liability cap?")
  • Sharing should help (patterns across contracts)
  • Ground truth exists (CUAD annotations)
  • Real commercial value (contract review is expensive)

Success Criteria

Swarm validated if:

  • F1 stays high at N=50 where single agent fails
  • Patterns detected are accurate and useful
  • Later contracts benefit from early discoveries
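
For reference, a minimal sketch of the F1 scoring these criteria assume, treating extraction as exact-match sets of (clause type, span) compared against the CUAD annotations. The actual evaluation may score per clause type or allow partial matches.

-- Sketch only: exact-match F1 over sets of predicted vs gold extractions.
import qualified Data.Set as Set

f1 :: Ord a => Set.Set a -> Set.Set a -> Double
f1 predicted gold
  | Set.null predicted && Set.null gold = 1.0
  | precision + recall == 0             = 0.0
  | otherwise = 2 * precision * recall / (precision + recall)
  where
    tp        = fromIntegral (Set.size (Set.intersection predicted gold))
    precision = if Set.null predicted then 0 else tp / fromIntegral (Set.size predicted)
    recall    = if Set.null gold      then 0 else tp / fromIntegral (Set.size gold)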