Free Monad Agent Architecture

t-369 · Epic · Omni/Agent.hs
Created 1 month ago · Updated 1 month ago

Execution Summary

26/26 Tasks Completed · $0.00 Total Cost · 0s Total Time

Design

Refactor the agent system to use a free monad over inference primitives, enabling parallel composition, pluggable state strategies, first-class observability, and checkpointing. See Omni/Agent/ARCHITECTURE.md for the full design.
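
A minimal sketch of the shape this takes, to make the design summary concrete. The constructor names, the Text synonyms, and the smart constructors below are illustrative assumptions, not the final Op.hs API; the point is that agent programs are pure values of a free monad, and interpreters decide how to run them.

{-# LANGUAGE GADTs #-}
-- Sketch only: illustrative names, not the final Op.hs API.
import Control.Monad.Free (Free, liftF)
import Data.Text (Text)

type Model = Text
type Prompt = Text
type Response = Text
type Name = Text
type Args = Text
type Result = Text

data OpF s next where
  Infer :: Model -> Prompt -> (Response -> next) -> OpF s next
  Tool  :: Name -> Args -> (Result -> next) -> OpF s next
  Par   :: [Op s a] -> ([a] -> next) -> OpF s next

instance Functor (OpF s) where
  fmap f (Infer m p k) = Infer m p (f . k)
  fmap f (Tool n a k)  = Tool n a (f . k)
  fmap f (Par ps k)    = Par ps (f . k)

-- An agent program is a pure value of this monad; sequential, parallel,
-- tracing, and checkpointing interpreters give it meaning later.
type Op s = Free (OpF s)

infer :: Model -> Prompt -> Op s Response
infer m p = liftF (Infer m p id)

callTool :: Name -> Args -> Op s Result
callTool n a = liftF (Tool n a id)

par :: [Op s a] -> Op s [a]
par ps = liftF (Par ps id)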

Child Tasks

  • t-369.1 - Create Op.hs with OpF GADT and Free Monad [Done]
  • t-369.2 - Create Trace.hs with Event and Trace Types [Done]
  • t-369.3 - Create Sequential Interpreter [Done]
  • t-369.4 - Create Legacy Wrapper for Existing Engine [Done]
  • t-369.5 - Create Parallel Interpreter with Async [Done]
  • t-369.6 - Create CRDT State Module [Done]
  • t-369.7 - Create STM State Module [Done]
  • t-369.8 - Create Event Log State Module [Done]
  • t-369.9 - Implement Checkpoint and Resume [Done]
  • t-369.10 - Rewrite Agent Loop as Native Op [Done]
  • t-369.11 - Create Parallel Research Example Program [Done]
  • t-369.12 - Create Oracle Pattern Example [Done]
  • t-369.13 - Migrate Coder Subagent to Op [Done]
  • t-369.14 - Integrate Op with Agentd Container Runtime [Done]
  • t-369.15 - Deprecate and Remove Actor.hs [Done]
  • t-369.16 - Documentation and Examples [Done]
  • t-369.17 - Spike: Code-Only Agent Primitive [Done]
  • t-369.18 - Wire up Docker sandbox for code-only agents [Done]
  • t-369.19 - Parallel code-only agents spike [Done]
  • t-369.20 - Swarm with STM shared memory experiment [Done]
  • t-369.21 - Comprehensive benchmark: single vs parallel vs swarm [Done]
  • t-369.22 - Add correct token usage tracking [Done]
  • t-369.23 - CUAD Contract Review Benchmark - Setup and Baseline [Done]
  • t-369.24 - CUAD Contract Review - Swarm with STM Implementation [Done]
  • t-369.25 - CUAD Contract Review - Analysis and Conclusions [Done]
  • t-369.26 - Sudoku Swarm: Test STM coordination on constraint propagation [Done]

Timeline (9)

💬 [engineer] · 1 month ago

Task Dependency Graph

The tasks should be completed roughly in this order based on dependencies:

Phase 1: Foundation (No Dependencies)

  • t-369.1 Create Op.hs - The core free monad types

Phase 2: Core Infrastructure (Depends on t-369.1)

  • t-369.2 Create Trace.hs - Event and trace types (depends on t-369.1)
  • t-369.6 Create CRDT State Module (depends on t-369.1)
  • t-369.7 Create STM State Module (depends on t-369.1)
  • t-369.8 Create Event Log State Module (depends on t-369.1)

Phase 3: Interpreters (Depends on t-369.1, t-369.2)

  • t-369.3 Create Sequential Interpreter (depends on t-369.1, t-369.2)
  • t-369.5 Create Parallel Interpreter (depends on t-369.3)

Phase 4: Integration (Depends on interpreters)

  • t-369.4 Create Legacy Wrapper (depends on t-369.3)
  • t-369.9 Implement Checkpoint/Resume (depends on t-369.2, t-369.3)
  • t-369.10 Rewrite Agent Loop as Native Op (depends on t-369.3, t-369.4)

Phase 5: Examples and Migration (Depends on core)

  • t-369.11 Parallel Research Example (depends on t-369.5, t-369.6)
  • t-369.12 Oracle Pattern Example (depends on t-369.5)
  • t-369.13 Migrate Coder Subagent (depends on t-369.10, t-369.9)

Phase 6: Cleanup and Docs

  • t-369.14 Integrate with Agentd (depends on t-369.10, t-369.5)
  • t-369.15 Deprecate Actor.hs (depends on t-369.10)
  • t-369.16 Documentation (depends on t-369.10, t-369.11)

Key Design Document

Read Omni/Agent/ARCHITECTURE.md before starting any task. It contains the full design rationale and code examples.

💬 [engineer] · 1 month ago

Updated Plan: Spike First

Added t-369.17 (Spike: Code-Only Agent Primitive) as an early validation experiment.

Recommended Order Now:

Phase 0: Validate Core Hypothesis

  • t-369.17 Spike: Code-Only Agent (2-3 days, time-boxed)

This tests whether Think + Execute is a viable primitive before we commit to the full Op architecture. The results will inform whether:

  • Tools should be first-class or just sugar over code execution
  • The Op primitives should center on Think/Execute vs Infer/Tool
  • Parallel compute units are the right scaling model

If the spike succeeds: proceed with the Op architecture, but simplify to Think + Execute as the core primitives.

If the spike fails: proceed with the Op architecture using the current Infer + Tool model, and document why code-only doesn't work.

Phases 1-6: as before, possibly modified based on the spike findings.

Key Question the Spike Answers

Is this:

data OpF s next where
  Think   :: Model -> Prompt -> (Text -> next) -> OpF s next
  Execute :: Code -> (Result -> next) -> OpF s next
  Par     :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are just code the model writes

Better than this:

data OpF s next where
  Infer :: Model -> Prompt -> (Response -> next) -> OpF s next
  Tool  :: Name -> Args -> (Result -> next) -> OpF s next
  Par   :: [Op s a] -> ([a] -> next) -> OpF s next
  -- tools are predefined capabilities

The spike will tell us.
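
For a concrete feel of what the code-only variant buys, here is a sketch of a program written against the Think/Execute GADT above. It assumes the same Op s = Free (OpF s) wrapper and Functor instance as the earlier sketch, plus Code and Result newtypes wrapping Text; the smart constructors and example names are illustrative, not the actual API.

{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: assumes Op s = Free (OpF s) over the Think/Execute GADT,
-- with Model/Prompt as Text and newtype Code/Result wrappers around Text.
import Control.Monad.Free (liftF)
import Data.Text (Text)

think :: Model -> Prompt -> Op s Text
think m p = liftF (Think m p id)

execute :: Code -> Op s Result
execute c = liftF (Execute c id)

par :: [Op s a] -> Op s [a]
par ps = liftF (Par ps id)

-- A "tool call" is just code the model writes, run in a sandbox:
extractDates :: Model -> Text -> Op s Result
extractDates model doc = do
  code <- think model ("Write Python that prints every date found in:\n" <> doc)
  execute (Code code)

-- Parallel fan-out is an ordinary library function, not a framework feature:
extractAll :: Model -> [Text] -> Op s [Result]
extractAll model docs = par (map (extractDates model) docs)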

🔄 [human] Open → Done · 1 month ago
💬 [engineer] · 1 month ago

Progress Update (2026-01-10)

Completed (17/22 tasks)

Phases 1-2 (Foundation, Core Infrastructure) and Phases 3-5 (Interpreters through Examples) are DONE:

  • Op.hs free monad ✓
  • Trace.hs ✓
  • Sequential/Parallel interpreters ✓
  • CRDT/STM/EventLog state modules ✓
  • Checkpoint/Resume ✓
  • Legacy wrapper ✓
  • Example programs ✓
  • Coder migration ✓
  • Agentd integration ✓

Key Finding: Code-Only Works

The spike (t-369.17) validated Think + Execute as a fundamental primitive:

  • 100% success on simple/medium tasks
  • Single iteration (model generates correct code first try)
  • ~$0.00003 per task

Remaining (5 tasks) - Swarm Validation

| Task | Description | Status |
|------|-------------|--------|
| t-369.18 | Docker sandbox for code-only | Ready |
| t-369.22 | Correct token tracking | Ready |
| t-369.19 | Parallel code-only spike | Blocked on t-369.18 |
| t-369.20 | Swarm + STM experiment | Blocked on t-369.19, t-369.7 |
| t-369.21 | Comprehensive benchmark | Blocked on t-369.20 |

Next Steps

1. Wire up Docker sandbox (t-369.18) - security critical
2. Fix token tracking (t-369.22) - accuracy critical
3. Run parallel spike (t-369.19) - validate scaling
4. Run swarm experiment (t-369.20) - the big test
5. Benchmark everything (t-369.21) - data for decisions

Design Doc

See Omni/Agent/DESIGN.md for consolidated learnings and architecture.

💬 [engineer] · 1 month ago

CUAD Contract Review Experiment (2026-01-10)

Added 3 tasks to properly validate STM coordination on a real-world task:

The Problem with Previous Experiment

The optimization task (sin*cos*sin) didn't benefit from sharing because:

  • Solutions can be found independently
  • No constraint propagation
  • No pattern learning across instances

Better Test: Contract Clause Extraction

Using the CUAD dataset (500+ real contracts, 41 clause types, ground-truth annotations):

t-369.23: Setup + Baseline

  • Download CUAD dataset
  • Build single-agent contract reviewer
  • Measure F1 at N=5, 10, 20, 50 contracts
  • Establish where single agent breaks down

t-369.24: Swarm Implementation

  • Reviewer agents (one per contract)
  • Pattern detection agent (finds cross-doc patterns)
  • Anomaly detection agent (flags unusual clauses)
  • Hints shared via STM ("indemnification usually in Section 8")
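
A rough sketch of what the shared hint board could look like with plain STM. The Hint type and function names are illustrative assumptions, not the actual Omni/Agent STM state module; the point is that the pattern-detection agent publishes observations and reviewer agents read them before scanning the next contract.

-- Sketch only: a plausible shape for the shared STM hint board;
-- Hint, postHint, and readHints are illustrative names.
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVar)
import Data.Text (Text)

data Hint = Hint
  { hintClause  :: Text  -- e.g. "Indemnification"
  , hintPattern :: Text  -- e.g. "usually appears in Section 8"
  }

type HintBoard = TVar [Hint]

newHintBoard :: IO HintBoard
newHintBoard = newTVarIO []

-- The pattern-detection agent publishes a cross-document observation.
postHint :: HintBoard -> Hint -> IO ()
postHint board h = atomically (modifyTVar' board (h :))

-- Reviewer agents read the current hints before scanning a contract,
-- so later contracts benefit from earlier discoveries.
readHints :: HintBoard -> IO [Hint]
readHints board = atomically (readTVar board)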

t-369.25: Analysis

  • Compare single vs swarm at each scale
  • Determine if sharing actually helps
  • Draw conclusions for cognitive compute vision

Why This Test is Better

  • Unstructured data (legal text, not clean math)
  • Judgment required ("is this a liability cap?")
  • Sharing should help (patterns across contracts)
  • Ground truth exists (CUAD annotations)
  • Real commercial value (contract review is expensive)

Success Criteria

Swarm validated if:

  • F1 stays high at N=50 where single agent fails
  • Patterns detected are accurate and useful
  • Later contracts benefit from early discoveries
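
For reference, a minimal sketch of the F1 scoring these criteria assume, treating extraction as exact-match sets of (clause type, span) compared against the CUAD annotations. The actual evaluation may score per clause type or allow partial matches.

-- Sketch only: exact-match F1 over sets of predicted vs gold extractions.
import qualified Data.Set as Set

f1 :: Ord a => Set.Set a -> Set.Set a -> Double
f1 predicted gold
  | Set.null predicted && Set.null gold = 1.0
  | precision + recall == 0             = 0.0
  | otherwise = 2 * precision * recall / (precision + recall)
  where
    tp        = fromIntegral (Set.size (Set.intersection predicted gold))
    precision = if Set.null predicted then 0 else tp / fromIntegral (Set.size predicted)
    recall    = if Set.null gold      then 0 else tp / fromIntegral (Set.size gold)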