Spike: Code-Only Agent Primitive

t-369.17 · WorkTask · Omni/Agent.hs
Parent: t-369 · Created 1 month ago · Updated 1 month ago

Description

Experimental spike to test the Think + Execute primitive as the foundation for agent computation.

Hypothesis

A minimal agent with only two operations - Think (LLM generates code) and Execute (run the code) - can accomplish arbitrary tasks by inventing its own tools on the fly. This may be a more fundamental primitive than our current tool-based approach.

Goals

1. Validate that code-only agents can solve real tasks
2. Measure cost/latency vs tool-based agents
3. Identify failure modes and limitations
4. Determine if this should inform the Op architecture

Experiment Design

Minimal Implementation

Create a standalone experiment (not integrated with Op yet):

-- Omni/Agent/Experiments/CodeOnly.hs

data CodeResult
  = Success Text        -- stdout
  | Error Text          -- stderr
  | Timeout
  | SandboxViolation

-- The two primitives
think :: Model -> Prompt -> IO Text
think model prompt = do
  response <- Provider.chat model [] [Message User prompt]
  pure (content response)

execute :: Text -> IO CodeResult
execute code = runInSandbox defaultSandbox "python" code

-- The loop
codeOnlyAgent :: Text -> IO Text
codeOnlyAgent task = loop task 0
  where
    maxIterations = 20
    
    loop task iteration
      | iteration >= maxIterations = pure "Max iterations exceeded"
      | otherwise = do
          -- Think: generate code to accomplish task
          code <- think sonnet $ mconcat
            [ "Task: ", task
            , "\n\nWrite Python code to accomplish this task."
            , "\nPrint the final answer to stdout."
            , "\nOnly output code, no explanation."
            ]
          
          -- Execute: run it
          result <- execute code
          
          case result of
            Success output -> pure output
            Error err -> do
              -- Think again with error context
              loop (task <> "\n\nPrevious attempt failed with: " <> err) (iteration + 1)
            Timeout -> loop (task <> "\n\nPrevious attempt timed out. Try a more efficient approach.") (iteration + 1)
            SandboxViolation -> pure "Task requires capabilities outside sandbox"
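A quick way to exercise the loop, sketched under the assumption that the module above compiles against the existing Provider and sandbox helpers (the task string is one of the test tasks listed below):

import qualified Data.Text.IO as TIO

-- Hypothetical smoke test for the think/execute loop.
main :: IO ()
main = do
  answer <- codeOnlyAgent "Find the prime factors of 84"
  TIO.putStrLn answer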

Sandbox

Use a real sandbox (not just trust the model):

data Sandbox = Sandbox
  { sbTimeout :: Seconds      -- max 30s per execution
  , sbMemory :: Megabytes     -- max 512MB
  , sbNetwork :: Bool         -- allow network? (start with False)
  , sbFilesystem :: [FilePath] -- allowed paths (start with temp dir only)
  }

defaultSandbox :: Sandbox
defaultSandbox = Sandbox 30 512 False ["/tmp/agent-workspace"]

runInSandbox :: Sandbox -> Text -> Text -> IO CodeResult
runInSandbox sb lang code = do
  -- Option 1: Use nsjail/firejail
  -- Option 2: Use Docker with resource limits
  -- Option 3: Use Nix sandbox
  -- Start simple: Docker with --memory, --cpus, --network none
  ...
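As a starting point for the Docker option, a rough sketch is below. Assumptions: Seconds and Megabytes render as plain numbers via show, docker is on PATH, the python:3.12-slim image is available, and bind mounts for sbFilesystem are left out for now.

import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)
import Data.Text (Text)
import qualified Data.Text as T

-- Sketch: Docker with --memory, --cpus, --network none, reading the
-- program from stdin. Uses the CodeResult and Sandbox types above.
runInSandboxDocker :: Sandbox -> Text -> IO CodeResult
runInSandboxDocker sb code = do
  let args =
        [ "run", "--rm", "-i"
        , "--network", if sbNetwork sb then "bridge" else "none"
        , "--memory", show (sbMemory sb) <> "m"
        , "--cpus", "1"
        , "python:3.12-slim"
        , "timeout", show (sbTimeout sb)   -- coreutils timeout inside the container
        , "python", "-"
        ]
  (exitCode, out, err) <- readProcessWithExitCode "docker" args (T.unpack code)
  pure $ case exitCode of
    ExitSuccess     -> Success (T.pack out)
    ExitFailure 124 -> Timeout             -- exit code used by coreutils timeout
    ExitFailure _   -> Error (T.pack err)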

Test Tasks

Run the code-only agent on a variety of tasks:

Simple (should work):

  • "What is 2 + 2?"
  • "Sort this list: [3, 1, 4, 1, 5, 9]"
  • "Find the prime factors of 84"
  • "Convert 'hello world' to base64"

Medium (interesting):

  • "Parse this JSON and extract all email addresses: {...}"
  • "Find the most common word in this text: ..."
  • "Calculate the Fibonacci sequence up to 1000"
  • "Solve this equation: 2x + 5 = 13"

Hard (might fail, that's ok):

  • "Download the front page of Hacker News and list the top 5 stories" (needs network)
  • "Read the file /tmp/data.csv and plot a histogram" (needs file access)
  • "Find a bug in this code: ..."

Comparison tasks (same task, code-only vs tool-based):

  • Web search + summarize
  • File manipulation
  • Data analysis
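For the benchmark harness, the concrete tasks above can be restated as data. A sketch follows; the Difficulty type and tuple shape are placeholders, and tasks whose payloads are elided above are left out until fixtures exist.

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

data Difficulty = Simple | Medium | Hard
  deriving (Show, Eq)

-- Subset of the task list above; the JSON / text / code payloads
-- marked "..." are added once fixtures exist.
testTasks :: [(Difficulty, Text)]
testTasks =
  [ (Simple, "What is 2 + 2?")
  , (Simple, "Sort this list: [3, 1, 4, 1, 5, 9]")
  , (Simple, "Find the prime factors of 84")
  , (Simple, "Convert 'hello world' to base64")
  , (Medium, "Calculate the Fibonacci sequence up to 1000")
  , (Medium, "Solve this equation: 2x + 5 = 13")
  , (Hard, "Download the front page of Hacker News and list the top 5 stories")
  , (Hard, "Read the file /tmp/data.csv and plot a histogram")
  ]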

Metrics

Track for each run:

  • Success/failure
  • Number of think/execute cycles
  • Total tokens used
  • Total cost
  • Wall clock time
  • Types of errors encountered
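A per-run record for these could look like the sketch below (field names and types are placeholders):

import Data.Text (Text)
import Data.Time.Clock (NominalDiffTime)

data RunMetrics = RunMetrics
  { rmTask      :: Text
  , rmSuccess   :: Bool
  , rmCycles    :: Int              -- think/execute iterations used
  , rmTokens    :: Int              -- total prompt + completion tokens
  , rmCostUSD   :: Double
  , rmWallClock :: NominalDiffTime
  , rmErrors    :: [Text]           -- error kinds encountered, in order
  } deriving (Show)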

Questions to Answer

1. Completeness: What fraction of tasks succeed?
2. Efficiency: How does cost compare to tool-based?
3. Reliability: Does it fail gracefully or catastrophically?
4. Self-correction: Can it recover from errors?
5. Creativity: Does it invent interesting solutions?

Deliverables

1. Omni/Agent/Experiments/CodeOnly.hs - minimal implementation
2. Omni/Agent/Experiments/CodeOnly/Sandbox.hs - sandbox execution
3. Omni/Agent/Experiments/CodeOnly/Benchmark.hs - test harness
4. Omni/Agent/Experiments/CodeOnly/RESULTS.md - findings

Success Criteria

  • [ ] Agent completes >80% of simple tasks
  • [ ] Agent completes >50% of medium tasks
  • [ ] Cost is within 2x of tool-based agent for same tasks
  • [ ] Clear understanding of failure modes
  • [ ] Recommendation: adopt as primitive, or not

Non-Goals

  • Full integration with Op (that comes later if this works)
  • Production-ready sandbox (just needs to be safe enough for experiments)
  • Beautiful code (this is a spike)

Time Box

2-3 days of work. If it's not showing promise by then, write up findings and move on.

If Successful

This informs the Op architecture:

  • Think + Execute become the core primitives
  • Tools become optional (sugar for common patterns)
  • Parallel compute units become the scaling model

If Unsuccessful

Document why:

  • What tasks failed and why
  • Whether failures are fundamental or fixable
  • Recommendations for hybrid approach (code + tools)

Files to Create

  • Omni/Agent/Experiments/CodeOnly.hs
  • Omni/Agent/Experiments/CodeOnly/Sandbox.hs
  • Omni/Agent/Experiments/CodeOnly/Benchmark.hs
  • Omni/Agent/Experiments/CodeOnly/RESULTS.md

Timeline (2)

🔄 [human] Open → InProgress · 1 month ago
🔄 [human] InProgress → Done · 1 month ago