Experimental spike to test the Think + Execute primitive as the foundation for agent computation.
A minimal agent with only two operations - Think (LLM generates code) and Execute (run the code) - can accomplish arbitrary tasks by inventing its own tools on the fly. This may be a more fundamental primitive than our current tool-based approach.
1. Validate that code-only agents can solve real tasks
2. Measure cost/latency vs tool-based agents
3. Identify failure modes and limitations
4. Determine if this should inform the Op architecture
Create a standalone experiment (not integrated with Op yet):
-- Omni/Agent/Experiments/CodeOnly.hs
data CodeResult
  = Success Text     -- stdout
  | Error Text       -- stderr
  | Timeout
  | SandboxViolation
-- The two primitives
think :: Model -> Prompt -> IO Text
think model prompt = do
  response <- Provider.chat model [] [Message User prompt]
  pure (content response)
execute :: Text -> IO CodeResult
execute code = runInSandbox defaultSandbox "python" code
-- The loop
codeOnlyAgent :: Text -> IO Text
codeOnlyAgent task = loop task 0
  where
    maxIterations = 20
    loop task iteration
      | iteration >= maxIterations = pure "Max iterations exceeded"
      | otherwise = do
          -- Think: generate code to accomplish task
          code <- think sonnet $ mconcat
            [ "Task: ", task
            , "\n\nWrite Python code to accomplish this task."
            , "\nPrint the final answer to stdout."
            , "\nOnly output code, no explanation."
            ]
          -- Execute: run it
          result <- execute code
          case result of
            Success output -> pure output
            Error err ->
              -- Think again with error context
              loop (task <> "\n\nPrevious attempt failed with: " <> err) (iteration + 1)
            Timeout ->
              loop (task <> "\n\nPrevious attempt timed out. Try a more efficient approach.") (iteration + 1)
            SandboxViolation -> pure "Task requires capabilities outside sandbox"
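One practical wrinkle: even when told "only output code", models often wrap the reply in markdown fences, so the loop probably wants an extraction step between think and execute. A minimal sketch; extractCode is a hypothetical helper, not part of the design above:

import qualified Data.Text as Text

-- Hypothetical helper: strip a leading markdown fence (and everything
-- from the closing fence onward) from the model's reply.
extractCode :: Text -> Text
extractCode raw =
  case Text.lines (Text.strip raw) of
    (first : rest)
      | "```" `Text.isPrefixOf` first ->
          Text.unlines (takeWhile (not . Text.isPrefixOf "```") rest)
    _ -> raw

The loop would then call execute (extractCode code) instead of execute code.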
Use a real sandbox (not just trust the model):
data Sandbox = Sandbox
  { sbTimeout :: Seconds        -- max 30s per execution
  , sbMemory :: Megabytes       -- max 512MB
  , sbNetwork :: Bool           -- allow network? (start with False)
  , sbFilesystem :: [FilePath]  -- allowed paths (start with temp dir only)
  }
defaultSandbox :: Sandbox
defaultSandbox = Sandbox 30 512 False ["/tmp/agent-workspace"]
runInSandbox :: Sandbox -> Text -> Text -> IO CodeResult
runInSandbox sb lang code = do
  -- Option 1: Use nsjail/firejail
  -- Option 2: Use Docker with resource limits
  -- Option 3: Use Nix sandbox
  -- Start simple: Docker with --memory, --cpus, --network none
  ...
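A sketch of the Docker route (Option 2), assuming Seconds and Megabytes are plain Int aliases; the image tag, CPU limit, and mount scheme are placeholder choices, not decisions:

import qualified Data.Text as Text
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)
import System.Timeout (timeout)

runInDocker :: Sandbox -> Text -> IO CodeResult
runInDocker sb code = do
  let args =
        ["run", "--rm", "--memory", show (sbMemory sb) <> "m", "--cpus", "1"]
          <> (if sbNetwork sb then [] else ["--network", "none"])
          -- Bind-mount only the allowed paths into the container.
          <> concatMap (\p -> ["-v", p <> ":" <> p]) (sbFilesystem sb)
          <> ["python:3.12-slim", "python", "-c", Text.unpack code]
  -- Note: timeout kills our waiting thread, not the container; a real
  -- version would name the container and docker kill it on timeout.
  result <- timeout (sbTimeout sb * 1000000) $
    readProcessWithExitCode "docker" args ""
  pure $ case result of
    Nothing -> Timeout
    Just (ExitSuccess, out, _) -> Success (Text.pack out)
    Just (ExitFailure _, _, err) -> Error (Text.pack err)

Detecting SandboxViolation (e.g. an attempted network call) would need more plumbing than this; the sketch only distinguishes success, error, and timeout.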
Run the code-only agent on a variety of tasks:
Simple (should work):
Medium (interesting):
Hard (might fail, that's ok):
Comparison tasks (same task, code-only vs tool-based):
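For concreteness, the harness could represent the suite as plain data. The tasks below are purely illustrative placeholders, not the chosen benchmark:

-- Illustrative only; the real task list is TBD.
benchmarkTasks :: [(Text, Text)] -- (tier, task prompt)
benchmarkTasks =
  [ ("simple", "What is the 1000th prime number?")
  , ("medium", "Parse these dates and print the latest: 2024-01-31, 2024-02-29, 2023-12-05.")
  , ("hard", "Estimate pi to 10 digits using a series of your choice.")
  ]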
Track for each run:
1. Completeness: What fraction of tasks succeed?
2. Efficiency: How does cost compare to tool-based?
3. Reliability: Does it fail gracefully or catastrophically?
4. Self-correction: Can it recover from errors?
5. Creativity: Does it invent interesting solutions?
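One possible per-run record for Benchmark.hs to answer these questions; every field name here is an assumption about what gets tracked:

-- Hypothetical per-run record; field names are illustrative.
data RunRecord = RunRecord
  { runTask :: Text
  , runSucceeded :: Bool
  , runIterations :: Int          -- think/execute round trips used
  , runTokens :: Int              -- prompt + completion tokens (for cost)
  , runWallClock :: Double        -- seconds, end to end
  , runFailureMode :: Maybe Text  -- error, timeout, or sandbox violation
  }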
1. Omni/Agent/Experiments/CodeOnly.hs - minimal implementation
2. Omni/Agent/Experiments/CodeOnly/Sandbox.hs - sandbox execution
3. Omni/Agent/Experiments/CodeOnly/Benchmark.hs - test harness
4. Omni/Agent/Experiments/CodeOnly/RESULTS.md - findings
2-3 days of work. If it's not showing promise by then, write up findings and move on.
If the primitive proves out, this informs the Op architecture. If it does not, document why.