Wire up Docker sandbox for code-only agents

t-369.18 · WorkTask · Omni/Agent.hs
Parent: t-369 · Created 1 month ago · Updated 1 month ago

Dependencies

Description

Integrate the code-only agent with Agentd's Docker-based sandboxing.

Context

The code-only spike (t-369.17) validated that Think + Execute works, but it runs Python directly on the host with no sandboxing. This is a security risk: model-generated code can read host files, reach the network, and so on.

Agentd already has Docker-based execution. We need to wire the code-only agent to use it.

Current State

CodeOnly.hs executes Python directly:

execute :: ExperimentConfig -> Text -> IO CodeResult
execute config code = do
  -- Writes to temp file, runs python3 directly
  Process.readCreateProcessWithExitCode pythonCmd ""

Agentd has container support:

-- From Agentd.hs
toolchainImage :: Text -> Text
toolchainImage = \case
  "base" -> "agent-base:latest"
  "python" -> "agent-python:latest"
  ...

Deliverables

1. Create sandboxed execute function

-- In CodeOnly.hs or new Sandbox.hs
data SandboxConfig = SandboxConfig
  { sbImage :: Text           -- Docker image
  , sbTimeout :: Int          -- Seconds
  , sbMemory :: Text          -- e.g., "512m"
  , sbNetwork :: Bool         -- Allow network?
  , sbMounts :: [(FilePath, FilePath, Bool)]  -- (host, container, rw?)
  }

defaultSandbox :: SandboxConfig
defaultSandbox = SandboxConfig
  { sbImage = "python:3.11-slim"
  , sbTimeout = 30
  , sbMemory = "512m"
  , sbNetwork = False
  , sbMounts = []
  }

executeSandboxed :: SandboxConfig -> Text -> IO CodeResult
executeSandboxed config code = do
  -- Write code to temp file
  -- Build docker command:
  -- docker run --rm --network none --memory 512m \
  --   -v /tmp/code.py:/code.py:ro \
  --   python:3.11-slim python /code.py
  -- Parse output
  ...
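
The "write code to temp file" step could look roughly like the sketch below. withCodeFile is a hypothetical helper name, not something taken from CodeOnly.hs, and it assumes only the base, text, and directory packages. It creates a fresh file in the system temp directory, hands its path to an action (for example the docker invocation), and removes the file afterwards even if the action throws.

```haskell
import Control.Exception (bracket)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO (hClose, openTempFile)

-- Hypothetical helper: write the generated code to a fresh temp file,
-- run an action with its path, and delete the file afterwards.
withCodeFile :: T.Text -> (FilePath -> IO a) -> IO a
withCodeFile code action = do
  tmpDir <- getTemporaryDirectory
  bracket (openTempFile tmpDir "code.py") cleanup $ \(path, h) -> do
    TIO.hPutStr h code
    hClose h -- flush and close before anything else reads the file
    action path
  where
    -- hClose is a no-op on an already-closed handle
    cleanup (path, h) = hClose h >> removeFile path
```

A unique file per run avoids the fixed /tmp/code.py path clashing when several executions overlap. openTempFile creates the file readable by its owner only, which is worth checking if the container is later run as a non-root user.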

2. Docker command construction

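buildDockerCmd :: SandboxConfig -> FilePath -> [String]
buildDockerCmd config codePath =
  [ "run", "--rm"
  , "--network", if sbNetwork config then "bridge" else "none"
  , "--memory", Text.unpack (sbMemory config)
  , "--cpus", "1"
  , "--pids-limit", "100"  -- prevent fork bombs
    -- codePath must be absolute; Docker treats a relative -v source as a named volume
  , "-v", codePath <> ":/code.py:ro"
  ] ++ mountArgs ++
  [ Text.unpack (sbImage config)
  , "python", "/code.py"
  ]
  where
    -- One "-v host:container:mode" pair per entry in sbMounts
    mountArgs = concatMap mountArg (sbMounts config)
    mountArg (host, container, rw) =
      ["-v", host <> ":" <> container <> (if rw then ":rw" else ":ro")]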

3. Timeout handling

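Wrap the docker command in a Haskell-side timeout. Note that docker run has no wall-clock timeout flag of its own, and terminating the docker CLI process does not necessarily stop the container, so a timed-out container may also need to be stopped explicitly (for example with docker kill):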

executeSandboxed config code = do
  let timeoutUs = sbTimeout config * 1_000_000  -- microseconds; the underscore literal needs NumericUnderscores
  result <- Timeout.timeout timeoutUs $ do
    Process.readCreateProcessWithExitCode dockerCmd ""
  case result of
    Nothing -> pure CodeTimeout
    Just (exitCode, stdout, stderr) -> ...
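
One way to make sure a timed-out container is actually stopped is to give it a name and kill it by that name when the timeout fires. The sketch below is a minimal version under stated assumptions: Outcome, withContainerName, and runWithTimeout are hypothetical names (the real code would return CodeResult), and the caller is assumed to supply a container name that is unique per run.

```haskell
import System.Exit (ExitCode)
import System.Process (proc, readCreateProcessWithExitCode, readProcessWithExitCode)
import System.Timeout (timeout)

-- Hypothetical stand-in for CodeResult.
data Outcome
  = TimedOut
  | Finished ExitCode String String -- exit code, stdout, stderr
  deriving (Show, Eq)

-- Insert "--name <name>" right after "run" so the container can be
-- addressed later. Argument lists that do not start with "run" are
-- returned unchanged.
withContainerName :: String -> [String] -> [String]
withContainerName name ("run" : rest) = "run" : "--name" : name : rest
withContainerName _ args = args

-- Run "docker <args>" with a wall-clock limit in seconds. On timeout,
-- kill the named container explicitly instead of relying on the
-- docker CLI process being terminated.
runWithTimeout :: Int -> String -> [String] -> IO Outcome
runWithTimeout seconds name args = do
  let cmd = proc "docker" (withContainerName name args)
  result <- timeout (seconds * 1000000) (readCreateProcessWithExitCode cmd "")
  case result of
    Just (code, out, err) -> pure (Finished code out err)
    Nothing -> do
      -- Exit status ignored: the container may already be gone.
      _ <- readProcessWithExitCode "docker" ["kill", name] ""
      pure TimedOut
```

With --rm in the argument list, the killed container is also removed, so nothing is left behind after a timeout.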

4. Update CodeOnly.hs to use sandbox

codeOnlyAgent :: ExperimentConfig -> Provider.Provider -> Text -> IO RunResult
codeOnlyAgent config provider task = do
  -- Use sandboxed execution
  let sandbox = defaultSandbox { sbTimeout = ecExecuteTimeout config }
  ...
  result <- executeSandboxed sandbox code

5. Re-run benchmarks with sandbox

Verify the spike results still hold with sandboxing:

  • Simple benchmark: should still be 100%
  • Medium benchmark: should still be 100%
  • Network test: should now fail (network disabled)
  • File read test: reading host files should now fail (no host paths mounted)

Testing

  • [ ] Docker sandbox executes Python correctly
  • [ ] Timeout kills runaway code
  • [ ] Memory limit contains runaway allocation (container is killed, host unaffected)
  • [ ] Network disabled by default
  • [ ] Fork bomb prevented
  • [ ] Benchmarks pass with sandbox

Notes

  • May need to build/pull a Python image first
  • Consider caching the image pull
  • Log sandbox violations for debugging
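
For logging sandbox violations, one option is to classify the exit code of docker run before reporting a result. The sketch below is an assumption about how that could be done, not existing Agentd behaviour; SandboxVerdict and classifyExit are hypothetical names. It relies on two conventions: exit code 137 is 128 + 9, meaning the process received SIGKILL (typical for an OOM kill or a docker kill), and docker run uses 125 when the docker command itself fails.

```haskell
import System.Exit (ExitCode (..))

-- Hypothetical classification of a docker run exit code.
data SandboxVerdict
  = RanOk           -- exit code 0
  | KilledBySigkill -- 137: memory limit hit or container killed
  | DockerError     -- 125: docker run itself failed (bad flag, missing image, ...)
  | CodeFailed Int  -- any other non-zero exit from the Python process
  deriving (Show, Eq)

classifyExit :: ExitCode -> SandboxVerdict
classifyExit ExitSuccess = RanOk
classifyExit (ExitFailure 137) = KilledBySigkill
classifyExit (ExitFailure 125) = DockerError
classifyExit (ExitFailure n) = CodeFailed n
```

The exit code alone cannot distinguish an OOM kill from a timeout kill, so the caller would still need to record which path triggered the kill.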

Files

  • Omni/Agent/Experiments/CodeOnly.hs (update)
  • Omni/Agent/Sandbox.hs (new, optional)
  • Omni/Agentd.hs (reference for existing Docker code)

Timeline (2)

🔄 [human] Open → InProgress · 1 month ago
🔄 [human] InProgress → Done · 1 month ago