Persist agent session state across persistent-agent restarts

t-793·WorkTask·
·
·
Created4 days ago·Updated4 days ago·pipeline runs →

Dependencies

Description

Edit

Persistent agent sessions currently lose in-memory conversation state when the process/service restarts, because agentd only persists event logs and forwards new prompts over FIFO. Design + implement restart-safe session state so persistent agents resume with prior context, not a fresh conversation. Scope should cover agent runtime state snapshotting, restore path on startup, compatibility/version checks, and recovery behavior when snapshot is missing/corrupt.

Git Commits

a711c7bffeat(agentd): restore persistent sessions from latest checkpoint
Coder Agent4 days ago3 files

Timeline (8)

🔄[human]Open → InProgress4 days ago
💬[human]4 days ago

Proposed design (MVP): use existing agent checkpoint/resume plumbing to persist full AgentState per persistent session and auto-restore on restart.

Why this approach:

  • Replaying DB messages is lossy (assistant entries are summaries only) and can re-trigger tool side effects.
  • AgentState already contains full conversation/messages and is serializable.

Design: 1) Persist a stable session snapshot every completed turn

  • In Omni.Agent.Programs.Agent.runAgent, emit a final checkpoint name after each turn completes (success/error/cancel/max-iter paths).
  • This writes with full AgentState.

2) Auto-restore on persistent process startup

  • In agentd persistent wrapper (renderAgentExecScript), set per-agent checkpoint dir:
  • If exists, pass plus to .
  • If missing/corrupt, agent logs warning and starts fresh (current behavior fallback).

3) Keep lifecycle semantics simple

  • Restart should preserve prior conversation automatically.
  • Remove/purge should also remove checkpoint directory for that agent.

4) Hardening (follow-up)

  • Validate checkpoint model compatibility ( vs runtime model) before restore; if mismatch, ignore checkpoint and start fresh.
  • Optional to bypass restore intentionally.

Scope/files likely touched:

  • Omni/Agent/Programs/Agent.hs (final checkpoint)
  • Omni/Agentd/Daemon.hs (persistent wrapper args + cleanup)
  • Omni/Agent.hs (optional resume compatibility guard)
  • tests in Agent/Agentd modules for checkpoint+resume wiring

Acceptance test:

  • send message A -> response
  • restart persistent agent
  • send message B that depends on A context
  • verify response references A (context survived restart).
💬[human]4 days ago

Design note (corrected):

Proposed MVP: use existing checkpoint/resume plumbing to persist full AgentState for each persistent agent and auto-restore on restart.

Why:

  • Replaying DB messages is lossy (assistant entries are summaries) and may re-trigger tool side effects.
  • AgentState already contains full conversation messages and is serializable.

Plan: 1) Persist stable snapshot each completed turn

  • In Omni.Agent.Programs.Agent.runAgent, write a final checkpoint named session-latest after each turn completes (success/error/cancel/max-iter).
  • Snapshot path becomes <checkpoint-dir>/session-latest.json.

2) Auto-restore on startup in persistent wrapper

  • In renderAgentExecScript, set per-agent checkpoint dir under $STATE_DIR/checkpoints/$AGENTD_AGENT_NAME.
  • Pass --checkpoint-dir always.
  • If session-latest.json exists, pass --resume with that file.
  • If missing/corrupt, fallback to fresh state (existing behavior).

3) Lifecycle

  • restart preserves context automatically.
  • remove/purge should also delete that agent checkpoint dir.

4) Hardening (follow-up)

  • Validate cpModel vs runtime model before restore.
  • Optional restart fresh mode to bypass restore intentionally.

Likely files:

  • Omni/Agent/Programs/Agent.hs
  • Omni/Agentd/Daemon.hs
  • Omni/Agent.hs (optional model-compat guard)

Acceptance test:

  • send prompt A
  • restart agent
  • send prompt B that depends on A
  • response should reference A (context survived restart)
💬[human]4 days ago

Implemented restart-safe persistent session restore in commit a711c7bf.

What changed:

  • Omni/Agent/Programs/Agent.hs: runAgent now checkpoints 'session-latest' at end of each turn on both success and budget-exhausted paths.
  • Omni/Agentd/Daemon.hs: persistent wrapper now creates per-agent checkpoint dir, always passes --checkpoint-dir, and conditionally passes --resume session-latest.json when present.
  • Omni/Agentd/Daemon.hs: rm/purge runtime cleanup now removes per-agent checkpoint directory.
  • Omni/Agentd/SPEC.md: documented checkpoint/resume behavior.

Validation:

  • typecheck.sh Omni/Agent/Programs/Agent.hs
  • typecheck.sh Omni/Agentd/Daemon.hs
  • bild --test Omni/Agent/Programs/Agent.hs
  • bild --test Omni/Agentd/Daemon.hs

Manual acceptance test (live): 1) create/start temporary persistent agent resume-smoke 2) send prompt A: remember APPLE-739 3) observed checkpoint files: init.json + session-latest.json under ~/.local/state/agentd-agents/checkpoints/resume-smoke/ 4) restart agent; verified process argv contains --resume .../session-latest.json 5) send prompt B asking for remembered code; assistant returned APPLE-739 6) agentd rm resume-smoke removed env/fifo/checkpoint runtime artifacts

🔄[human]InProgress → Review4 days ago