agentd status misreports crash-looping persistent agent as running

t-784 · WorkTask · Created 5 days ago · Updated 5 days ago

Dependencies

Description

Observed while debugging intent-coder: the systemd service was in auto-restart with exit code 127, but agentd status reported status=running and systemd=active:

{"cost_cents":null,"cwd":"/home/ben/omni/intent","error":null,"mode":"persistent","model":"gpt-5.3-codex","provider":"openai","run_id":"intent-coder","status":"running","summary":null,"systemd":"active"}

This hides broken agents behind a healthy status and delays detection. Repro evidence from the journal: repeated exits with status=127 and a restart counter that kept incrementing. The status command should report failed/degraded when the service is in a restart loop or its last exit was non-zero.

Timeline (8)

💬 [human] 5 days ago

Clarified repro details (the prior create command lost snippets due to shell backtick expansion). At ~12:05 EDT, the unit was in a restart loop and the journal repeatedly logged exits with status=127. Unit status output as pasted:

● agentd-agent@intent-coder.service - Agentd Agent (intent-coder)
     Loaded: loaded (/home/ben/.config/systemd/user/agentd-agent@.service; disabled; preset: ignored)
     Active: active (running) since Tue 2026-04-14 12:07:18 EDT; 1min 58s ago
 Invocation: 1a769b65634e4396aaab4540ee118053
   Main PID: 3494671 (.agent-wrapped)
      Tasks: 5 (limit: 154399)
     Memory: 2.2M (peak: 2.5M)
        CPU: 67ms
     CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/app-agentd\x2dagent.slice/agentd-agent@intent-coder.service
             └─3494671 /home/ben/omni/live/_/bin/agent --provider openai --model gpt-5.3-codex --run-id intent-coder --json

Apr 14 12:07:18 beryllium systemd[4047]: Started Agentd Agent (intent-coder).

During the same window, agentd status returned status=running and systemd=active:

{"cost_cents":null,"cwd":"/home/ben/omni/intent","error":null,"mode":"persistent","model":"gpt-5.3-codex","provider":"openai","run_id":"intent-coder","status":"running","summary":null,"systemd":"active"}

Expected: degraded/failed/crash-loop status.

💬 [human] 5 days ago

Clean repro summary: intent-coder was restart-looping with exit code 127 because the agent executable was not found. The user systemd status showed Active=activating auto-restart, and the journal showed 'exec: agent: not found'. At the same time, agentd status for the same agent returned status=running and systemd=active in its JSON output. This should be reported as failed/degraded (or restart-loop) rather than running.

🔄 [human] Open → InProgress 5 days ago
💬 [human] 5 days ago

Implemented the fix in Omni/Agentd/Daemon.hs. querySystemdState now reads the systemd properties ActiveState, SubState, and Result via systemctl show and derives the state from that snapshot, instead of relying on the text output of systemctl is-active alone. Added restart-loop detection: SubState=auto-restart now maps to failed, and a non-success Result in a non-running state also maps to failed. Added unit tests for restart-loop -> failed, active/running -> active, and inactive+failure -> failed. Verification: typecheck.sh Omni/Agentd/Daemon.hs passes; bild --test Omni/Agentd/Daemon.hs passes; the locally built agentd now reports sd-coder and intent-coder-bad as failed when they are in an auto-restart loop (status and systemd both failed).
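
For readers who want to see the shape of that mapping, here is a minimal sketch. It is not the code from Omni/Agentd/Daemon.hs, which this ticket does not show. The SystemdState type and the parseProps and deriveState functions are hypothetical names, and the three sample snapshots are illustrative values in the KEY=VALUE form that systemctl show prints, not captured output. The exact systemctl flags used by the real fix are also an assumption.

```haskell
module Main where

import Data.Maybe (fromMaybe)

-- Hypothetical state type; the real type in Daemon.hs may differ.
data SystemdState = SdActive | SdInactive | SdFailed | SdUnknown
  deriving (Eq, Show)

-- Parse KEY=VALUE lines, as printed by something like
--   systemctl --user show UNIT --property=ActiveState,SubState,Result
-- (the exact invocation is an assumption). Lines without '=' are skipped.
parseProps :: String -> [(String, String)]
parseProps out =
  [ (key, drop 1 rest)
  | line <- lines out
  , let (key, rest) = break (== '=') line
  , not (null rest)
  ]

-- Derive one state from a property snapshot. Guards are checked in order.
deriveState :: [(String, String)] -> SystemdState
deriveState props
  -- Restart loop: systemd is waiting to restart a unit that just exited.
  | sub == "auto-restart" = SdFailed
  -- Healthy: the Result check below is deliberately skipped while running.
  | active == "active" && sub == "running" = SdActive
  | active == "failed" = SdFailed
  -- Not running, and the last run did not end in success.
  | result `notElem` ["", "success"] = SdFailed
  | active == "inactive" = SdInactive
  -- Anything else (for example activating/start) is left undetermined here.
  | otherwise = SdUnknown
  where
    get key = fromMaybe "" (lookup key props)
    active = get "ActiveState"
    sub = get "SubState"
    result = get "Result"

main :: IO ()
main =
  mapM_
    (print . deriveState . parseProps)
    [ "ActiveState=activating\nSubState=auto-restart\nResult=exit-code\n"
    , "ActiveState=active\nSubState=running\nResult=success\n"
    , "ActiveState=inactive\nSubState=dead\nResult=exit-code\n"
    ]
```

The three inputs in main mirror the unit tests listed above (restart-loop, active/running, inactive+failure). Putting the auto-restart guard first matters because a crash-looping unit can look healthy for a moment between restarts, which is the misreport this ticket describes.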

🔄 [human] InProgress → Review 5 days ago