Harden subagent system for reliability

t-354 · WorkTask · Created 1 month ago · Updated 1 month ago

Description

Problem

Subagents are currently unreliable, yet they need to be one of the most reliable parts of the system. The current architecture has:

  • Separate worker process (ava-worker) polling a DB job queue
  • Complex async flow with pending spawns, approval callbacks, etc.
  • No unified interface between CLI agent and Telegram-spawned subagents

Goal

There should be basically zero difference between spawning an agent via Omni/Agent.hs CLI and spawning an agent via Ava in Telegram. A subagent is "just" a call to the agent (in Ava we can use the Haskell interface, not the CLI).

Desired Telegram UX

1. User tells Ava to spawn a subagent.
2. Ava sends a message describing the subagent, with Approve/Reject buttons.
3. After approval, a special subagent status message is sent. This message's text is updated in place with the latest status from the subagent (phase, current activity, cost).
4. Upon completion, a final message is sent with success/failure details, summary, cost, and duration.

Current Architecture (what we have)

  • Omni/Agent.hs - CLI agent runner with runAgent function
  • Omni/Agent/Engine.hs - Core agent loop
  • Omni/Agent/Provider.hs - LLM API calls
  • Omni/Agent/Subagent.hs - Subagent types, spawn tools, pending spawn registry
  • Omni/Agent/Subagent/Coder.hs - Hardened coder with init/verify/commit phases
  • Omni/Agent/Subagent/Jobs.hs - DB-backed job queue
  • Omni/Agent/Subagent/Worker.hs - Separate process polling job queue
  • Omni/Agent/Telegram.hs - Bot with approval button callbacks
  • Omni/Agent/Telegram/Actions.hs - Button action handlers

Key Issues to Address

1. Unify agent spawning

  • CLI runAgent and subagent runSubagentWithCallbacks should share the same core
  • Same guardrails (cost, token, iteration limits)
  • Same error handling and retry logic
  • Same status reporting interface
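One way to read "same guardrails" is a single limit check that every entry point calls, whether the run started from the CLI or from Telegram. The sketch below is only an illustration of that idea: the type and function names (Guardrails, Usage, Violation, checkGuardrails) are hypothetical and are not taken from Omni/Agent.hs.

```haskell
-- Hypothetical names throughout; none of these are the real Omni/Agent types.

-- | Limits shared by every agent run, however it was spawned.
data Guardrails = Guardrails
  { maxCostCents :: Int
  , maxTokens :: Int
  , maxIterations :: Int
  }
  deriving (Show, Eq)

-- | Running totals the agent loop would update after each step.
data Usage = Usage
  { usedCostCents :: Int
  , usedTokens :: Int
  , usedIterations :: Int
  }
  deriving (Show, Eq)

-- | Which limit was hit.
data Violation
  = CostExceeded
  | TokensExceeded
  | IterationsExceeded
  deriving (Show, Eq)

-- | Return the first limit that has been exceeded, if any.
-- Limits are inclusive: usage equal to the limit is still allowed.
checkGuardrails :: Guardrails -> Usage -> Maybe Violation
checkGuardrails g u
  | usedCostCents u > maxCostCents g = Just CostExceeded
  | usedTokens u > maxTokens g = Just TokensExceeded
  | usedIterations u > maxIterations g = Just IterationsExceeded
  | otherwise = Nothing
```

Because the check is a pure function, the CLI path and the subagent path cannot drift apart as long as both call it, and it is a natural target for the property tests listed under the testing strategy.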

2. Fix status message updates

Currently the Worker queues messages to scheduled_messages, so each status change arrives as a new message. Instead we need to:

  • Create initial status message with telegram_id stored
  • Use editMessageText to update that message as status changes
  • Final completion message (separate message, or edit the status message)
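The send-once-then-edit behaviour described in the list above can be sketched as a pure decision: if no message id has been stored yet, send a new message; otherwise edit the stored one. This is a minimal sketch under assumptions: Status, TelegramCall, renderStatus and statusCall are hypothetical names, ids are modelled as plain Int, and cost is assumed to be a non-negative number of cents. Only the editMessageText method name comes from the task itself.

```haskell
-- Hypothetical types; the real bot code in Omni/Agent/Telegram.hs may differ.

-- | What the subagent reports: phase, current activity, cost so far.
data Status = Status
  { phase :: String
  , activity :: String
  , costCents :: Int -- assumed non-negative
  }
  deriving (Show, Eq)

-- | A Telegram API call we want the bot to make, described as data.
data TelegramCall
  = SendMessage Int String -- chat id, text
  | EditMessageText Int Int String -- chat id, message id, text
  deriving (Show, Eq)

-- | Format cents as dollars with two decimal places.
showCents :: Int -> String
showCents c = show (c `div` 100) ++ "." ++ pad (c `mod` 100)
  where
    pad n = if n < 10 then '0' : show n else show n

-- | Text of the status message.
renderStatus :: Status -> String
renderStatus s =
  "Phase: " ++ phase s
    ++ "\nActivity: " ++ activity s
    ++ "\nCost: $" ++ showCents (costCents s)

-- | Send on the first update, edit the stored message afterwards.
statusCall :: Int -> Maybe Int -> Status -> TelegramCall
statusCall chatId Nothing s = SendMessage chatId (renderStatus s)
statusCall chatId (Just msgId) s = EditMessageText chatId msgId (renderStatus s)
```

Describing the call as data keeps the decision testable without a network; the caller would execute the call and, after the first SendMessage, store the returned message id for later edits.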

3. Simplify the spawn flow

Current: spawn_coder tool → pending spawn → approval button → Jobs.createJob → Worker polls → runs → queues result

Desired: Keep approval button, but after approval run in-process with live status updates

4. Add comprehensive error handling

  • Provider API failures (timeout, rate limit, server error) - already added retry
  • Build/test failures in Coder - retry with fixes
  • Subagent infinite loops / stuck states - detect and terminate
  • Clean error messages back to user
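Two of the items above, retrying transient failures and detecting stuck states, can be sketched with base alone. This is an illustrative sketch, not the retry logic the task says was already added: retryWithBackoff and looksStuck are hypothetical names, and "stuck" is assumed here to mean "the last n recorded actions are identical", which is only one possible heuristic.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)

-- | Run an IO action up to 'attempts' times, doubling the delay after each
-- failure. Returns the last exception if every attempt fails.
-- Note: catching SomeException also catches asynchronous exceptions; a real
-- implementation would likely retry only on specific provider errors.
retryWithBackoff :: Int -> Int -> IO a -> IO (Either SomeException a)
retryWithBackoff attempts delayMicros action = do
  result <- try action
  case result of
    Right x -> pure (Right x)
    Left err
      | attempts <= 1 -> pure (Left err)
      | otherwise -> do
          threadDelay delayMicros
          retryWithBackoff (attempts - 1) (delayMicros * 2) action

-- | A run looks stuck when the n most recent actions are all the same.
-- The list is expected most-recent-first; n must be at least 2.
looksStuck :: Eq a => Int -> [a] -> Bool
looksStuck n recent = case take n recent of
  window@(x : _) -> n > 1 && length window == n && all (== x) window
  [] -> False
```

For example, `looksStuck 3 ["read", "read", "read", "edit"]` is True, while a history with fewer than three entries never counts as stuck.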

5. Testing strategy

  • Unit tests for each component
  • Integration test: mock LLM, run full flow
  • Property tests for guardrail enforcement

Open Questions

  • Should we keep the separate worker process, or run subagents in-process?
      • Pro worker: process isolation, can't crash main ava
      • Pro in-process: simpler, can use Telegram API directly for status updates
  • How to handle multiple concurrent subagents?
  • Should status updates be rate-limited to avoid Telegram API spam?
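If status updates are rate-limited, one possible shape is a throttle that sends the first update immediately, holds back anything that arrives too soon, and later flushes only the newest held update. The sketch below assumes a millisecond clock passed in as an Int and uses hypothetical names (Debounce, offer, flush); it is not the implementation from t-354.3.

```haskell
-- | Throttle state: when we last sent, and the newest unsent update if any.
-- Times are milliseconds from some hypothetical monotonic clock.
data Debounce a = Debounce
  { lastSentAt :: Maybe Int
  , pending :: Maybe a
  }
  deriving (Show, Eq)

-- | Initial state: nothing sent, nothing held back.
emptyDebounce :: Debounce a
emptyDebounce = Debounce Nothing Nothing

-- | Offer a new update at time 'now'. If the minimum gap has elapsed (or
-- nothing was sent yet) the update is returned for sending; otherwise it
-- replaces any held update and nothing is sent.
offer :: Int -> Int -> a -> Debounce a -> (Maybe a, Debounce a)
offer minGapMs now update st =
  case lastSentAt st of
    Just t | now - t < minGapMs -> (Nothing, st {pending = Just update})
    _ -> (Just update, Debounce (Just now) Nothing)

-- | Called from a timer: release the held update once the gap has elapsed.
flush :: Int -> Int -> Debounce a -> (Maybe a, Debounce a)
flush minGapMs now st =
  case (pending st, lastSentAt st) of
    (Just u, Just t) | now - t >= minGapMs -> (Just u, Debounce (Just now) Nothing)
    (Just u, Nothing) -> (Just u, Debounce (Just now) Nothing)
    _ -> (Nothing, st)
```

Keeping one such state per subagent would also cover concurrent subagents, since each status message gets its own independent budget of edits.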

Related Files

  • Omni/Agent/Subagent/DESIGN.md - Original design doc
  • Omni/Agent/Subagent/HARDENING.md - Hardening design doc

Child Tasks

  • t-354.1 - Worker sends Telegram messages directly [Done]
  • t-354.4 - Add subagent integration tests [Done]
  • t-354.3 - Debounced status message updates [Done]
  • t-354.2 - Unify Coder with Omni/Agent.hs core [Done]

Timeline (8)

🔄 [human] Open → InProgress · 1 month ago
💬 [human] 1 month ago

Progress on subagent hardening:

Completed:

  • t-447: Enhanced dev.md with init/recovery phases
  • t-451: Marked Engine.hs runtime functions as deprecated

Remaining subtasks:

  • t-448: Update Developer.hs to use dev.md workflow (key change)
  • t-449: Delete Coder.hs (after t-448 verified)
  • t-450: Move Worker.hs and Jobs.hs to Omni/Ava/

Architecture decision: Everything flows through Op free monad. Coder.hs becomes a markdown workflow (dev.md), not Haskell code.
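To illustrate what routing everything through a free monad buys, here is a minimal sketch with a hand-rolled Free type so it depends only on base. The instruction set (OpF with EmitStatus and CallModel), the workflow, and the pure interpreter are all assumptions made for illustration; the real Op type and its constructors are not shown in this task.

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- | A hand-rolled free monad, to avoid depending on an external package.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- | Hypothetical instruction set; not the real Op constructors.
data OpF next
  = EmitStatus String next
  | CallModel String (String -> next)
  deriving (Functor)

type Op = Free OpF

emitStatus :: String -> Op ()
emitStatus s = Free (EmitStatus s (Pure ()))

callModel :: String -> Op String
callModel prompt = Free (CallModel prompt Pure)

-- | A workflow written once, independent of who interprets it.
workflow :: String -> Op String
workflow task = do
  emitStatus "init"
  reply <- callModel task
  emitStatus "done"
  pure reply

-- | Pure interpreter for tests: a canned model and a collected status log.
runPure :: (String -> String) -> Op a -> ([String], a)
runPure _ (Pure a) = ([], a)
runPure model (Free (EmitStatus s next)) =
  let (logs, a) = runPure model next in (s : logs, a)
runPure model (Free (CallModel p k)) = runPure model (k (model p))
```

The same workflow value could be handed to an IO interpreter that calls a real provider and forwards each EmitStatus to Telegram, which is the sense in which CLI and bot share one core while differing only in the interpreter.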

💬 [human] 1 month ago

More progress:

Completed:

  • t-447: ✅ Enhanced dev.md with init/recovery phases
  • t-451: ✅ Marked Engine.hs runtime functions as deprecated
  • t-448: ✅ Developer.hs now uses dev.md workflow via Agent.runAgent

Key architectural changes:

1. Agent.AgentOptions now has optOnEvent callback for programmatic use
2. Developer.hs loads dev.md workflow, appends task, runs via Agent.runAgent
3. Events flow through Op free monad to callback for Telegram status updates
4. No more dependency on Coder.hs from Developer.hs
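The callback-in-options pattern described in these changes might look roughly like the sketch below. Only the names AgentOptions, optOnEvent and runAgent appear in this task; the field type, the AgentEvent constructors, the optTask field and the stub runner are assumptions made so the example is self-contained.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- | Hypothetical event type; the task only says that events reach a callback.
data AgentEvent
  = PhaseChanged String
  | CostUpdated Int
  | Finished Bool
  deriving (Show, Eq)

-- | Stand-in for Agent.AgentOptions. Only optOnEvent is named in the task;
-- its type and the optTask field are assumptions.
data AgentOptions = AgentOptions
  { optTask :: String
  , optOnEvent :: AgentEvent -> IO ()
  }

-- | Stand-in for Agent.runAgent that emits a fixed event sequence.
runAgentStub :: AgentOptions -> IO Bool
runAgentStub opts = do
  optOnEvent opts (PhaseChanged "init")
  optOnEvent opts (CostUpdated 3)
  optOnEvent opts (Finished True)
  pure True

-- | CLI-style caller: print each event as it happens.
cliOptions :: String -> AgentOptions
cliOptions task = AgentOptions task print

-- | Programmatic caller: collect events, newest first. A bot would instead
-- update its status message from the callback.
collectingOptions :: IORef [AgentEvent] -> String -> AgentOptions
collectingOptions ref task = AgentOptions task (\e -> modifyIORef' ref (e :))
```

A caller would create the IORef with newIORef, pass `collectingOptions ref "some task"` to the runner, and read the events back with readIORef; the runner itself does not know or care which kind of caller it has.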

Remaining:

  • t-449: Delete Coder.hs (now safe to do)
  • t-450: Move Worker.hs/Jobs.hs to Omni/Ava/

💬 [human] 1 month ago

All subtasks complete:

  • t-447 ✅ dev.md enhanced with init/recovery phases
  • t-448 ✅ Developer.hs uses dev.md via Agent.runAgent
  • t-449 ✅ Coder.hs deleted (was already gone)
  • t-450 ✅ Worker.hs/Jobs.hs moved to Omni/Ava/Worker/ (was already done)
  • t-451 ✅ Engine.hs runtime functions deprecated

Key achievements:

1. Unified agent spawning: CLI and Telegram use same Agent.runAgent path
2. Events flow through Op free monad with optOnEvent callback
3. Ava-specific code (Worker/Jobs) properly located in Omni/Ava/
4. dev.md workflow replaces hardcoded Coder.hs logic

🔄 [human] InProgress → Done · 1 month ago