t-354 - omni

t-354 · WorkTask

Created 1 month ago · Updated 1 month ago

Description

Problem

Subagents are currently unreliable, yet they need to be one of the most reliable parts of the system. The current architecture has:

Separate worker process (ava-worker) polling a DB job queue
Complex async flow with pending spawns, approval callbacks, etc.
No unified interface between CLI agent and Telegram-spawned subagents

Goal

There should be essentially no difference between spawning an agent via the Omni/Agent.hs CLI and spawning one via Ava in Telegram. A subagent is "just" a call to the agent (in Ava we can use the Haskell interface rather than the CLI).

Desired Telegram UX

1. User tells Ava to spawn a subagent
2. Ava sends a message describing the subagent with Approve/Reject buttons
3. After approval, a special subagent status message is sent. This message text gets updated in-place with the latest status from the subagent (phase, current activity, cost)
4. Upon completion, a final message is sent with success/failure details, summary, cost, duration

Current Architecture (what we have)

Omni/Agent.hs - CLI agent runner with runAgent function
Omni/Agent/Engine.hs - Core agent loop
Omni/Agent/Provider.hs - LLM API calls
Omni/Agent/Subagent.hs - Subagent types, spawn tools, pending spawn registry
Omni/Agent/Subagent/Coder.hs - Hardened coder with init/verify/commit phases
Omni/Agent/Subagent/Jobs.hs - DB-backed job queue
Omni/Agent/Subagent/Worker.hs - Separate process polling job queue
Omni/Agent/Telegram.hs - Bot with approval button callbacks
Omni/Agent/Telegram/Actions.hs - Button action handlers

Key Issues to Address

1. Unify agent spawning

CLI runAgent and subagent runSubagentWithCallbacks should share the same core
Same guardrails (cost, token, iteration limits)
Same error handling and retry logic
Same status reporting interface
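One way to picture "same guardrails for both entry points" is a single limits record that is checked by one shared loop, regardless of whether the caller is the CLI or Ava. The sketch below is only an illustration: `Guardrails`, `Usage`, `Violation`, `checkGuardrails` and `runWithGuardrails` are hypothetical names, not the actual Omni/Agent types.

```haskell
-- Hypothetical types for illustration; not the real Omni/Agent API.
data Guardrails = Guardrails
  { maxCostCents  :: Double
  , maxTokens     :: Int
  , maxIterations :: Int
  } deriving (Show, Eq)

data Usage = Usage
  { usedCostCents  :: Double
  , usedTokens     :: Int
  , usedIterations :: Int
  } deriving (Show, Eq)

-- Each violation carries the observed value and the limit it exceeded.
data Violation
  = CostExceeded Double Double
  | TokensExceeded Int Int
  | IterationsExceeded Int Int
  deriving (Show, Eq)

-- | Return the first limit that the current usage exceeds, if any.
checkGuardrails :: Guardrails -> Usage -> Maybe Violation
checkGuardrails g u
  | usedCostCents u > maxCostCents g =
      Just (CostExceeded (usedCostCents u) (maxCostCents g))
  | usedTokens u > maxTokens g =
      Just (TokensExceeded (usedTokens u) (maxTokens g))
  | usedIterations u > maxIterations g =
      Just (IterationsExceeded (usedIterations u) (maxIterations g))
  | otherwise = Nothing

-- | Run a step function until it finishes (Left result) or a guardrail trips.
-- The step returns the updated usage (Right) when the agent wants to continue.
runWithGuardrails
  :: Monad m
  => Guardrails
  -> (Usage -> m (Either a Usage))
  -> Usage
  -> m (Either Violation a)
runWithGuardrails g step = go
  where
    go u = case checkGuardrails g u of
      Just v  -> pure (Left v)
      Nothing -> do
        r <- step u
        case r of
          Left done -> pure (Right done)
          Right u'  -> go u'
```

Because the limits are checked in one place, a CLI run and a Telegram-spawned run cannot drift apart in how they enforce them.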

2. Fix status message updates

Currently the Worker queues messages to scheduled_messages. What we need instead:

Create initial status message with telegram_id stored
Use editMessageText to update that message as status changes
Final completion message (separate message, or edit the status message)
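The send-once, edit-afterwards lifecycle described above could look roughly like the following. This is a sketch under assumptions: `Chat`, `MessageId`, `Status`, `renderStatus` and `newStatusReporter` are hypothetical names, and the `Chat` record stands in for whatever wraps the Telegram sendMessage and editMessageText calls in Ava. Identical text is skipped because Telegram rejects edits that do not change the message.

```haskell
import Data.IORef
import Text.Printf (printf)

-- Hypothetical status shape: the ticket lists phase, current activity and cost.
data Status = Status
  { phase     :: String
  , activity  :: String
  , costCents :: Double
  }

newtype MessageId = MessageId Int deriving (Show, Eq)

-- Stand-in for the Telegram layer (assumed interface, not the real one).
data Chat m = Chat
  { sendMsg :: String -> m MessageId
  , editMsg :: MessageId -> String -> m ()
  }

-- | Render the status as the text of the in-place updated message.
renderStatus :: Status -> String
renderStatus s =
  "Phase: " ++ phase s
    ++ "\nActivity: " ++ activity s
    ++ "\nCost: " ++ printf "$%.2f" (costCents s / 100)

-- | Send the initial status message, remember its id, and return an
-- update action that edits that same message when the text changes.
newStatusReporter :: Chat IO -> Status -> IO (Status -> IO ())
newStatusReporter chat initial = do
  let firstText = renderStatus initial
  mid <- sendMsg chat firstText
  lastText <- newIORef firstText
  pure $ \s -> do
    let txt = renderStatus s
    prev <- readIORef lastText
    if txt == prev
      then pure () -- nothing changed, skip the API call
      else do
        editMsg chat mid txt
        writeIORef lastText txt
```

Keeping the message id inside the closure means the caller only ever sees a `Status -> IO ()` callback, which is the same shape a CLI progress printer would have.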

3. Simplify the spawn flow

Current: spawn_coder tool → pending spawn → approval button → Jobs.createJob → Worker polls → runs → queues result
Desired: Keep approval button, but after approval run in-process with live status updates
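A minimal sketch of the desired "run in-process after approval" step, assuming the approval handler simply forks a thread. `spawnInProcess` and `Outcome` are hypothetical names; the real handler would pass the actual agent run as the `IO a` action and a Telegram completion message as the callback. Exceptions are caught so that a failing subagent cannot take down the main Ava process.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)
import Control.Exception (SomeException, finally, try)

-- Hypothetical result type for a finished subagent.
data Outcome a = Succeeded a | Failed String
  deriving (Show, Eq)

-- | Start the subagent on its own thread. The completion callback fires
-- once with the outcome; the returned action blocks until the run is done.
spawnInProcess :: IO a -> (Outcome a -> IO ()) -> IO (IO (Outcome a))
spawnInProcess run onDone = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    r <- try run
    let outcome = case r of
          Left e  -> Failed (show (e :: SomeException))
          Right a -> Succeeded a
    -- Always publish the outcome, even if the callback itself throws.
    onDone outcome `finally` putMVar done outcome
  pure (readMVar done)
```

This trades the process isolation of the worker for simplicity: there is no job table to poll, and the status callback can talk to Telegram directly.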

4. Add comprehensive error handling

Provider API failures (timeout, rate limit, server error) - already added retry
Build/test failures in Coder - retry with fixes
Subagent infinite loops / stuck states - detect and terminate
Clean error messages back to user
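The failure classes listed above can be sketched as one error type with a retry policy and a user-facing message, plus a simple stuck-state check. Everything here is a hypothetical illustration: `AgentError`, `isRetryable`, `userMessage` and `isStuck` are not existing Omni functions, and "the last n actions are identical" is just one possible heuristic for detecting a loop.

```haskell
-- Hypothetical error type covering the failure classes named in the ticket.
data AgentError
  = ProviderTimeout
  | RateLimited
  | ServerError Int
  | BuildFailed String
  | StuckLoop Int
  deriving (Show, Eq)

-- | Which failures are worth another attempt. Build failures are retried
-- because the ticket asks for "retry with fixes"; a stuck loop is not.
isRetryable :: AgentError -> Bool
isRetryable err = case err of
  ProviderTimeout -> True
  RateLimited     -> True
  ServerError _   -> True
  BuildFailed _   -> True
  StuckLoop _     -> False

-- | One short line suitable for sending back to the user.
userMessage :: AgentError -> String
userMessage err = case err of
  ProviderTimeout    -> "The model provider timed out."
  RateLimited        -> "The model provider is rate limiting requests."
  ServerError code   -> "The model provider returned server error " ++ show code ++ "."
  BuildFailed detail -> "Build or tests failed: " ++ takeWhile (/= '\n') detail
  StuckLoop n        -> "Subagent repeated the same action " ++ show n ++ " times and was stopped."

-- | True when the most recent n actions are all identical.
-- The history is expected newest-first.
isStuck :: Eq a => Int -> [a] -> Bool
isStuck n history
  | n < 2 = False
  | otherwise = case take n history of
      recent@(x : _) -> length recent == n && all (== x) recent
      []             -> False
```

For example, `isStuck 3` over a newest-first history of tool calls reports a loop once the same call has been issued three times in a row.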

5. Testing strategy

Unit tests for each component
Integration test: mock LLM, run full flow
Property tests for guardrail enforcement
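For the "mock LLM, run full flow" integration test, one option is a provider that replays a fixed script of replies. The sketch below assumes a much simpler provider interface than the real Omni/Agent/Provider.hs; `Reply`, `Provider`, `scriptedProvider` and `runLoop` are hypothetical names used only to show the shape of such a test.

```haskell
import Data.IORef (atomicModifyIORef', newIORef)

-- Hypothetical, simplified reply type: either a tool call or a final answer.
data Reply = ToolCall String | Final String
  deriving (Show, Eq)

-- Assumed provider shape: transcript in, next reply out.
type Provider = [String] -> IO Reply

-- | Build a provider that replays the scripted replies in order and
-- answers with a final message once the script runs out.
scriptedProvider :: [Reply] -> IO Provider
scriptedProvider script = do
  ref <- newIORef script
  pure $ \_ -> atomicModifyIORef' ref $ \rs -> case rs of
    []         -> ([], Final "script exhausted")
    (r : rest) -> (rest, r)

-- | A tiny agent loop with an iteration limit, standing in for the real core.
runLoop :: Int -> Provider -> IO (Either String String)
runLoop maxIters provider = go 0 []
  where
    go :: Int -> [String] -> IO (Either String String)
    go i transcript
      | i >= maxIters = pure (Left "iteration limit reached")
      | otherwise = do
          reply <- provider transcript
          case reply of
            Final answer  -> pure (Right answer)
            ToolCall name -> go (i + 1) (("tool:" ++ name) : transcript)
```

A script that never returns `Final` doubles as a guardrail test: the loop has to stop at the iteration limit instead of running forever.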

Open Questions

Should we keep the separate worker process, or run subagents in-process?
Pro worker: process isolation, can't crash main ava
Pro in-process: simpler, can use Telegram API directly for status updates
How to handle multiple concurrent subagents?
Should status updates be rate-limited to avoid Telegram API spam?
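If status updates do get rate-limited, the core decision can be kept pure and therefore easy to test. The following is a sketch only: `Throttle`, `offer` and `flush` are hypothetical names, time is passed in as plain milliseconds, and the minimum interval is a parameter rather than a known Telegram limit. Updates that arrive too soon replace any earlier pending update, so only the latest status is sent once the interval has passed.

```haskell
-- Hypothetical throttle state for one status message.
data Throttle = Throttle
  { minIntervalMs :: Int          -- minimum gap between two sends
  , lastSentMs    :: Maybe Int    -- when the last update went out
  , pending       :: Maybe String -- latest update that was held back
  } deriving (Show, Eq)

-- | Offer a new status text at time nowMs. Returns the new state and,
-- if the interval has elapsed, the text to send right now.
offer :: Int -> String -> Throttle -> (Throttle, Maybe String)
offer nowMs txt t = case lastSentMs t of
  Just lastMs
    | nowMs - lastMs < minIntervalMs t -> (t { pending = Just txt }, Nothing)
  _ -> (t { lastSentMs = Just nowMs, pending = Nothing }, Just txt)

-- | Called from a periodic tick: send the held-back update if it is due.
flush :: Int -> Throttle -> (Throttle, Maybe String)
flush nowMs t = case pending t of
  Just txt -> offer nowMs txt t
  Nothing  -> (t, Nothing)
```

With several concurrent subagents, each status message would carry its own `Throttle` value, so one busy subagent does not delay updates for another.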

Related Files

Omni/Agent/Subagent/DESIGN.md - Original design doc
Omni/Agent/Subagent/HARDENING.md - Hardening design doc

Child Tasks

t-354.1 - Worker sends Telegram messages directly [Done]
t-354.4 - Add subagent integration tests [Done]
t-354.3 - Debounced status message updates [Done]
t-354.2 - Unify Coder with Omni/Agent.hs core [Done]

Timeline (8)

🔄 [human] Open → InProgress · 1 month ago

💬 [human] 1 month ago

Progress on subagent hardening:

Completed:

t-447: Enhanced dev.md with init/recovery phases
t-451: Marked Engine.hs runtime functions as deprecated

Remaining subtasks:

t-448: Update Developer.hs to use dev.md workflow (key change)
t-449: Delete Coder.hs (after t-448 verified)
t-450: Move Worker.hs and Jobs.hs to Omni/Ava/

Architecture decision: Everything flows through Op free monad. Coder.hs becomes a markdown workflow (dev.md), not Haskell code.

💬 [human] 1 month ago

More progress:

Completed:

t-447: ✅ Enhanced dev.md with init/recovery phases
t-451: ✅ Marked Engine.hs runtime functions as deprecated
t-448: ✅ Developer.hs now uses dev.md workflow via Agent.runAgent

Key architectural changes:

1. Agent.AgentOptions now has optOnEvent callback for programmatic use
2. Developer.hs loads dev.md workflow, appends task, runs via Agent.runAgent
3. Events flow through Op free monad to callback for Telegram status updates
4. No more dependency on Coder.hs from Developer.hs
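To illustrate how events can flow through a free monad to a callback, here is a self-contained toy version. It is not the real Op type from Omni: `OpF`, `Free`, `Event`, `emit`, `callLLM`, `runOp`, `collectEvents` and `devTask` are all hypothetical, and the `onEvent` argument of `runOp` only plays the role that optOnEvent is described as playing above.

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical events a caller may want to observe.
data Event = PhaseChanged String | CostUpdated Double
  deriving (Show, Eq)

-- Toy instruction set: emit an event, or ask the model for a completion.
data OpF next
  = Emit Event next
  | CallLLM String (String -> next)

instance Functor OpF where
  fmap g (Emit e next) = Emit e (g next)
  fmap g (CallLLM p k) = CallLLM p (g . k)

-- Hand-rolled free monad so the sketch needs only base.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

type Op = Free OpF

emit :: Event -> Op ()
emit e = Free (Emit e (Pure ()))

callLLM :: String -> Op String
callLLM prompt = Free (CallLLM prompt Pure)

-- | Interpret a program: events go to the callback, LLM calls to the provider.
runOp :: (Event -> IO ()) -> (String -> IO String) -> Op a -> IO a
runOp onEvent llm = go
  where
    go (Pure a)              = pure a
    go (Free (Emit e next))  = onEvent e >> go next
    go (Free (CallLLM p k))  = llm p >>= go . k

-- | Test helper: run a program and return the events in emission order.
collectEvents :: (String -> IO String) -> Op a -> IO (a, [Event])
collectEvents llm op = do
  ref <- newIORef []
  a <- runOp (\e -> modifyIORef ref (e :)) llm op
  evs <- readIORef ref
  pure (a, reverse evs)

-- | Example program in the spirit of a workflow with phases.
devTask :: String -> Op String
devTask task = do
  emit (PhaseChanged "init")
  plan <- callLLM ("plan: " ++ task)
  emit (PhaseChanged "verify")
  pure plan
```

The same `devTask` value can be run with a callback that prints to the terminal or one that edits a Telegram message; the program itself does not know which.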

Remaining:

t-449: Delete Coder.hs (now safe to do)
t-450: Move Worker.hs/Jobs.hs to Omni/Ava/

💬 [human] 1 month ago

All subtasks complete:

t-447 ✅ dev.md enhanced with init/recovery phases
t-448 ✅ Developer.hs uses dev.md via Agent.runAgent
t-449 ✅ Coder.hs deleted (was already gone)
t-450 ✅ Worker.hs/Jobs.hs moved to Omni/Ava/Worker/ (was already done)
t-451 ✅ Engine.hs runtime functions deprecated

Key achievements:

1. Unified agent spawning: CLI and Telegram use same Agent.runAgent path
2. Events flow through Op free monad with optOnEvent callback
3. Ava-specific code (Worker/Jobs) properly located in Omni/Ava/
4. dev.md workflow replaces hardcoded Coder.hs logic

🔄 [human] InProgress → Done · 1 month ago

Harden subagent system for reliability
