When the agent makes LLM API calls, transient errors (rate limits, server errors, network timeouts) cause immediate failure. The agent should retry these automatically with exponential backoff.
File: Omni/Agent/Engine.hs
Functions: chat and chatWithUsage (around lines 560-650)
These functions make HTTP requests to the LLM provider but don't handle transient failures.
Add a retry wrapper with exponential backoff for recoverable errors:
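import Control.Concurrent (threadDelay)
-- The imports below are needed by the snippets that follow, if not already in scope
import Data.Text (Text)
import qualified Data.Text as Text
import Network.HTTP.Client (HttpException (..), HttpExceptionContent (..), responseStatus)
import Network.HTTP.Types.Status (statusCode)

-- Retry configuration
data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int -- e.g., 3
  , retryBaseDelayMs :: Int -- e.g., 1000 (1 second)
  , retryMaxDelayMs :: Int -- e.g., 30000 (30 seconds)
  }

defaultRetryConfig :: RetryConfig
defaultRetryConfig = RetryConfig 3 1000 30000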

-- Check if an HTTP status code is retryable
isRetryableError :: Int -> Bool
isRetryableError status = status `elem` [429, 500, 502, 503, 504]

-- Or, for exceptions thrown by http-client (status errors and timeouts)
isRetryableException :: HttpException -> Bool
isRetryableException (HttpExceptionRequest _ (StatusCodeException resp _)) =
  isRetryableError (statusCode (responseStatus resp))
isRetryableException (HttpExceptionRequest _ ConnectionTimeout) = True
isRetryableException (HttpExceptionRequest _ ResponseTimeout) = True
isRetryableException _ = False
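
-- Retry wrapper: runs the action at most retryMaxAttempts times in total
retryWithBackoff :: RetryConfig -> IO (Either Text a) -> IO (Either Text a)
retryWithBackoff cfg action = go 1
  where
    go attempt
      | attempt >= retryMaxAttempts cfg = action -- Final attempt, no more retries
      | otherwise = do
          result <- action
          case result of
            Right _ -> pure result
            Left err | isTransient err -> do
              -- Delay doubles per attempt (base, 2*base, 4*base, ...), capped at the max
              let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * (2 ^ (attempt - 1)))
              threadDelay (delayMs * 1000) -- threadDelay takes microseconds
              go (attempt + 1)
            Left err -> pure (Left err)
    -- Substring matching on the error text is a heuristic; matching on the
    -- status code (see isRetryableError) is more robust where it is available.
    isTransient err = "429" `Text.isInfixOf` err
      || "500" `Text.isInfixOf` err
      || "502" `Text.isInfixOf` err
      || "503" `Text.isInfixOf` err
      || "504" `Text.isInfixOf` err
      || "timeout" `Text.isInfixOf` Text.toLower err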
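
To make the backoff arithmetic concrete, here is a minimal sketch of the delay calculation as a pure function. The names backoffDelayMs and backoffSchedule are hypothetical helpers introduced only for illustration; they are not part of Engine.hs. The formula is the one used in the wrapper: the base delay doubled for each prior attempt, capped at the maximum.

```haskell
-- Hypothetical helper: delay in milliseconds to wait after the given
-- 1-based attempt fails, doubling each time and capped at maxMs.
backoffDelayMs :: Int -> Int -> Int -> Int
backoffDelayMs baseMs maxMs attempt = min maxMs (baseMs * 2 ^ (attempt - 1))

-- Hypothetical helper: the delays for attempts 1..attempts.
backoffSchedule :: Int -> Int -> Int -> [Int]
backoffSchedule baseMs maxMs attempts =
  map (backoffDelayMs baseMs maxMs) [1 .. attempts]

-- With a 1000 ms base and a 30000 ms cap:
-- backoffSchedule 1000 30000 6 == [1000, 2000, 4000, 8000, 16000, 30000]
```

Keeping the calculation pure like this would also let the backoff timing be checked without sleeping.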
Then wrap the chat function:
chatWithRetry :: LLM -> [Tool] -> [Message] -> IO (Either Text ChatResult)
chatWithRetry llm tools msgs = retryWithBackoff defaultRetryConfig (chat llm tools msgs)
Files to modify:
1. Omni/Agent/Engine.hs - Add retry logic to chat/chatWithUsage
2. Possibly Omni/Agent/Provider.hs if it has its own HTTP calls
Testing:
1. Mock a 429 response and verify retry happens
2. Verify exponential backoff timing
3. Verify non-retryable errors fail immediately
4. Run bild --test Omni/Agent/Engine.hs
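
One possible way to cover the first and third checks without a real HTTP server is to drive the retry wrapper with a counting mock action. The sketch below is self-contained and uses only base and text; it includes its own copy of a retry wrapper that makes at most retryMaxAttempts calls in total. The names mockFailing, checkRetriesOn429 and checkFailsFastOnNonRetryable are hypothetical, as are the error strings, and how this would plug into the bild test runner is not shown.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import Data.Text (Text)
import qualified Data.Text as Text

data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int
  , retryBaseDelayMs :: Int
  , retryMaxDelayMs :: Int
  }

-- Retry wrapper making at most retryMaxAttempts calls in total.
retryWithBackoff :: RetryConfig -> IO (Either Text a) -> IO (Either Text a)
retryWithBackoff cfg action = go 1
  where
    go attempt
      | attempt >= retryMaxAttempts cfg = action
      | otherwise = do
          result <- action
          case result of
            Left err | isTransient err -> do
              let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * 2 ^ (attempt - 1))
              threadDelay (delayMs * 1000)
              go (attempt + 1)
            _ -> pure result
    isTransient err =
      any (`Text.isInfixOf` Text.toLower err) ["429", "500", "502", "503", "504", "timeout"]

-- Hypothetical mock: returns Left err for the first n calls, then Right "ok".
-- The IORef counts how many times the action was invoked.
mockFailing :: Int -> Text -> IO (IO (Either Text Text), IORef Int)
mockFailing n err = do
  calls <- newIORef 0
  let action = do
        k <- atomicModifyIORef' calls (\c -> (c + 1, c + 1))
        pure (if k <= n then Left err else Right "ok")
  pure (action, calls)

-- Two 429 failures followed by success: expect three calls and Right "ok".
checkRetriesOn429 :: IO Bool
checkRetriesOn429 = do
  (action, calls) <- mockFailing 2 "HTTP 429 Too Many Requests"
  result <- retryWithBackoff (RetryConfig 3 1 10) action
  n <- readIORef calls
  pure (result == Right "ok" && n == 3)

-- A 401 is not retryable: expect exactly one call and the error passed through.
checkFailsFastOnNonRetryable :: IO Bool
checkFailsFastOnNonRetryable = do
  (action, calls) <- mockFailing 5 "HTTP 401 Unauthorized"
  result <- retryWithBackoff (RetryConfig 3 1 10) action
  n <- readIORef calls
  pure (result == Left "HTTP 401 Unauthorized" && n == 1)
```

The config uses a 1 ms base delay so the checks finish quickly; the real defaults would make such a test sleep for several seconds.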
engineOnActivity