Add retry logic for transient API errors in agent runtime

t-304 · WorkTask · Omni/Agent/Engine.hs
Created 1 month ago · Updated 1 month ago

Description

Problem

When the agent makes LLM API calls, transient errors (rate limits, server errors, network timeouts) cause immediate failure. The agent should retry these automatically with exponential backoff.

Location

File: Omni/Agent/Engine.hs
Functions: chat and chatWithUsage (around lines 560-650)

These functions make HTTP requests to the LLM provider but don't handle transient failures.

Solution

Add a retry wrapper with exponential backoff for recoverable errors:

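-- Imports used by the snippets below (skip any Engine.hs already has).
-- The Text literals in isTransient also need OverloadedStrings.
import Control.Concurrent (threadDelay)
import Data.Text (Text)
import qualified Data.Text as Text
import Network.HTTP.Client (HttpException (..), HttpExceptionContent (..), responseStatus)
import Network.HTTP.Types.Status (statusCode)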

-- Retry configuration
data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int      -- retries after the first call, e.g., 3
  , retryBaseDelayMs :: Int      -- e.g., 1000 (1 second)
  , retryMaxDelayMs :: Int       -- e.g., 30000 (30 seconds)
  }

defaultRetryConfig :: RetryConfig
defaultRetryConfig = RetryConfig 3 1000 30000

-- Check if error is retryable
isRetryableError :: Int -> Bool
isRetryableError status = status `elem` [429, 500, 502, 503, 504]

-- Classify http-client exceptions: retryable status codes and timeouts
isRetryableException :: HttpException -> Bool
isRetryableException (HttpExceptionRequest _ (StatusCodeException resp _)) =
  isRetryableError (statusCode (responseStatus resp))
isRetryableException (HttpExceptionRequest _ ConnectionTimeout) = True
isRetryableException (HttpExceptionRequest _ ResponseTimeout) = True
isRetryableException _ = False

-- Retry wrapper
retryWithBackoff :: RetryConfig -> IO (Either Text a) -> IO (Either Text a)
retryWithBackoff cfg action = go 1
  where
    go attempt
      | attempt > retryMaxAttempts cfg = action  -- Retries exhausted: final attempt, result returned as-is
      | otherwise = do
          result <- action
          case result of
            Right _ -> pure result
            Left err | isTransient err -> do
              let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * (2 ^ (attempt - 1)))
              threadDelay (delayMs * 1000)
              go (attempt + 1)
            Left err -> pure (Left err)
    
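    -- Substring match on the error text; keep in sync with isRetryableError.
    -- Caveat: a non-retryable message that happens to contain "500" etc. also matches.
    isTransient err = "429" `Text.isInfixOf` err
                   || "500" `Text.isInfixOf` err
                   || "502" `Text.isInfixOf` err
                   || "503" `Text.isInfixOf` err
                   || "504" `Text.isInfixOf` err
                   || "timeout" `Text.isInfixOf` Text.toLower err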

Then wrap the chat function:

chatWithRetry :: LLM -> [Tool] -> [Message] -> IO (Either Text ChatResult)
chatWithRetry llm tools msgs = retryWithBackoff defaultRetryConfig (chat llm tools msgs)
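
The delay formula inside retryWithBackoff can be checked on its own. The sketch below pulls it out as a pure function; backoffDelayMs and backoffSchedule are hypothetical helper names, not existing functions in Engine.hs, and the config type is repeated only so the snippet stands alone.

```haskell
-- Same shape as the RetryConfig proposed above.
data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int
  , retryBaseDelayMs :: Int
  , retryMaxDelayMs  :: Int
  }

defaultRetryConfig :: RetryConfig
defaultRetryConfig = RetryConfig 3 1000 30000

-- Hypothetical helper: delay in milliseconds before retry number `attempt`
-- (1-based). Doubles each time and is capped at retryMaxDelayMs.
backoffDelayMs :: RetryConfig -> Int -> Int
backoffDelayMs cfg attempt =
  min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * 2 ^ (attempt - 1))

-- Hypothetical helper: the delays for every retry the config allows.
backoffSchedule :: RetryConfig -> [Int]
backoffSchedule cfg = map (backoffDelayMs cfg) [1 .. retryMaxAttempts cfg]
```

With the default config, backoffSchedule yields 1000, 2000 and 4000 ms. The 30000 ms cap only applies from the sixth retry onward (1000 * 2^5 = 32000), so it matters only if retryMaxAttempts is raised.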

Files to Modify

1. Omni/Agent/Engine.hs - Add retry logic to chat/chatWithUsage
2. Possibly Omni/Agent/Provider.hs if it has its own HTTP calls

Testing

1. Mock a 429 response and verify retry happens
2. Verify exponential backoff timing
3. Verify non-retryable errors fail immediately
4. Run bild --test Omni/Agent/Engine.hs
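
One possible way to cover steps 1 and 3 without a real HTTP server is to feed the wrapper a fake action that counts its calls. This is only a sketch: flakyAction and runMock are hypothetical names, the wrapper is a compact copy of the one in the Solution section (with 504 included) so the snippet stands alone, and the delays are set to zero so the check runs instantly.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.Text (Text)
import qualified Data.Text as Text

data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int
  , retryBaseDelayMs :: Int
  , retryMaxDelayMs  :: Int
  }

-- Compact copy of the proposed wrapper.
retryWithBackoff :: RetryConfig -> IO (Either Text a) -> IO (Either Text a)
retryWithBackoff cfg action = go 1
  where
    go attempt
      | attempt > retryMaxAttempts cfg = action
      | otherwise = do
          result <- action
          case result of
            Left err | isTransient err -> do
              let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * 2 ^ (attempt - 1))
              threadDelay (delayMs * 1000)
              go (attempt + 1)
            _ -> pure result
    isTransient err =
      any (`Text.isInfixOf` err) ["429", "500", "502", "503", "504"]
        || "timeout" `Text.isInfixOf` Text.toLower err

-- Hypothetical mock: returns the given error for the first `failures` calls,
-- then succeeds. Every call bumps the counter.
flakyAction :: IORef Int -> Int -> Text -> IO (Either Text Text)
flakyAction calls failures err = do
  modifyIORef' calls (+ 1)
  n <- readIORef calls
  pure (if n <= failures then Left err else Right "ok")

-- Runs the wrapper with zero delays; returns the result and the call count.
runMock :: Int -> Text -> IO (Either Text Text, Int)
runMock failures err = do
  calls <- newIORef 0
  result <- retryWithBackoff (RetryConfig 3 0 0) (flakyAction calls failures err)
  n <- readIORef calls
  pure (result, n)
```

Under these assumptions, runMock 2 "HTTP 429" should succeed on the third call, runMock 1 "HTTP 401 Unauthorized" should fail after a single call, and runMock 10 "HTTP 503" should give up after four calls (one initial attempt plus three retries).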

Acceptance Criteria

  • [ ] LLM API calls retry on 429, 500, 502, 503, 504 status codes
  • [ ] Retry on network timeout exceptions
  • [ ] Exponential backoff between retries (1s, 2s, 4s, etc.)
  • [ ] Maximum of 3 retry attempts by default
  • [ ] Non-retryable errors (400, 401, 404) fail immediately
  • [ ] Retry attempts are logged via engineOnActivity
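
The logging criterion could be met by passing a callback into the retry wrapper. The sketch below is an assumption-heavy illustration: the signature of engineOnActivity is not shown in this task, so the logger is modelled as a plain Text -> IO () function, and retryWithLogging and demo are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Data.IORef (atomicModifyIORef', modifyIORef', newIORef, readIORef)
import Data.Text (Text)
import qualified Data.Text as Text

data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int
  , retryBaseDelayMs :: Int
  , retryMaxDelayMs  :: Int
  }

-- Like retryWithBackoff, but reports each retry before sleeping.
retryWithLogging
  :: (Text -> IO ())   -- hypothetical logger (stand-in for engineOnActivity)
  -> (Text -> Bool)    -- decides whether an error is worth retrying
  -> RetryConfig
  -> IO (Either Text a)
  -> IO (Either Text a)
retryWithLogging logRetry isTransient cfg action = go 1
  where
    go attempt = do
      result <- action
      case result of
        Left err | isTransient err && attempt <= retryMaxAttempts cfg -> do
          let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * 2 ^ (attempt - 1))
          logRetry (Text.concat
            [ "Retry ", Text.pack (show attempt), "/", Text.pack (show (retryMaxAttempts cfg))
            , " in ", Text.pack (show delayMs), "ms after: ", err ])
          threadDelay (delayMs * 1000)
          go (attempt + 1)
        _ -> pure result

-- Usage sketch: an action that fails twice with 503, with log lines
-- collected in an IORef instead of being sent to the engine.
demo :: IO (Either Text Text, [Text])
demo = do
  logRef <- newIORef []
  calls <- newIORef (0 :: Int)
  let action = do
        n <- atomicModifyIORef' calls (\c -> (c + 1, c + 1))
        pure (if n < 3 then Left "HTTP 503" else Right "ok")
  result <- retryWithLogging (\m -> modifyIORef' logRef (m :))
              ("503" `Text.isInfixOf`) (RetryConfig 3 0 0) action
  logs <- readIORef logRef
  pure (result, reverse logs)
```

Taking the transient-error check as a parameter also leaves room to swap the substring match for a typed check later without touching the retry loop.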

Timeline (2)

🔄 [human] Open → InProgress · 1 month ago
🔄 [human] InProgress → Done · 1 month ago