Add retry logic for transient API errors in agent runtime

t-304 · WorkTask · Omni/Agent/Engine.hs
Created 4 months ago · Updated 4 months ago

Description

Problem

When the agent makes LLM API calls, transient errors (rate limits, server errors, network timeouts) cause immediate failure. The agent should retry these automatically with exponential backoff.

Location

File: Omni/Agent/Engine.hs
Functions: chat and chatWithUsage (around lines 560-650)

These functions make HTTP requests to the LLM provider but don't handle transient failures.

Solution

Add a retry wrapper with exponential backoff for recoverable errors:

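import Control.Concurrent (threadDelay)
import Data.Text (Text)
import qualified Data.Text as Text
import Network.HTTP.Client (HttpException (..), HttpExceptionContent (..), responseStatus)
import Network.HTTP.Types.Status (statusCode)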

-- Retry configuration
data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int      -- e.g., 3 (retries after the initial call)
  , retryBaseDelayMs :: Int      -- e.g., 1000 (1 second)
  , retryMaxDelayMs :: Int       -- e.g., 30000 (30 seconds)
  }

defaultRetryConfig :: RetryConfig
defaultRetryConfig = RetryConfig 3 1000 30000

-- Check if error is retryable
isRetryableError :: Int -> Bool
isRetryableError status = status `elem` [429, 500, 502, 503, 504]

-- The same check for http-client exceptions, including network timeouts
isRetryableException :: HttpException -> Bool
isRetryableException (HttpExceptionRequest _ (StatusCodeException resp _)) =
  isRetryableError (statusCode (responseStatus resp))
isRetryableException (HttpExceptionRequest _ ConnectionTimeout) = True
isRetryableException (HttpExceptionRequest _ ResponseTimeout) = True
isRetryableException _ = False

-- Retry wrapper
retryWithBackoff :: RetryConfig -> IO (Either Text a) -> IO (Either Text a)
retryWithBackoff cfg action = go 1
  where
    go attempt
      | attempt > retryMaxAttempts cfg = action  -- Retries exhausted: run once more and return the result as-is
      | otherwise = do
          result <- action
          case result of
            Right _ -> pure result
            Left err | isTransient err -> do
              let delayMs = min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * (2 ^ (attempt - 1)))
              threadDelay (delayMs * 1000)
              go (attempt + 1)
            Left err -> pure (Left err)
    
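    -- Errors arrive as Text here, so match on the status code or "timeout".
    -- Keep this list in sync with isRetryableError (504 included).
    isTransient err = "429" `Text.isInfixOf` err
                   || "500" `Text.isInfixOf` err
                   || "502" `Text.isInfixOf` err
                   || "503" `Text.isInfixOf` err
                   || "504" `Text.isInfixOf` err
                   || "timeout" `Text.isInfixOf` Text.toLower err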

Then wrap the chat function:

chatWithRetry :: LLM -> [Tool] -> [Message] -> IO (Either Text ChatResult)
chatWithRetry llm tools msgs = retryWithBackoff defaultRetryConfig (chat llm tools msgs)
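
To make the backoff arithmetic concrete, the delay calculation used inside retryWithBackoff can be pulled out as a pure function. This is a minimal sketch, not the committed implementation: RetryConfig mirrors the record proposed above, while backoffDelayMs and backoffSchedule are hypothetical helper names introduced only for illustration.

```haskell
-- Sketch only: RetryConfig mirrors the record proposed in this task;
-- backoffDelayMs and backoffSchedule are hypothetical helper names.
data RetryConfig = RetryConfig
  { retryMaxAttempts :: Int -- retries allowed after the first call
  , retryBaseDelayMs :: Int -- delay before the first retry
  , retryMaxDelayMs :: Int -- upper bound on any single delay
  }

defaultRetryConfig :: RetryConfig
defaultRetryConfig = RetryConfig 3 1000 30000

-- Delay before retry number n (1-based): base * 2^(n-1), capped at the maximum.
backoffDelayMs :: RetryConfig -> Int -> Int
backoffDelayMs cfg n =
  min (retryMaxDelayMs cfg) (retryBaseDelayMs cfg * 2 ^ (n - 1))

-- Every delay the wrapper would sleep through if all calls kept failing.
backoffSchedule :: RetryConfig -> [Int]
backoffSchedule cfg = map (backoffDelayMs cfg) [1 .. retryMaxAttempts cfg]

main :: IO ()
main = do
  print (backoffSchedule defaultRetryConfig) -- prints [1000,2000,4000]
  print (backoffSchedule (RetryConfig 7 1000 30000)) -- the cap applies from the sixth retry
```

With the defaults the cap never applies, because the largest delay is 4 seconds. The 30 second ceiling only matters if retryMaxAttempts is raised to 6 or more.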

Files to Modify

1. Omni/Agent/Engine.hs - Add retry logic to chat/chatWithUsage
2. Possibly Omni/Agent/Provider.hs, if it makes its own HTTP calls

Testing

1. Mock a 429 response and verify that a retry happens
2. Verify exponential backoff timing
3. Verify that non-retryable errors fail immediately
4. Run bild --test Omni/Agent/Engine.hs
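
The first three checks can be exercised without a real HTTP server or real sleeps. The sketch below assumes a test-friendly variant of the wrapper whose sleep function is passed in as a parameter; retryWith, failingThenOk, and runScenario are hypothetical names, and the fake action stands in for chat.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.Text (Text)
import qualified Data.Text as Text

-- Hypothetical test-friendly variant of retryWithBackoff: the sleep
-- function is a parameter, so a test can record delays instead of waiting.
retryWith ::
  (Int -> IO ()) -> -- called with each delay in milliseconds
  Int -> -- maximum number of retries
  Int -> -- base delay in milliseconds
  IO (Either Text a) ->
  IO (Either Text a)
retryWith sleepMs maxRetries baseMs action = go 1
  where
    go attempt = do
      result <- action
      case result of
        Left err
          | attempt <= maxRetries && isTransient err -> do
              sleepMs (baseMs * 2 ^ (attempt - 1))
              go (attempt + 1)
        _ -> pure result -- success, permanent error, or retries exhausted

    isTransient err =
      any (`Text.isInfixOf` err) ["429", "500", "502", "503", "504"]

-- Fake LLM call: fails with the given error for the first n calls, then succeeds.
failingThenOk :: IORef Int -> Int -> Text -> IO (Either Text Text)
failingThenOk calls n err = do
  modifyIORef' calls (+ 1)
  seen <- readIORef calls
  pure (if seen <= n then Left err else Right "ok")

-- Runs one scenario and returns (result, number of calls, recorded delays).
runScenario :: Int -> Text -> IO (Either Text Text, Int, [Int])
runScenario failures err = do
  calls <- newIORef 0
  delays <- newIORef []
  result <-
    retryWith (\d -> modifyIORef' delays (d :)) 3 1000 (failingThenOk calls failures err)
  n <- readIORef calls
  ds <- readIORef delays
  pure (result, n, reverse ds)

main :: IO ()
main = do
  runScenario 2 "HTTP 429" >>= print -- two retries, then success
  runScenario 1 "HTTP 401" >>= print -- not transient: fails on the first call
  runScenario 10 "HTTP 503" >>= print -- gives up after three recorded delays
```

Injecting the sleep function is one possible design. The real tests may instead use very small delays or measure elapsed time.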

Acceptance Criteria

  • [ ] LLM API calls retry on 429, 500, 502, 503, 504 status codes
  • [ ] Retry on network timeout exceptions
  • [ ] Exponential backoff between retries (1s, 2s, 4s, etc.)
  • [ ] Maximum of 3 retry attempts by default
  • [ ] Non-retryable errors (400, 401, 404) fail immediately
  • [ ] Retry attempts are logged via engineOnActivity
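
The proposed retryWithBackoff does not yet report retries, so the last criterion needs a hook. One way to satisfy it, sketched under assumptions: the wrapper takes a Text -> IO () callback and calls it before each sleep. The callback shape is a guess, since the real type of engineOnActivity is not shown in this task, and retryWithLog is a hypothetical name.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent (threadDelay)
import Data.IORef (modifyIORef', newIORef, readIORef)
import Data.Text (Text)
import qualified Data.Text as Text

-- Hypothetical variant of retryWithBackoff that reports each retry through
-- a callback before sleeping. The callback stands in for engineOnActivity,
-- whose real type is an assumption here.
retryWithLog ::
  (Text -> IO ()) -> -- activity callback (assumed shape)
  Int -> -- maximum number of retries
  Int -> -- base delay in milliseconds
  IO (Either Text a) ->
  IO (Either Text a)
retryWithLog logActivity maxRetries baseMs action = go 1
  where
    go attempt = do
      result <- action
      case result of
        Left err
          | attempt <= maxRetries && isTransient err -> do
              let delayMs = baseMs * 2 ^ (attempt - 1)
              logActivity (retryMessage attempt delayMs err)
              threadDelay (delayMs * 1000) -- threadDelay takes microseconds
              go (attempt + 1)
        _ -> pure result

    retryMessage attempt delayMs err =
      Text.concat
        [ "Retry ", tshow attempt, "/", tshow maxRetries
        , " in ", tshow delayMs, "ms after: ", err
        ]

    isTransient err =
      any (`Text.isInfixOf` err) ["429", "500", "502", "503", "504"]
        || "timeout" `Text.isInfixOf` Text.toLower err

    tshow :: Int -> Text
    tshow n = Text.pack (show n)

main :: IO ()
main = do
  logged <- newIORef []
  calls <- newIORef (0 :: Int)
  -- Fake call that is rate limited twice and then succeeds.
  let flaky = do
        modifyIORef' calls (+ 1)
        n <- readIORef calls
        pure (if n <= 2 then Left "HTTP 429 rate limited" else Right ("ok" :: Text))
  result <- retryWithLog (\m -> modifyIORef' logged (m :)) 3 1 flaky
  msgs <- readIORef logged
  print result
  mapM_ print (reverse msgs)
```

In the engine itself the callback would presumably be built from the engineOnActivity field of the engine configuration rather than an IORef.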

Git Commits

72e886a2 Add retry logic for transient API errors (t-304)
Ben Sima · 4 months ago · 1 file

Timeline (2)

🔄 [human] Open → InProgress · 4 months ago
🔄 [human] InProgress → Done · 4 months ago