Signal Agent Evaluation Framework (Haskell)

Date: 2026-03-11
Context: All-Haskell eval framework for the speculative fund quant pipeline.


1. Design

The eval framework lives in Omni.Fund.Quant.Eval. It’s pure Haskell — no Python anywhere. Uses the same data libraries (Data.Market, Data.Edgar, Data.Fred) as the production signal pipeline, so evals run on the exact same code path.

The eval is a first-class part of the system, not a side script. Every signal must pass eval before it influences portfolio weights.


2. Core Eval Types

module Omni.Fund.Quant.Eval where

import Control.Monad (forM_, replicateM, when)
import Data.Function (on)
import Data.List (groupBy, sortOn, transpose)
import qualified Data.Map.Strict as Map
import Data.Maybe (catMaybes)
import Data.Text (Text)
import Data.Time.Calendar (Day, diffDays, fromGregorian)
import Numeric.LinearAlgebra (Vector, Matrix)

-- | Result of evaluating a single signal type over a historical period.
data SignalEval = SignalEval
  { seName        :: Text           -- signal type name
  , sePeriod      :: (Day, Day)     -- eval period
  , seIC          :: Double         -- mean Information Coefficient
  , seICStd       :: Double         -- IC standard deviation
  , seICIR        :: Double         -- IC / std(IC) — information ratio
  , seHitRate     :: Double         -- % of periods where IC > 0
  , seTStat       :: Double         -- t-statistic: IC / (std / sqrt(n))
  , seNPeriods    :: Int            -- number of rebalance periods evaluated
  , seDecayCurve  :: [(Int, Double)] -- (horizon_days, IC_at_horizon)
  }

-- | Result of portfolio-level backtest.
data PortfolioEval = PortfolioEval
  { peStrategy        :: Text
  , pePeriod          :: (Day, Day)
  , peTotalReturn     :: Double
  , peAnnualizedReturn :: Double
  , peSharpeRatio     :: Double
  , peMaxDrawdown     :: Double
  , peAnnualTurnover  :: Double
  , peWinRate         :: Double      -- % of periods with positive return
  , peCalmarRatio     :: Double      -- annualized return / max drawdown
  }

-- | Walk-forward configuration.
data EvalConfig = EvalConfig
  { ecUniverse      :: [Asset]
  , ecTrainStart    :: Day          -- e.g., 2020-01-01
  , ecTrainEnd      :: Day          -- e.g., 2023-12-31
  , ecTestStart     :: Day          -- e.g., 2024-01-01
  , ecTestEnd       :: Day          -- e.g., 2025-12-31
  , ecRebalanceFreq :: Int          -- days between rebalances (5 = weekly)
  , ecHorizons      :: [Int]        -- forward return horizons to test
  , ecMinIC         :: Double       -- minimum IC to pass (0.02)
  , ecMinHitRate    :: Double       -- minimum hit rate to pass (0.50)
  , ecMinTStat      :: Double       -- minimum t-stat to pass (2.0)
  }

defaultEvalConfig :: EvalConfig
defaultEvalConfig = EvalConfig
  { ecUniverse = sectorETFs ++ crossAssets
  , ecTrainStart = fromGregorian 2020 1 1
  , ecTrainEnd = fromGregorian 2023 12 31
  , ecTestStart = fromGregorian 2024 1 1
  , ecTestEnd = fromGregorian 2025 12 31
  , ecRebalanceFreq = 5
  , ecHorizons = [1, 5, 10, 20, 40, 60]
  , ecMinIC = 0.02
  , ecMinHitRate = 0.50
  , ecMinTStat = 2.0
  }
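The walk-forward code in later sections calls a `generateRebalDates` helper that is not shown. A minimal calendar-day sketch (an assumption; a production version would step through the exchange trading calendar instead):

```haskell
import Data.Time.Calendar (Day, addDays, fromGregorian)

-- | Every Nth calendar day in [start, end]. Sketch only: a real version
-- would snap each date to the nearest trading day.
generateRebalDates :: Day -> Day -> Int -> [Day]
generateRebalDates start end freq =
  takeWhile (<= end) (iterate (addDays (fromIntegral freq)) start)

main :: IO ()
main = print (generateRebalDates (fromGregorian 2024 1 1) (fromGregorian 2024 1 20) 5)
-- four dates: Jan 1, 6, 11, 16
```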

3. Layer 1: Extraction Accuracy

Tests that data libraries return correct values and signal functions compute correctly on known inputs.

-- | Unit tests for deterministic signal computations.
-- Uses known price histories with pre-computed expected values.

-- Golden test: momentum signal on synthetic data
test_momentumSignal :: Test
test_momentumSignal =
  let prices = [100, 102, 105, 103, 108, 110, 112, 115]  -- 15% total return
      actual = trailingReturn 8 (toDailyBars prices)
      expected = 0.15  -- (115 - 100) / 100
  in assertApproxEqual "momentum return" 0.01 expected actual

-- Golden test: z-score mean reversion
test_meanReversionZ :: Test
test_meanReversionZ =
  let -- Alternating 99/101 closes give mean 100 and std ~1, so a close of
      -- 102 sits ~2σ above the 200-day mean. (An all-constant history would
      -- have zero std and an undefined z-score.)
      prices = take 200 (cycle [99, 101]) ++ [102]
      z = meanReversionZ 200 (toDailyBars prices)
  in assertApproxEqual "z-score" 0.1 2.0 z  -- expected ~2σ

-- Golden test: correlation matrix
test_correlationMatrix :: Test
test_correlationMatrix =
  let -- Perfectly correlated series
      seriesA = [1, 2, 3, 4, 5] :: [Double]
      seriesB = [2, 4, 6, 8, 10] :: [Double]
      corr = correlationMatrix [toDailyBars seriesA, toDailyBars seriesB]
  in assertApproxEqual "perfect correlation" 0.01 (corr ! (0, 1)) 1.0

-- Null input tests
test_nullSignal :: Test
test_nullSignal =
  let signals = momentumSignal 20 Map.empty
  in assertEqual "no signals from empty data" 0 (Map.size signals)

-- EDGAR parsing tests (against known filings)
test_edgarParsing :: Test
test_edgarParsing = do
  let goldenFiling = loadGolden "test/golden/form4-sample.json"
  parsed <- Edgar.parseForm4 goldenFiling
  assertEqual "filer name" "Tim Cook" (itFiler parsed)
  assertEqual "tx type" Purchase (itTxType parsed)
  assertEqual "shares" 100000 (itShares parsed)
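The `assertApproxEqual` helper used in the golden tests above is assumed to have an HUnit-style shape along these lines (the `Test` wrapper is elided in this sketch):

```haskell
-- | Assumed shape: fail with a message unless expected and actual agree
-- to within the given tolerance.
assertApproxEqual :: String -> Double -> Double -> Double -> IO ()
assertApproxEqual msg tol expected actual
  | abs (expected - actual) <= tol = pure ()
  | otherwise =
      error (msg ++ ": expected " ++ show expected ++ " +/- " ++ show tol
                 ++ ", got " ++ show actual)

main :: IO ()
main = assertApproxEqual "sanity" 0.01 0.15 0.1525  -- within tolerance, no error
```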

QuickCheck properties

-- Signal values are always in valid range
prop_signalBounded :: [Double] -> Property
prop_signalBounded prices =
  not (null prices) ==>
    let z = meanReversionZ 20 (toDailyBars prices)
    in z >= -10 && z <= 10  -- z-scores beyond 10 indicate a bug

-- Correlation matrix is symmetric and diagonal = 1
prop_corrSymmetric :: [[Double]] -> Property
prop_corrSymmetric series =
  length series >= 2 && all ((>= 2) . length) series ==>
    let m = correlationMatrix (map toDailyBars series)
        n = rows m
    in m == tr m  -- symmetric
       && all (\i -> abs (m ! (i, i) - 1.0) < 0.001) [0..n-1]  -- diag = 1

-- Kelly weights sum to <= 1 (no leverage)
prop_kellyNoLeverage :: Vector Double -> Matrix Double -> Property
prop_kellyNoLeverage mu sigma =
  isPositiveDefinite sigma ==>
    let f = kellyOptimalCorrelated 0.04 mu sigma
    in sumElements f <= 1.0 + 1e-10

-- Black-Litterman posterior lies between prior and views (single-view case, hence q ! 0)
prop_blPosteriorBounded :: Property
prop_blPosteriorBounded = forAll genBLInputs $ \(mu, sigma, tau, p, q, omega) ->
  let (mu', _) = blUpdate mu sigma tau p q omega
  in all (\i -> mu' ! i >= min (mu ! i) (q ! 0) - 0.1
              && mu' ! i <= max (mu ! i) (q ! 0) + 0.1) [0..dim mu - 1]

4. Layer 2: Signal Predictive Power (Walk-Forward IC)

This is the core eval. Pure Haskell implementation.

-- | Spearman rank correlation between two vectors.
spearmanCorrelation :: [Double] -> [Double] -> Double
spearmanCorrelation xs ys =
  let n = fromIntegral (length xs)
      rankX = rank xs
      rankY = rank ys
      diffs = zipWith (-) rankX rankY
      d2 = sum (map (^2) diffs)
  in 1 - (6 * d2) / (n * (n^2 - 1))

-- | Rank a list (handling ties with average rank).
rank :: [Double] -> [Double]
rank xs =
  let groups = groupBy ((==) `on` snd) (sortOn snd (zip [0 :: Int ..] xs))
      starts = scanl (+) (1 :: Int) (map length groups)
      ranked = concat [ [ (i, fromIntegral s + (fromIntegral (length g) - 1) / 2)
                        | (i, _) <- g ]
                      | (g, s) <- zip groups starts ]
  in map snd (sortOn fst ranked)
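As a sanity check, the Spearman formula above can be exercised standalone. A self-contained sketch (the simplified `rank'` here assumes distinct values, i.e. no ties):

```haskell
import Data.List (sortOn)

-- | Ordinal ranks, 1-based; adequate when all values are distinct.
rank' :: [Double] -> [Double]
rank' xs = map snd (sortOn fst (zip (map fst (sortOn snd (zip [0 :: Int ..] xs))) [1 ..]))

-- | Spearman rank correlation via the rank-difference formula.
spearman :: [Double] -> [Double] -> Double
spearman xs ys =
  let n  = fromIntegral (length xs)
      d2 = sum (map (^ (2 :: Int)) (zipWith (-) (rank' xs) (rank' ys)))
  in 1 - (6 * d2) / (n * (n ^ (2 :: Int) - 1))

main :: IO ()
main = do
  print (spearman [1, 2, 3, 4, 5] [2, 4, 6, 8, 10])   -- monotone increasing: 1.0
  print (spearman [1, 2, 3, 4, 5] [10, 8, 6, 4, 2])   -- monotone decreasing: -1.0
```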

-- | Measure Information Coefficient for a signal over a historical period.
--
-- Walk-forward: at each rebalance date, compute signal scores using only
-- data available as of that date, then correlate with forward returns.
evalSignalIC
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- signal function
  -> Map Text [DailyBar]                        -- full price history
  -> SignalEval
evalSignalIC config signalFn allPrices =
  let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)

      -- For each rebalance date, compute IC
      ics = map (\date ->
        let -- Slice prices up to this date (no lookahead)
            pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
            -- Compute signals
            signals = signalFn pricesAsOf
            -- Get 20-day forward returns (the base eval horizon; the decay
            -- curve below covers the other horizons)
            fwdReturns = Map.map (forwardReturn date 20) allPrices
            -- Align: only assets with both signal and return
            aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
            (sigVals, retVals) = unzip (Map.elems aligned)
        in if length sigVals >= 5
           then Just (spearmanCorrelation sigVals retVals)
           else Nothing
        ) rebalDates

      validICs = catMaybes ics
      n = length validICs
      meanIC = if n > 0 then sum validICs / fromIntegral n else 0
      stdIC = standardDeviation validICs
      icir = if stdIC > 0 then meanIC / stdIC else 0
      hitRate = fromIntegral (length (filter (> 0) validICs)) / fromIntegral (max 1 n)
      tstat = if stdIC > 0 then meanIC / (stdIC / sqrt (fromIntegral n)) else 0

      -- Decay curve: IC at multiple horizons
      decay = map (\h ->
        let ics' = catMaybes $ map (\date ->
              let pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
                  signals = signalFn pricesAsOf
                  fwdReturns = Map.map (forwardReturn date h) allPrices
                  aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
                  (sv, rv) = unzip (Map.elems aligned)
              in if length sv >= 5 then Just (spearmanCorrelation sv rv) else Nothing
              ) rebalDates
        in (h, if null ics' then 0 else sum ics' / fromIntegral (length ics'))
        ) (ecHorizons config)

  in SignalEval
       { seName = "signal"
       , sePeriod = (ecTestStart config, ecTestEnd config)
       , seIC = meanIC
       , seICStd = stdIC
       , seICIR = icir
       , seHitRate = hitRate
       , seTStat = tstat
       , seNPeriods = n
       , seDecayCurve = decay
       }

-- | Check if a signal passes minimum quality thresholds.
signalPasses :: EvalConfig -> SignalEval -> Bool
signalPasses config eval =
  seIC eval >= ecMinIC config
  && seHitRate eval >= ecMinHitRate config
  && seTStat eval >= ecMinTStat config
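`evalSignalIC` above also relies on a `standardDeviation` helper that is not shown; it is assumed to be the usual sample estimator:

```haskell
-- | Sample standard deviation; 0 for fewer than two observations.
standardDeviation :: [Double] -> Double
standardDeviation xs
  | length xs < 2 = 0
  | otherwise =
      let n = fromIntegral (length xs)
          m = sum xs / n
      in sqrt (sum [ (x - m) ^ (2 :: Int) | x <- xs ] / (n - 1))

main :: IO ()
main = print (standardDeviation [2, 4, 4, 4, 5, 5, 7, 9])  -- sqrt (32/7), ~2.138
```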

5. Layer 3: Portfolio-Level Backtest

-- | Walk-forward portfolio simulation.
-- Compares signal-driven strategy to equal-weight benchmark.
evalPortfolio
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- signal function
  -> Map Text [DailyBar]                        -- full price history
  -> (PortfolioEval, PortfolioEval)             -- (strategy, benchmark)
evalPortfolio config signalFn allPrices =
  let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)

      -- Strategy: signal → alpha → Kelly weights
      strategyReturns = walkForwardReturns rebalDates allPrices $ \date pricesAsOf ->
        let signals = signalFn pricesAsOf
            alphaScores = combineSignals defaultSignalWeights (Map.elems signals)
            mu = toVector alphaScores
            sigma = covarianceMatrix (Map.elems pricesAsOf)  -- realized covariance; Kelly sizing needs covariance, not correlation
            weights = kellyOptimalCorrelated 0.04 mu sigma
        in weights

      -- Benchmark: equal weight across tradeable universe
      benchmarkReturns = walkForwardReturns rebalDates allPrices $ \_ _ ->
        let n = length (filter assetTradeable (ecUniverse config))
        in fromList (replicate n (1.0 / fromIntegral n))

      stratEval = computePortfolioMetrics "signal-driven" (ecTestStart config, ecTestEnd config) strategyReturns
      benchEval = computePortfolioMetrics "equal-weight" (ecTestStart config, ecTestEnd config) benchmarkReturns

  in (stratEval, benchEval)

-- | Walk forward: at each rebalance date, compute weights, hold until next date,
-- measure return.
walkForwardReturns
  :: [Day]                                        -- rebalance dates
  -> Map Text [DailyBar]                          -- price history
  -> (Day -> Map Text [DailyBar] -> Vector Double) -- weight function
  -> [(Day, Double)]                              -- (date, period return)
walkForwardReturns dates prices weightFn =
  zipWith (\d1 d2 ->
    let pricesAsOf = Map.map (filter (\b -> dbDate b <= d1)) prices
        weights = weightFn d1 pricesAsOf
        -- Period return = weighted sum of individual asset returns
        assetReturns = Map.map (periodReturn d1 d2) prices
        portfolioReturn = dot weights (toVector assetReturns)
    in (d1, portfolioReturn)
    ) dates (tail dates)
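`periodReturn` is another assumed helper. A simplified sketch using a date-keyed close map instead of the `DailyBar` list (`Map.lookupLE` picks the last close on or before each date, which handles non-trading days):

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Calendar (Day, fromGregorian)

-- | Simple return between two dates, using the last known close on or
-- before each date. Returns 0 when either price is missing.
periodReturn :: Day -> Day -> Map.Map Day Double -> Double
periodReturn d1 d2 closes =
  case (Map.lookupLE d1 closes, Map.lookupLE d2 closes) of
    (Just (_, p1), Just (_, p2)) | p1 /= 0 -> p2 / p1 - 1
    _                                      -> 0

main :: IO ()
main =
  let closes = Map.fromList [ (fromGregorian 2024 1 2, 100)
                            , (fromGregorian 2024 1 9, 105) ]
  in print (periodReturn (fromGregorian 2024 1 2) (fromGregorian 2024 1 9) closes)  -- ~0.05
```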

-- | Compute standard portfolio metrics from a return series.
computePortfolioMetrics :: Text -> (Day, Day) -> [(Day, Double)] -> PortfolioEval
computePortfolioMetrics name period returns =
  let rets = map snd returns
      cumulative = scanl1 (*) (map (+ 1) rets)
      totalReturn = last cumulative - 1
      years = fromIntegral (diffDays (snd period) (fst period)) / 365.25
      annualized = (1 + totalReturn) ** (1 / years) - 1
      -- Returns here are per rebalance period, not daily, so annualize the
      -- Sharpe by the observed number of periods per year, not sqrt 252.
      periodsPerYear = fromIntegral (length rets) / years
      sharpe = mean rets / standardDeviation rets * sqrt periodsPerYear
      maxDD = maxDrawdown cumulative
      winRate = fromIntegral (length (filter (> 0) rets)) / fromIntegral (length rets)
      calmar = if maxDD > 0 then annualized / maxDD else 0
  in PortfolioEval
       { peStrategy = name
       , pePeriod = period
       , peTotalReturn = totalReturn
       , peAnnualizedReturn = annualized
       , peSharpeRatio = sharpe
       , peMaxDrawdown = maxDD
       , peAnnualTurnover = 0  -- TODO: track from weight changes
       , peWinRate = winRate
       , peCalmarRatio = calmar
       }

maxDrawdown :: [Double] -> Double
maxDrawdown [] = 0  -- no observations, no drawdown (avoids 'maximum' on [])
maxDrawdown cumReturns =
  let peaks = scanl1 max cumReturns
      drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
  in maximum drawdowns
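A quick runnable check of the drawdown computation (the function is restated, with an empty-list guard, so the snippet is self-contained):

```haskell
maxDrawdown :: [Double] -> Double
maxDrawdown [] = 0
maxDrawdown cumReturns =
  let peaks = scanl1 max cumReturns
      drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
  in maximum drawdowns

-- Peak 1.2 followed by trough 0.9: drawdown (1.2 - 0.9) / 1.2 = 0.25.
main :: IO ()
main = print (maxDrawdown [1.0, 1.2, 0.9, 1.1, 1.3])
```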

6. Agent-Specific Evals (Phase 3)

These only apply when LLM agents are involved (insider + macro agents).

6.1 Hallucination Detection

-- | Feed the agent empty/null data. It must produce zero signals.
evalNullInput :: Op () -> IO Bool
evalNullInput agent = do
  result <- runOp agent emptyContext
  let signals = extractSignals result
  pure (null signals)

-- | Feed the agent contradictory data. It must produce low confidence.
evalContradictoryInput :: Op () -> IO Bool
evalContradictoryInput agent = do
  result <- runOp agent contradictoryContext
  let signals = extractSignals result
  pure (all (\s -> sigConfidence s < 0.3) signals)

6.2 Confidence Calibration

-- | Calibration curve: group signals by confidence bucket, measure actual accuracy.
-- A well-calibrated agent's 0.8-confidence signals should be correct ~80% of the time.
evalCalibration
  :: [(Signal, Double)]  -- (signal, actual forward return)
  -> [(Double, Double, Int)]  -- (predicted_confidence, actual_accuracy, n)
evalCalibration signalReturns =
  let buckets = [0.0, 0.1 .. 1.0]
      grouped = map (\lo ->
        let hi = lo + 0.1
            inBucket = filter (\(s, _) -> sigConfidence s >= lo && sigConfidence s < hi) signalReturns
            n = length inBucket
            correct = length (filter (\(s, r) -> signum (sigValue s) == signum r) inBucket)
            accuracy = if n > 0 then fromIntegral correct / fromIntegral n else 0
        in (lo + 0.05, accuracy, n)
        ) (init buckets)
  in grouped

-- | Calibration error: mean absolute difference between confidence and accuracy.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
  let weighted = [abs (p - actual) * fromIntegral n | (p, actual, n) <- curve, n > 0]
      totalN = sum [n | (_, _, n) <- curve]
  in sum weighted / fromIntegral (max 1 totalN)
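A small worked example: a bucket predicting 55% confidence that was right 50% of the time (n=10), plus one predicting 85% that was right 70% of the time (n=20), gives a weighted error of (0.05·10 + 0.15·20) / 30 ≈ 0.117. Restating the function so the check is self-contained:

```haskell
-- | Mean absolute calibration gap, weighted by bucket size.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
  let weighted = [ abs (p - actual) * fromIntegral n | (p, actual, n) <- curve, n > 0 ]
      totalN = sum [ n | (_, _, n) <- curve ]
  in sum weighted / fromIntegral (max 1 totalN)

main :: IO ()
main = print (calibrationError [(0.55, 0.50, 10), (0.85, 0.70, 20)])  -- ~0.117
```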

6.3 Consistency (Multi-Run Variance)

-- | Run the agent N times on the same input. Signal values should be stable.
evalConsistency :: Int -> Op () -> Context -> IO Double
evalConsistency n agent ctx = do
  results <- replicateM n (runOp agent ctx)
  let signalSets = map extractSignals results
      -- Per-asset variance of signal values across runs (assumes every run
      -- emits signals in the same asset order)
      variances = map (variance . map sigValue) (transpose signalSets)
  pure (mean variances)

6.4 Value-Add Test

-- | Does the LLM agent add value beyond deterministic signals?
-- Compare IC(deterministic + agent) vs IC(deterministic only).
evalAgentValueAdd
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- deterministic signals only
  -> (Map Text [DailyBar] -> IO (Map Text Signal))  -- deterministic + agent
  -> Map Text [DailyBar]  -- price history
  -> IO (SignalEval, SignalEval, Double)  -- (without, with, IC delta)
evalAgentValueAdd config deterministicFn agentFn prices = do
  let evalDet = evalSignalIC config deterministicFn prices
  evalAgent <- evalSignalICWithIO config agentFn prices
  pure (evalDet, evalAgent, seIC evalAgent - seIC evalDet)

7. Eval Runner

-- | Run the full eval suite and print results.
runEvalSuite :: EvalConfig -> IO ()
runEvalSuite config = do
  putStrLn "=== Signal Agent Eval Suite ==="

  -- 1. Load historical data
  putStrLn "Loading historical price data..."
  prices <- loadHistoricalPrices config

  -- 2. Layer 1: Unit tests
  putStrLn "\n--- Layer 1: Extraction Accuracy ---"
  runUnitTests

  -- 3. Layer 2: IC per signal type
  putStrLn "\n--- Layer 2: Signal Predictive Power ---"
  let signals = [("momentum", momentumSignal 63), ("mean_rev", meanReversionSignal 200),
                 ("vol_regime", volRegimeSignal)]
  forM_ signals $ \(name, fn) -> do
    let eval = evalSignalIC config fn prices
    putStrLn $ name ++ ": IC=" ++ show (seIC eval)
              ++ " ICIR=" ++ show (seICIR eval)
              ++ " hit=" ++ show (seHitRate eval)
              ++ " t=" ++ show (seTStat eval)
              ++ (if signalPasses config eval then " ✓" else " ✗ FAIL")

  -- 4. Layer 3: Portfolio backtest
  putStrLn "\n--- Layer 3: Portfolio Performance ---"
  let bestSignal = momentumSignal 63  -- or whichever passed
  let (strat, bench) = evalPortfolio config bestSignal prices
  putStrLn $ "Strategy:  Sharpe=" ++ show (peSharpeRatio strat)
            ++ " Return=" ++ show (peAnnualizedReturn strat)
            ++ " MaxDD=" ++ show (peMaxDrawdown strat)
  putStrLn $ "Benchmark: Sharpe=" ++ show (peSharpeRatio bench)
            ++ " Return=" ++ show (peAnnualizedReturn bench)
            ++ " MaxDD=" ++ show (peMaxDrawdown bench)
  putStrLn $ "Alpha: " ++ show (peAnnualizedReturn strat - peAnnualizedReturn bench)

  -- 5. Summary
  putStrLn "\n=== Summary ==="
  putStrLn $ if peSharpeRatio strat > peSharpeRatio bench
             then "PASS: Strategy outperforms benchmark"
             else "FAIL: Strategy underperforms benchmark"

8. Minimum Viable Eval (do at least these 4)

  1. Walk-forward IC for momentum signal on sector ETFs, 2020-2025 data, 20-day horizon. Must achieve IC > 0.02, t-stat > 2.0.

  2. Null input test for every signal function. Empty data → zero signals.

  3. Portfolio A/B sim on held-out data (2024-2025). Signal-driven Kelly vs equal-weight. Strategy must beat benchmark on Sharpe ratio.

  4. Property tests (QuickCheck): Kelly weights sum ≤ 1, correlation matrix is symmetric with unit diagonal, BL posterior is between prior and views.

If all four pass, Phase 1 signals are validated and we proceed to Phase 2. If any fail, the signal is rejected and we debug before continuing.


9. Anti-Overfitting

The primary guard is the fixed walk-forward split in EvalConfig: anything fitted is fitted on the train window (2020-2023), and every metric in this document is computed only on the held-out test window (2024-2025). The pass thresholds (ecMinIC, ecMinHitRate, ecMinTStat) are fixed in defaultEvalConfig before any signal is evaluated; a signal that fails on held-out data is rejected, not re-tuned against it.

10. Live Monitoring (Phase 4)

-- | Weekly monitoring job. Runs every Monday.
weeklyMonitor :: IO ()
weeklyMonitor = do
  -- Load last 90 days of signals and returns
  signals <- loadRecentSignals 90
  returns <- loadRecentReturns 90

  -- Compute rolling IC per signal type
  let rollingIC = Map.map (\sigs ->
        let aligned = alignWithReturns sigs returns  -- [(Signal, Double)] pairs
        in spearmanCorrelation (map (sigValue . fst) aligned) (map snd aligned)
        ) (groupByType signals)

  -- Alert on degraded signals
  forM_ (Map.toList rollingIC) $ \(sigType, ic) ->
    when (ic < 0.01) $
      alert $ "WARNING: " ++ show sigType ++ " IC dropped to " ++ show ic

  -- Parameter drift check
  let paramHistory = loadParamHistory 30
  when (anyLargeJumps paramHistory) $
    alert "Large parameter jump detected — review signal inputs"