Signal Agent Evaluation Framework (Haskell)

Date: 2026-03-11
Context: All-Haskell eval framework for the speculative fund quant pipeline.


1. Design

The eval framework lives in Omni.Fund.Quant.Eval. It’s pure Haskell — no Python anywhere. Uses the same data libraries (Data.Market, Data.Edgar, Data.Fred) as the production signal pipeline, so evals run on the exact same code path.

The eval is a first-class part of the system, not a side script. Every signal must pass eval before it influences portfolio weights.


2. Core Eval Types

module Omni.Fund.Quant.Eval where

import Control.Monad (forM_, replicateM, when)
import Data.Function (on)
import Data.List (groupBy, sortOn, transpose)
import qualified Data.Map.Strict as Map
import Data.Maybe (catMaybes)
import Data.Text (Text)
import Data.Time.Calendar (Day, diffDays, fromGregorian)
import Numeric.LinearAlgebra (Vector, Matrix)

-- | Result of evaluating a single signal type over a historical period.
data SignalEval = SignalEval
  { seName        :: Text           -- signal type name
  , sePeriod      :: (Day, Day)     -- eval period
  , seIC          :: Double         -- mean Information Coefficient
  , seICStd       :: Double         -- IC standard deviation
  , seICIR        :: Double         -- IC / std(IC) — information ratio
  , seHitRate     :: Double         -- % of periods where IC > 0
  , seTStat       :: Double         -- t-statistic: IC / (std / sqrt(n))
  , seNPeriods    :: Int            -- number of rebalance periods evaluated
  , seDecayCurve  :: [(Int, Double)] -- (horizon_days, IC_at_horizon)
  }

-- | Result of portfolio-level backtest.
data PortfolioEval = PortfolioEval
  { peStrategy        :: Text
  , pePeriod          :: (Day, Day)
  , peTotalReturn     :: Double
  , peAnnualizedReturn :: Double
  , peSharpeRatio     :: Double
  , peMaxDrawdown     :: Double
  , peAnnualTurnover  :: Double
  , peWinRate         :: Double      -- % of periods with positive return
  , peCalmarRatio     :: Double      -- annualized return / max drawdown
  }

-- | Walk-forward configuration.
data EvalConfig = EvalConfig
  { ecUniverse      :: [Asset]
  , ecTrainStart    :: Day          -- e.g., 2020-01-01
  , ecTrainEnd      :: Day          -- e.g., 2023-12-31
  , ecTestStart     :: Day          -- e.g., 2024-01-01
  , ecTestEnd       :: Day          -- e.g., 2025-12-31
  , ecRebalanceFreq :: Int          -- days between rebalances (5 = weekly)
  , ecHorizons      :: [Int]        -- forward return horizons to test
  , ecMinIC         :: Double       -- minimum IC to pass (0.02)
  , ecMinHitRate    :: Double       -- minimum hit rate to pass (0.50)
  , ecMinTStat      :: Double       -- minimum t-stat to pass (2.0)
  }

defaultEvalConfig :: EvalConfig
defaultEvalConfig = EvalConfig
  { ecUniverse = sectorETFs ++ crossAssets
  , ecTrainStart = fromGregorian 2020 1 1
  , ecTrainEnd = fromGregorian 2023 12 31
  , ecTestStart = fromGregorian 2024 1 1
  , ecTestEnd = fromGregorian 2025 12 31
  , ecRebalanceFreq = 5
  , ecHorizons = [1, 5, 10, 20, 40, 60]
  , ecMinIC = 0.02
  , ecMinHitRate = 0.50
  , ecMinTStat = 2.0
  }
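The walk-forward code in later sections calls a `generateRebalDates` helper that is not shown. A minimal calendar-day sketch (an assumption; a production version would step through the exchange trading calendar instead):

```haskell
import Data.Time.Calendar (Day, addDays, fromGregorian)

-- | Every Nth calendar day in [start, end]. Sketch only: a real version
-- would snap each date to the nearest trading day.
generateRebalDates :: Day -> Day -> Int -> [Day]
generateRebalDates start end freq =
  takeWhile (<= end) (iterate (addDays (fromIntegral freq)) start)

main :: IO ()
main = print (generateRebalDates (fromGregorian 2024 1 1) (fromGregorian 2024 1 20) 5)
-- four dates: Jan 1, 6, 11, 16
```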

3. Layer 1: Extraction Accuracy

Tests that data libraries return correct values and signal functions compute correctly on known inputs.

-- | Unit tests for deterministic signal computations.
-- Uses known price histories with pre-computed expected values.

-- Golden test: momentum signal on synthetic data
test_momentumSignal :: Test
test_momentumSignal =
  let prices = [100, 102, 105, 103, 108, 110, 112, 115]  -- 15% total return
      actual = trailingReturn 8 (toDailyBars prices)
      expected = 0.15  -- (115 - 100) / 100
  in assertApproxEqual "momentum return" 0.01 expected actual

-- Golden test: z-score mean reversion
test_meanReversionZ :: Test
test_meanReversionZ =
  let -- Alternating 99/101 closes give mean 100 and std ~1, so a close of
      -- 102 sits ~2σ above the 200-day mean. (An all-constant history would
      -- have zero std and an undefined z-score.)
      prices = take 200 (cycle [99, 101]) ++ [102]
      z = meanReversionZ 200 (toDailyBars prices)
  in assertApproxEqual "z-score" 0.1 2.0 z  -- expected ~2σ

-- Golden test: correlation matrix
test_correlationMatrix :: Test
test_correlationMatrix =
  let -- Perfectly correlated series
      seriesA = [1, 2, 3, 4, 5] :: [Double]
      seriesB = [2, 4, 6, 8, 10] :: [Double]
      corr = correlationMatrix [toDailyBars seriesA, toDailyBars seriesB]
  in assertApproxEqual "perfect correlation" 0.01 (corr ! (0, 1)) 1.0

-- Null input tests
test_nullSignal :: Test
test_nullSignal =
  let signals = momentumSignal 20 Map.empty
  in assertEqual "no signals from empty data" 0 (Map.size signals)

-- EDGAR parsing tests (against known filings)
test_edgarParsing :: Test
test_edgarParsing = do
  let goldenFiling = loadGolden "test/golden/form4-sample.json"
  parsed <- Edgar.parseForm4 goldenFiling
  assertEqual "filer name" "Tim Cook" (itFiler parsed)
  assertEqual "tx type" Purchase (itTxType parsed)
  assertEqual "shares" 100000 (itShares parsed)
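The `assertApproxEqual` helper used in the golden tests above is assumed to have an HUnit-style shape along these lines (the `Test` wrapper is elided in this sketch):

```haskell
-- | Assumed shape: fail with a message unless expected and actual agree
-- to within the given tolerance.
assertApproxEqual :: String -> Double -> Double -> Double -> IO ()
assertApproxEqual msg tol expected actual
  | abs (expected - actual) <= tol = pure ()
  | otherwise =
      error (msg ++ ": expected " ++ show expected ++ " +/- " ++ show tol
                 ++ ", got " ++ show actual)

main :: IO ()
main = assertApproxEqual "sanity" 0.01 0.15 0.1525  -- within tolerance, no error
```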

QuickCheck properties

-- Signal values are always in valid range
prop_signalBounded :: [Double] -> Property
prop_signalBounded prices =
  not (null prices) ==>
    let z = meanReversionZ 20 (toDailyBars prices)
    in z >= -10 && z <= 10  -- z-scores beyond 10 indicate a bug

-- Correlation matrix is symmetric and diagonal = 1
prop_corrSymmetric :: [[Double]] -> Property
prop_corrSymmetric series =
  length series >= 2 && all ((>= 2) . length) series ==>
    let m = correlationMatrix (map toDailyBars series)
        n = rows m
    in m == tr m  -- symmetric
       && all (\i -> abs (m ! (i, i) - 1.0) < 0.001) [0..n-1]  -- diag = 1

-- Kelly weights sum to <= 1 (no leverage)
prop_kellyNoLeverage :: Vector Double -> Matrix Double -> Property
prop_kellyNoLeverage mu sigma =
  isPositiveDefinite sigma ==>
    let f = kellyOptimalCorrelated 0.04 mu sigma
    in sumElements f <= 1.0 + 1e-10

-- Black-Litterman posterior lies between prior and views (single-view case, hence q ! 0)
prop_blPosteriorBounded :: Property
prop_blPosteriorBounded = forAll genBLInputs $ \(mu, sigma, tau, p, q, omega) ->
  let (mu', _) = blUpdate mu sigma tau p q omega
  in all (\i -> mu' ! i >= min (mu ! i) (q ! 0) - 0.1
              && mu' ! i <= max (mu ! i) (q ! 0) + 0.1) [0..dim mu - 1]

4. Layer 2: Signal Predictive Power (Walk-Forward IC)

This is the core eval. Pure Haskell implementation.

-- | Spearman rank correlation between two vectors.
spearmanCorrelation :: [Double] -> [Double] -> Double
spearmanCorrelation xs ys =
  let n = fromIntegral (length xs)
      rankX = rank xs
      rankY = rank ys
      diffs = zipWith (-) rankX rankY
      d2 = sum (map (^2) diffs)
  in 1 - (6 * d2) / (n * (n^2 - 1))

-- | Rank a list (handling ties with average rank).
rank :: [Double] -> [Double]
rank xs =
  let groups = groupBy ((==) `on` snd) (sortOn snd (zip [0 :: Int ..] xs))
      starts = scanl (+) (1 :: Int) (map length groups)
      ranked = concat [ [ (i, fromIntegral s + (fromIntegral (length g) - 1) / 2)
                        | (i, _) <- g ]
                      | (g, s) <- zip groups starts ]
  in map snd (sortOn fst ranked)
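As a sanity check, the Spearman formula above can be exercised standalone. A self-contained sketch (the simplified `rank'` here assumes distinct values, i.e. no ties):

```haskell
import Data.List (sortOn)

-- | Ordinal ranks, 1-based; adequate when all values are distinct.
rank' :: [Double] -> [Double]
rank' xs = map snd (sortOn fst (zip (map fst (sortOn snd (zip [0 :: Int ..] xs))) [1 ..]))

-- | Spearman rank correlation via the rank-difference formula.
spearman :: [Double] -> [Double] -> Double
spearman xs ys =
  let n  = fromIntegral (length xs)
      d2 = sum (map (^ (2 :: Int)) (zipWith (-) (rank' xs) (rank' ys)))
  in 1 - (6 * d2) / (n * (n ^ (2 :: Int) - 1))

main :: IO ()
main = do
  print (spearman [1, 2, 3, 4, 5] [2, 4, 6, 8, 10])   -- monotone increasing: 1.0
  print (spearman [1, 2, 3, 4, 5] [10, 8, 6, 4, 2])   -- monotone decreasing: -1.0
```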

-- | Measure Information Coefficient for a signal over a historical period.
--
-- Walk-forward: at each rebalance date, compute signal scores using only
-- data available as of that date, then correlate with forward returns.
evalSignalIC
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- signal function
  -> Map Text [DailyBar]                        -- full price history
  -> SignalEval
evalSignalIC config signalFn allPrices =
  let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)

      -- For each rebalance date, compute IC
      ics = map (\date ->
        let -- Slice prices up to this date (no lookahead)
            pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
            -- Compute signals
            signals = signalFn pricesAsOf
            -- Get 20-day forward returns (the base eval horizon; the decay
            -- curve below covers the other horizons)
            fwdReturns = Map.map (forwardReturn date 20) allPrices
            -- Align: only assets with both signal and return
            aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
            (sigVals, retVals) = unzip (Map.elems aligned)
        in if length sigVals >= 5
           then Just (spearmanCorrelation sigVals retVals)
           else Nothing
        ) rebalDates

      validICs = catMaybes ics
      n = length validICs
      meanIC = if n > 0 then sum validICs / fromIntegral n else 0
      stdIC = standardDeviation validICs
      icir = if stdIC > 0 then meanIC / stdIC else 0
      hitRate = fromIntegral (length (filter (> 0) validICs)) / fromIntegral (max 1 n)
      tstat = if stdIC > 0 then meanIC / (stdIC / sqrt (fromIntegral n)) else 0

      -- Decay curve: IC at multiple horizons
      decay = map (\h ->
        let ics' = catMaybes $ map (\date ->
              let pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
                  signals = signalFn pricesAsOf
                  fwdReturns = Map.map (forwardReturn date h) allPrices
                  aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
                  (sv, rv) = unzip (Map.elems aligned)
              in if length sv >= 5 then Just (spearmanCorrelation sv rv) else Nothing
              ) rebalDates
        in (h, if null ics' then 0 else sum ics' / fromIntegral (length ics'))
        ) (ecHorizons config)

  in SignalEval
       { seName = "signal"
       , sePeriod = (ecTestStart config, ecTestEnd config)
       , seIC = meanIC
       , seICStd = stdIC
       , seICIR = icir
       , seHitRate = hitRate
       , seTStat = tstat
       , seNPeriods = n
       , seDecayCurve = decay
       }

-- | Check if a signal passes minimum quality thresholds.
signalPasses :: EvalConfig -> SignalEval -> Bool
signalPasses config eval =
  seIC eval >= ecMinIC config
  && seHitRate eval >= ecMinHitRate config
  && seTStat eval >= ecMinTStat config
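`evalSignalIC` above also relies on a `standardDeviation` helper that is not shown; it is assumed to be the usual sample estimator:

```haskell
-- | Sample standard deviation; 0 for fewer than two observations.
standardDeviation :: [Double] -> Double
standardDeviation xs
  | length xs < 2 = 0
  | otherwise =
      let n = fromIntegral (length xs)
          m = sum xs / n
      in sqrt (sum [ (x - m) ^ (2 :: Int) | x <- xs ] / (n - 1))

main :: IO ()
main = print (standardDeviation [2, 4, 4, 4, 5, 5, 7, 9])  -- sqrt (32/7), ~2.138
```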

5. Layer 3: Portfolio-Level Backtest

-- | Walk-forward portfolio simulation.
-- Compares signal-driven strategy to equal-weight benchmark.
evalPortfolio
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- signal function
  -> Map Text [DailyBar]                        -- full price history
  -> (PortfolioEval, PortfolioEval)             -- (strategy, benchmark)
evalPortfolio config signalFn allPrices =
  let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)

      -- Strategy: signal → alpha → Kelly weights
      strategyReturns = walkForwardReturns rebalDates allPrices $ \date pricesAsOf ->
        let signals = signalFn pricesAsOf
            alphaScores = combineSignals defaultSignalWeights (Map.elems signals)
            mu = toVector alphaScores
            sigma = covarianceMatrix (Map.elems pricesAsOf)  -- realized covariance; Kelly sizing needs covariance, not correlation
            weights = kellyOptimalCorrelated 0.04 mu sigma
        in weights

      -- Benchmark: equal weight across tradeable universe
      benchmarkReturns = walkForwardReturns rebalDates allPrices $ \_ _ ->
        let n = length (filter assetTradeable (ecUniverse config))
        in fromList (replicate n (1.0 / fromIntegral n))

      stratEval = computePortfolioMetrics "signal-driven" (ecTestStart config, ecTestEnd config) strategyReturns
      benchEval = computePortfolioMetrics "equal-weight" (ecTestStart config, ecTestEnd config) benchmarkReturns

  in (stratEval, benchEval)

-- | Walk forward: at each rebalance date, compute weights, hold until next date,
-- measure return.
walkForwardReturns
  :: [Day]                                        -- rebalance dates
  -> Map Text [DailyBar]                          -- price history
  -> (Day -> Map Text [DailyBar] -> Vector Double) -- weight function
  -> [(Day, Double)]                              -- (date, period return)
walkForwardReturns dates prices weightFn =
  zipWith (\d1 d2 ->
    let pricesAsOf = Map.map (filter (\b -> dbDate b <= d1)) prices
        weights = weightFn d1 pricesAsOf
        -- Period return = weighted sum of individual asset returns
        assetReturns = Map.map (periodReturn d1 d2) prices
        portfolioReturn = dot weights (toVector assetReturns)
    in (d1, portfolioReturn)
    ) dates (tail dates)
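`periodReturn` is another assumed helper. A simplified sketch using a date-keyed close map instead of the `DailyBar` list (`Map.lookupLE` picks the last close on or before each date, which handles non-trading days):

```haskell
import qualified Data.Map.Strict as Map
import Data.Time.Calendar (Day, fromGregorian)

-- | Simple return between two dates, using the last known close on or
-- before each date. Returns 0 when either price is missing.
periodReturn :: Day -> Day -> Map.Map Day Double -> Double
periodReturn d1 d2 closes =
  case (Map.lookupLE d1 closes, Map.lookupLE d2 closes) of
    (Just (_, p1), Just (_, p2)) | p1 /= 0 -> p2 / p1 - 1
    _                                      -> 0

main :: IO ()
main =
  let closes = Map.fromList [ (fromGregorian 2024 1 2, 100)
                            , (fromGregorian 2024 1 9, 105) ]
  in print (periodReturn (fromGregorian 2024 1 2) (fromGregorian 2024 1 9) closes)  -- ~0.05
```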

-- | Compute standard portfolio metrics from a return series.
computePortfolioMetrics :: Text -> (Day, Day) -> [(Day, Double)] -> PortfolioEval
computePortfolioMetrics name period returns =
  let rets = map snd returns
      cumulative = scanl1 (*) (map (+ 1) rets)
      totalReturn = last cumulative - 1
      years = fromIntegral (diffDays (snd period) (fst period)) / 365.25
      annualized = (1 + totalReturn) ** (1 / years) - 1
      -- Returns here are per rebalance period, not daily, so annualize the
      -- Sharpe by the observed number of periods per year, not sqrt 252.
      periodsPerYear = fromIntegral (length rets) / years
      sharpe = mean rets / standardDeviation rets * sqrt periodsPerYear
      maxDD = maxDrawdown cumulative
      winRate = fromIntegral (length (filter (> 0) rets)) / fromIntegral (length rets)
      calmar = if maxDD > 0 then annualized / maxDD else 0
  in PortfolioEval
       { peStrategy = name
       , pePeriod = period
       , peTotalReturn = totalReturn
       , peAnnualizedReturn = annualized
       , peSharpeRatio = sharpe
       , peMaxDrawdown = maxDD
       , peAnnualTurnover = 0  -- TODO: track from weight changes
       , peWinRate = winRate
       , peCalmarRatio = calmar
       }

maxDrawdown :: [Double] -> Double
maxDrawdown [] = 0  -- no observations, no drawdown (avoids 'maximum' on [])
maxDrawdown cumReturns =
  let peaks = scanl1 max cumReturns
      drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
  in maximum drawdowns
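A quick runnable check of the drawdown computation (the function is restated, with an empty-list guard, so the snippet is self-contained):

```haskell
maxDrawdown :: [Double] -> Double
maxDrawdown [] = 0
maxDrawdown cumReturns =
  let peaks = scanl1 max cumReturns
      drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
  in maximum drawdowns

-- Peak 1.2 followed by trough 0.9: drawdown (1.2 - 0.9) / 1.2 = 0.25.
main :: IO ()
main = print (maxDrawdown [1.0, 1.2, 0.9, 1.1, 1.3])
```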

6. Agent-Specific Evals (Phase 3)

These only apply when LLM agents are involved (insider + macro agents).

6.1 Hallucination Detection

-- | Feed the agent empty/null data. It must produce zero signals.
evalNullInput :: Op () -> IO Bool
evalNullInput agent = do
  result <- runOp agent emptyContext
  let signals = extractSignals result
  pure (null signals)

-- | Feed the agent contradictory data. It must produce low confidence.
evalContradictoryInput :: Op () -> IO Bool
evalContradictoryInput agent = do
  result <- runOp agent contradictoryContext
  let signals = extractSignals result
  pure (all (\s -> sigConfidence s < 0.3) signals)

6.2 Confidence Calibration

-- | Calibration curve: group signals by confidence bucket, measure actual accuracy.
-- A well-calibrated agent's 0.8-confidence signals should be correct ~80% of the time.
evalCalibration
  :: [(Signal, Double)]  -- (signal, actual forward return)
  -> [(Double, Double, Int)]  -- (predicted_confidence, actual_accuracy, n)
evalCalibration signalReturns =
  let buckets = [0.0, 0.1 .. 1.0]
      grouped = map (\lo ->
        let hi = lo + 0.1
            inBucket = filter (\(s, _) -> sigConfidence s >= lo && sigConfidence s < hi) signalReturns
            n = length inBucket
            correct = length (filter (\(s, r) -> signum (sigValue s) == signum r) inBucket)
            accuracy = if n > 0 then fromIntegral correct / fromIntegral n else 0
        in (lo + 0.05, accuracy, n)
        ) (init buckets)
  in grouped

-- | Calibration error: mean absolute difference between confidence and accuracy.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
  let weighted = [abs (p - actual) * fromIntegral n | (p, actual, n) <- curve, n > 0]
      totalN = sum [n | (_, _, n) <- curve]
  in sum weighted / fromIntegral (max 1 totalN)
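A small worked example: a bucket predicting 55% confidence that was right 50% of the time (n=10), plus one predicting 85% that was right 70% of the time (n=20), gives a weighted error of (0.05·10 + 0.15·20) / 30 ≈ 0.117. Restating the function so the check is self-contained:

```haskell
-- | Mean absolute calibration gap, weighted by bucket size.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
  let weighted = [ abs (p - actual) * fromIntegral n | (p, actual, n) <- curve, n > 0 ]
      totalN = sum [ n | (_, _, n) <- curve ]
  in sum weighted / fromIntegral (max 1 totalN)

main :: IO ()
main = print (calibrationError [(0.55, 0.50, 10), (0.85, 0.70, 20)])  -- ~0.117
```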

6.3 Consistency (Multi-Run Variance)

-- | Run the agent N times on the same input. Signal values should be stable.
evalConsistency :: Int -> Op () -> Context -> IO Double
evalConsistency n agent ctx = do
  results <- replicateM n (runOp agent ctx)
  let signalSets = map extractSignals results
      -- Per-asset variance of signal values across runs (assumes every run
      -- emits signals in the same asset order)
      variances = map (variance . map sigValue) (transpose signalSets)
  pure (mean variances)

6.4 Value-Add Test

-- | Does the LLM agent add value beyond deterministic signals?
-- Compare IC(deterministic + agent) vs IC(deterministic only).
evalAgentValueAdd
  :: EvalConfig
  -> (Map Text [DailyBar] -> Map Text Signal)  -- deterministic signals only
  -> (Map Text [DailyBar] -> IO (Map Text Signal))  -- deterministic + agent
  -> Map Text [DailyBar]  -- price history
  -> IO (SignalEval, SignalEval, Double)  -- (without, with, IC delta)
evalAgentValueAdd config deterministicFn agentFn prices = do
  let evalDet = evalSignalIC config deterministicFn prices
  evalAgent <- evalSignalICWithIO config agentFn prices
  pure (evalDet, evalAgent, seIC evalAgent - seIC evalDet)

7. Eval Runner

-- | Run the full eval suite and print results.
runEvalSuite :: EvalConfig -> IO ()
runEvalSuite config = do
  putStrLn "=== Signal Agent Eval Suite ==="

  -- 1. Load historical data
  putStrLn "Loading historical price data..."
  prices <- loadHistoricalPrices config

  -- 2. Layer 1: Unit tests
  putStrLn "\n--- Layer 1: Extraction Accuracy ---"
  runUnitTests

  -- 3. Layer 2: IC per signal type
  putStrLn "\n--- Layer 2: Signal Predictive Power ---"
  let signals = [("momentum", momentumSignal 63), ("mean_rev", meanReversionSignal 200),
                 ("vol_regime", volRegimeSignal)]
  forM_ signals $ \(name, fn) -> do
    let eval = evalSignalIC config fn prices
    putStrLn $ name ++ ": IC=" ++ show (seIC eval)
              ++ " ICIR=" ++ show (seICIR eval)
              ++ " hit=" ++ show (seHitRate eval)
              ++ " t=" ++ show (seTStat eval)
              ++ (if signalPasses config eval then " ✓" else " ✗ FAIL")

  -- 4. Layer 3: Portfolio backtest
  putStrLn "\n--- Layer 3: Portfolio Performance ---"
  let bestSignal = momentumSignal 63  -- or whichever passed
  let (strat, bench) = evalPortfolio config bestSignal prices
  putStrLn $ "Strategy:  Sharpe=" ++ show (peSharpeRatio strat)
            ++ " Return=" ++ show (peAnnualizedReturn strat)
            ++ " MaxDD=" ++ show (peMaxDrawdown strat)
  putStrLn $ "Benchmark: Sharpe=" ++ show (peSharpeRatio bench)
            ++ " Return=" ++ show (peAnnualizedReturn bench)
            ++ " MaxDD=" ++ show (peMaxDrawdown bench)
  putStrLn $ "Alpha: " ++ show (peAnnualizedReturn strat - peAnnualizedReturn bench)

  -- 5. Summary
  putStrLn "\n=== Summary ==="
  putStrLn $ if peSharpeRatio strat > peSharpeRatio bench
             then "PASS: Strategy outperforms benchmark"
             else "FAIL: Strategy underperforms benchmark"

8. Minimum Viable Eval (do at least these 4)

  1. Walk-forward IC for momentum signal on sector ETFs, 2020-2025 data, 20-day horizon. Must achieve IC > 0.02, t-stat > 2.0.

  2. Null input test for every signal function. Empty data → zero signals.

  3. Portfolio A/B sim on held-out data (2024-2025). Signal-driven Kelly vs equal-weight. Strategy must beat benchmark on Sharpe ratio.

  4. Property tests (QuickCheck): Kelly weights sum ≤ 1, correlation matrix is symmetric with unit diagonal, BL posterior is between prior and views.

If all four pass, Phase 1 signals are validated and we proceed to Phase 2. If any fail, the signal is rejected and we debug before continuing.


9. Anti-Overfitting

The primary guard is the fixed walk-forward split in EvalConfig: anything fitted is fitted on the train window (2020-2023), and every metric in this document is computed only on the held-out test window (2024-2025). The pass thresholds (ecMinIC, ecMinHitRate, ecMinTStat) are fixed in defaultEvalConfig before any signal is evaluated; a signal that fails on held-out data is rejected, not re-tuned against it.

10. Live Monitoring (Phase 4)

-- | Weekly monitoring job. Runs every Monday.
weeklyMonitor :: IO ()
weeklyMonitor = do
  -- Load last 90 days of signals and returns
  signals <- loadRecentSignals 90
  returns <- loadRecentReturns 90

  -- Compute rolling IC per signal type
  let rollingIC = Map.map (\sigs ->
        let aligned = alignWithReturns sigs returns  -- [(Signal, Double)] pairs
        in spearmanCorrelation (map (sigValue . fst) aligned) (map snd aligned)
        ) (groupByType signals)

  -- Alert on degraded signals
  forM_ (Map.toList rollingIC) $ \(sigType, ic) ->
    when (ic < 0.01) $
      alert $ "WARNING: " ++ show sigType ++ " IC dropped to " ++ show ic

  -- Parameter drift check
  let paramHistory = loadParamHistory 30
  when (anyLargeJumps paramHistory) $
    alert "Large parameter jump detected — review signal inputs"