Signal Agent Evaluation Framework (Haskell)
Date: 2026-03-11
Context: All-Haskell eval framework for the speculative fund quant pipeline.
1. Design
The eval framework lives in Omni.Fund.Quant.Eval. It’s pure Haskell — no Python
anywhere. Uses the same data libraries (Data.Market, Data.Edgar, Data.Fred)
as the production signal pipeline, so evals run on the exact same code path.
The eval is a first-class part of the system, not a side script. Every signal must pass eval before it influences portfolio weights.
2. Core Eval Types
module Omni.Fund.Quant.Eval where
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Text (Text)
import Data.Time.Calendar (Day, diffDays, fromGregorian)
import Numeric.LinearAlgebra (Vector, Matrix)
-- | Result of evaluating a single signal type over a historical period.
data SignalEval = SignalEval
{ seName :: Text -- signal type name
, sePeriod :: (Day, Day) -- eval period
, seIC :: Double -- mean Information Coefficient
, seICStd :: Double -- IC standard deviation
, seICIR :: Double -- IC / std(IC) — information ratio
, seHitRate :: Double -- % of periods where IC > 0
, seTStat :: Double -- t-statistic: IC / (std / sqrt(n))
, seNPeriods :: Int -- number of rebalance periods evaluated
, seDecayCurve :: [(Int, Double)] -- (horizon_days, IC_at_horizon)
}
-- | Result of portfolio-level backtest.
data PortfolioEval = PortfolioEval
{ peStrategy :: Text
, pePeriod :: (Day, Day)
, peTotalReturn :: Double
, peAnnualizedReturn :: Double
, peSharpeRatio :: Double
, peMaxDrawdown :: Double
, peAnnualTurnover :: Double
, peWinRate :: Double -- % of periods with positive return
, peCalmarRatio :: Double -- annualized return / max drawdown
}
-- | Walk-forward configuration.
data EvalConfig = EvalConfig
{ ecUniverse :: [Asset]
, ecTrainStart :: Day -- e.g., 2020-01-01
, ecTrainEnd :: Day -- e.g., 2023-12-31
, ecTestStart :: Day -- e.g., 2024-01-01
, ecTestEnd :: Day -- e.g., 2025-12-31
, ecRebalanceFreq :: Int -- days between rebalances (5 = weekly)
, ecHorizons :: [Int] -- forward return horizons to test
, ecMinIC :: Double -- minimum IC to pass (0.02)
, ecMinHitRate :: Double -- minimum hit rate to pass (0.50)
, ecMinTStat :: Double -- minimum t-stat to pass (2.0)
}
defaultEvalConfig :: EvalConfig
defaultEvalConfig = EvalConfig
{ ecUniverse = sectorETFs ++ crossAssets
, ecTrainStart = fromGregorian 2020 1 1
, ecTrainEnd = fromGregorian 2023 12 31
, ecTestStart = fromGregorian 2024 1 1
, ecTestEnd = fromGregorian 2025 12 31
, ecRebalanceFreq = 5
, ecHorizons = [1, 5, 10, 20, 40, 60]
, ecMinIC = 0.02
, ecMinHitRate = 0.50
, ecMinTStat = 2.0
}
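The eval code below also calls a few small helpers (generateRebalDates, mean, standardDeviation) that are never defined in this note. A minimal sketch, with the assumptions spelled out in comments:

```haskell
import Data.Time.Calendar (Day, addDays, fromGregorian)

-- | Rebalance dates every `freq` calendar days from start to end, inclusive.
-- Assumption: calendar-day spacing; a production version would snap each
-- date to the next trading day.
generateRebalDates :: Day -> Day -> Int -> [Day]
generateRebalDates start end freq =
  takeWhile (<= end) (iterate (addDays (fromIntegral freq)) start)

-- | Arithmetic mean; 0 for an empty list.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (max 1 (length xs))

-- | Sample standard deviation (n - 1 denominator); 0 for fewer than 2 points.
standardDeviation :: [Double] -> Double
standardDeviation xs
  | length xs < 2 = 0
  | otherwise =
      let n = fromIntegral (length xs)
          m = mean xs
      in sqrt (sum [(x - m) ^ (2 :: Int) | x <- xs] / (n - 1))
```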
3. Layer 1: Extraction Accuracy
Tests that data libraries return correct values and signal functions compute correctly on known inputs.
-- | Unit tests for deterministic signal computations.
-- Uses known price histories with pre-computed expected values.
-- Golden test: momentum signal on synthetic data
test_momentumSignal :: Test
test_momentumSignal =
  let prices = [100, 102, 105, 103, 108, 110, 112, 115] -- 15% total return
      actual = trailingReturn 8 (toDailyBars prices)
      expected = 0.15 -- (115 - 100) / 100
  in assertApproxEqual "momentum return" 0.01 expected actual
-- Golden test: z-score mean reversion
test_meanReversionZ :: Test
test_meanReversionZ =
  let -- Oscillating series: mean 100, std 10; the final print of 120 sits 2σ above.
      -- (A constant series would have zero variance and an undefined z-score.)
      prices = take 200 (cycle [90, 110]) ++ [120]
      z = meanReversionZ 200 (toDailyBars prices)
  in assertApproxEqual "z-score" 0.1 2.0 z -- ~2σ, assuming the SMA window excludes the current bar
-- Golden test: correlation matrix
test_correlationMatrix :: Test
test_correlationMatrix =
let -- Perfectly correlated series
seriesA = [1, 2, 3, 4, 5] :: [Double]
seriesB = [2, 4, 6, 8, 10] :: [Double]
corr = correlationMatrix [toDailyBars seriesA, toDailyBars seriesB]
in assertApproxEqual "perfect correlation" 0.01 (corr ! (0, 1)) 1.0
-- Null input tests
test_nullSignal :: Test
test_nullSignal =
let signals = momentumSignal 20 Map.empty
in assertEqual "no signals from empty data" 0 (Map.size signals)
-- EDGAR parsing tests (against known filings)
test_edgarParsing :: Test
test_edgarParsing = do
goldenFiling <- loadGolden "test/golden/form4-sample.json"
parsed <- Edgar.parseForm4 goldenFiling
assertEqual "filer name" "Tim Cook" (itFiler parsed)
assertEqual "tx type" Purchase (itTxType parsed)
assertEqual "shares" 100000 (itShares parsed)
QuickCheck properties
-- Signal values are always in valid range
prop_signalBounded :: [Double] -> Property
prop_signalBounded prices =
not (null prices) ==>
let z = meanReversionZ 20 (toDailyBars prices)
in z >= -10 && z <= 10 -- z-scores beyond 10 indicate a bug
-- Correlation matrix is symmetric and diagonal = 1
prop_corrSymmetric :: [[Double]] -> Property
prop_corrSymmetric series =
length series >= 2 && all ((>= 2) . length) series ==>
let m = correlationMatrix (map toDailyBars series)
n = rows m
in m == tr m -- symmetric
&& all (\i -> abs (m ! (i, i) - 1.0) < 0.001) [0..n-1] -- diag = 1
-- Kelly weights sum to <= 1 (no leverage)
prop_kellyNoLeverage :: Vector Double -> Matrix Double -> Property
prop_kellyNoLeverage mu sigma =
isPositiveDefinite sigma ==>
let f = kellyOptimalCorrelated 0.04 mu sigma
in sumElements f <= 1.0 + 1e-10
-- Black-Litterman posterior is between prior and views
prop_blPosteriorBounded :: Property
prop_blPosteriorBounded = forAll genBLInputs $ \(mu, sigma, tau, p, q, omega) ->
let (mu', _) = blUpdate mu sigma tau p q omega
in all (\i -> mu' ! i >= min (mu ! i) (q ! 0) - 0.1
&& mu' ! i <= max (mu ! i) (q ! 0) + 0.1) [0..dim mu - 1]
4. Layer 2: Signal Predictive Power (Walk-Forward IC)
This is the core eval. Pure Haskell implementation.
-- | Spearman rank correlation between two vectors.
-- The d² shortcut below is exact only without ties; with heavy ties, take the
-- Pearson correlation of the ranks instead.
spearmanCorrelation :: [Double] -> [Double] -> Double
spearmanCorrelation xs ys =
let n = fromIntegral (length xs)
rankX = rank xs
rankY = rank ys
diffs = zipWith (-) rankX rankY
d2 = sum (map (^2) diffs)
in 1 - (6 * d2) / (n * (n^2 - 1))
-- | Rank a list (handling ties with average rank).
rank :: [Double] -> [Double]
rank xs = ...
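One possible implementation of rank, left elided above. This is a sketch: any tie convention that assigns each tied group the average of its 1-based positions works.

```haskell
import Data.Function (on)
import Data.List (groupBy, mapAccumL, sortOn)

-- | Ascending 1-based ranks; tied values share the average of their positions.
rank :: [Double] -> [Double]
rank xs =
  let sorted  = sortOn snd (zip [0 :: Int ..] xs)  -- (original index, value)
      grouped = groupBy ((==) `on` snd) sorted     -- runs of tied values
      assign pos grp =
        let k   = length grp
            avg = fromIntegral (2 * pos + k + 1) / 2  -- mean of ranks pos+1 .. pos+k
        in (pos + k, [(i, avg) | (i, _) <- grp])
      ranked  = concat . snd $ mapAccumL assign 0 grouped
  in map snd (sortOn fst ranked)
```

For example, `rank [10, 20, 20, 30]` assigns the tied 20s the average of positions 2 and 3.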
-- | Measure Information Coefficient for a signal over a historical period.
--
-- Walk-forward: at each rebalance date, compute signal scores using only
-- data available as of that date, then correlate with forward returns.
evalSignalIC
:: EvalConfig
-> (Map Text [DailyBar] -> Map Text Signal) -- signal function
-> Map Text [DailyBar] -- full price history
-> SignalEval
evalSignalIC config signalFn allPrices =
let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)
-- For each rebalance date, compute IC
ics = map (\date ->
let -- Slice prices up to this date (no lookahead)
pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
-- Compute signals
signals = signalFn pricesAsOf
-- Get forward returns at the primary 20-day horizon
-- (the decay curve further down re-tests the other horizons)
fwdReturns = Map.map (forwardReturn date 20) allPrices
-- Align: only assets with both signal and return
aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
(sigVals, retVals) = unzip (Map.elems aligned)
in if length sigVals >= 5
then Just (spearmanCorrelation sigVals retVals)
else Nothing
) rebalDates
validICs = catMaybes ics
n = length validICs
meanIC = if n > 0 then sum validICs / fromIntegral n else 0
stdIC = standardDeviation validICs
icir = if stdIC > 0 then meanIC / stdIC else 0
hitRate = fromIntegral (length (filter (> 0) validICs)) / fromIntegral (max 1 n)
tstat = if stdIC > 0 then meanIC / (stdIC / sqrt (fromIntegral n)) else 0
-- Decay curve: IC at multiple horizons
decay = map (\h ->
let ics' = catMaybes $ map (\date ->
let pricesAsOf = Map.map (filter (\b -> dbDate b <= date)) allPrices
signals = signalFn pricesAsOf
fwdReturns = Map.map (forwardReturn date h) allPrices
aligned = Map.intersectionWith (,) (Map.map sigValue signals) fwdReturns
(sv, rv) = unzip (Map.elems aligned)
in if length sv >= 5 then Just (spearmanCorrelation sv rv) else Nothing
) rebalDates
in (h, if null ics' then 0 else sum ics' / fromIntegral (length ics'))
) (ecHorizons config)
in SignalEval
{ seName = "signal" -- placeholder: thread the actual signal name through
, sePeriod = (ecTestStart config, ecTestEnd config)
, seIC = meanIC
, seICStd = stdIC
, seICIR = icir
, seHitRate = hitRate
, seTStat = tstat
, seNPeriods = n
, seDecayCurve = decay
}
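evalSignalIC leans on a forwardReturn helper that is not shown. A sketch, with a local stand-in for the real DailyBar (the dbClose field name is a guess) and the horizon counted in trading bars rather than calendar days:

```haskell
import Data.List (sortOn)
import Data.Time.Calendar (Day, fromGregorian)

-- Local stand-in for the production DailyBar; field names are assumptions.
data DailyBar = DailyBar { dbDate :: Day, dbClose :: Double }

-- | Return from the first bar on/after `asOf` to the bar `h` trading days later.
-- Yields NaN when the horizon extends past available history; callers should
-- drop non-finite values before correlating.
forwardReturn :: Day -> Int -> [DailyBar] -> Double
forwardReturn asOf h bars =
  case dropWhile ((< asOf) . dbDate) (sortOn dbDate bars) of
    future@(b0 : _) ->
      case drop h future of
        bh : _ -> dbClose bh / dbClose b0 - 1
        []     -> 0 / 0  -- NaN: not enough future data
    [] -> 0 / 0
```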
-- | Check if a signal passes minimum quality thresholds.
signalPasses :: EvalConfig -> SignalEval -> Bool
signalPasses config eval =
seIC eval >= ecMinIC config
&& seHitRate eval >= ecMinHitRate config
&& seTStat eval >= ecMinTStat config
5. Layer 3: Portfolio-Level Backtest
-- | Walk-forward portfolio simulation.
-- Compares signal-driven strategy to equal-weight benchmark.
evalPortfolio
:: EvalConfig
-> (Map Text [DailyBar] -> Map Text Signal) -- signal function
-> Map Text [DailyBar] -- full price history
-> (PortfolioEval, PortfolioEval) -- (strategy, benchmark)
evalPortfolio config signalFn allPrices =
let rebalDates = generateRebalDates (ecTestStart config) (ecTestEnd config) (ecRebalanceFreq config)
-- Strategy: signal → alpha → Kelly weights
strategyReturns = walkForwardReturns rebalDates allPrices $ \date pricesAsOf ->
let signals = signalFn pricesAsOf
alphaScores = combineSignals defaultSignalWeights (Map.elems signals)
mu = toVector alphaScores
sigma = correlationMatrix (Map.elems pricesAsOf) -- correlation as a covariance proxy; scale by realized vols for a true covariance
weights = kellyOptimalCorrelated 0.04 mu sigma
in weights
-- Benchmark: equal weight across tradeable universe
benchmarkReturns = walkForwardReturns rebalDates allPrices $ \_ _ ->
let n = length (filter assetTradeable (ecUniverse config))
in fromList (replicate n (1.0 / fromIntegral n))
stratEval = computePortfolioMetrics "signal-driven" (ecTestStart config, ecTestEnd config) strategyReturns
benchEval = computePortfolioMetrics "equal-weight" (ecTestStart config, ecTestEnd config) benchmarkReturns
in (stratEval, benchEval)
-- | Walk forward: at each rebalance date, compute weights, hold until next date,
-- measure return.
walkForwardReturns
:: [Day] -- rebalance dates
-> Map Text [DailyBar] -- price history
-> (Day -> Map Text [DailyBar] -> Vector Double) -- weight function
-> [(Day, Double)] -- (date, period return)
walkForwardReturns dates prices weightFn =
zipWith (\d1 d2 ->
let pricesAsOf = Map.map (filter (\b -> dbDate b <= d1)) prices
weights = weightFn d1 pricesAsOf
-- Period return = weighted sum of individual asset returns
assetReturns = Map.map (periodReturn d1 d2) prices
portfolioReturn = dot weights (toVector assetReturns)
in (d1, portfolioReturn)
) dates (tail dates)
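walkForwardReturns also assumes a periodReturn helper. A sketch under the same assumed DailyBar fields (dbDate, dbClose; the latter is a guessed name), taking the last available close on or before each date:

```haskell
import Data.List (sortOn)
import Data.Time.Calendar (Day, fromGregorian)

-- Local stand-in for the production DailyBar; field names are assumptions.
data DailyBar = DailyBar { dbDate :: Day, dbClose :: Double }

-- | Return between the last closes on or before d1 and d2.
-- Yields 0 when either date has no preceding bar (a conservative default).
periodReturn :: Day -> Day -> [DailyBar] -> Double
periodReturn d1 d2 bars =
  let sorted = sortOn dbDate bars
      closeOn d = case takeWhile ((<= d) . dbDate) sorted of
                    [] -> Nothing
                    bs -> Just (dbClose (last bs))
  in case (closeOn d1, closeOn d2) of
       (Just p1, Just p2) | p1 /= 0 -> p2 / p1 - 1
       _                            -> 0
```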
-- | Compute standard portfolio metrics from a return series.
computePortfolioMetrics :: Text -> (Day, Day) -> [(Day, Double)] -> PortfolioEval
computePortfolioMetrics name period returns =
let rets = map snd returns
cumulative = scanl1 (*) (map (+ 1) rets)
totalReturn = last cumulative - 1
years = fromIntegral (diffDays (snd period) (fst period)) / 365.25
annualized = (1 + totalReturn) ** (1 / years) - 1
periodsPerYear = fromIntegral (length rets) / years
sharpe = mean rets / standardDeviation rets * sqrt periodsPerYear -- returns are per rebalance period, so annualize by the realized period count, not 252
maxDD = maxDrawdown cumulative
winRate = fromIntegral (length (filter (> 0) rets)) / fromIntegral (length rets)
calmar = if maxDD > 0 then annualized / maxDD else 0
in PortfolioEval
{ peStrategy = name
, pePeriod = period
, peTotalReturn = totalReturn
, peAnnualizedReturn = annualized
, peSharpeRatio = sharpe
, peMaxDrawdown = maxDD
, peAnnualTurnover = 0 -- TODO: track from weight changes
, peWinRate = winRate
, peCalmarRatio = calmar
}
maxDrawdown :: [Double] -> Double
maxDrawdown cumReturns =
let peaks = scanl1 max cumReturns
drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
in maximum drawdowns
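A quick sanity check on maxDrawdown (the definition is repeated here so the example stands alone): a wealth curve that peaks at 1.2 and falls to 0.9 has a 25% drawdown.

```haskell
-- Same definition as in the text, repeated so this example is self-contained.
maxDrawdown :: [Double] -> Double
maxDrawdown cumReturns =
  let peaks = scanl1 max cumReturns
      drawdowns = zipWith (\peak val -> (peak - val) / peak) peaks cumReturns
  in maximum drawdowns

-- maxDrawdown [1.0, 1.2, 0.9, 1.1]
--   peaks     = [1.0, 1.2, 1.2, 1.2]
--   drawdowns = [0, 0, 0.25, 1/12]  -> 0.25
```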
6. Agent-Specific Evals (Phase 3)
These only apply when LLM agents are involved (insider + macro agents).
6.1 Hallucination Detection
-- | Feed the agent empty/null data. It must produce zero signals.
evalNullInput :: Op () -> IO Bool
evalNullInput agent = do
result <- runOp agent emptyContext
let signals = extractSignals result
pure (null signals)
-- | Feed the agent contradictory data. It must produce low confidence.
evalContradictoryInput :: Op () -> IO Bool
evalContradictoryInput agent = do
result <- runOp agent contradictoryContext
let signals = extractSignals result
pure (all (\s -> sigConfidence s < 0.3) signals)
6.2 Confidence Calibration
-- | Calibration curve: group signals by confidence bucket, measure actual accuracy.
-- A well-calibrated agent's 0.8-confidence signals should be correct ~80% of the time.
evalCalibration
:: [(Signal, Double)] -- (signal, actual forward return)
-> [(Double, Double, Int)] -- (predicted_confidence, actual_accuracy, n)
evalCalibration signalReturns =
let buckets = [0.0, 0.1 .. 1.0]
grouped = map (\lo ->
let hi = lo + 0.1
inBucket = filter (\(s, _) -> sigConfidence s >= lo && sigConfidence s < hi) signalReturns
n = length inBucket
correct = length (filter (\(s, r) -> signum (sigValue s) == signum r) inBucket)
accuracy = if n > 0 then fromIntegral correct / fromIntegral n else 0
in (lo + 0.05, accuracy, n)
) (init buckets)
in grouped
-- | Calibration error: mean absolute difference between confidence and accuracy.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
let weighted = [(abs (pred - actual)) * fromIntegral n | (pred, actual, n) <- curve, n > 0]
totalN = sum [n | (_, _, n) <- curve]
in sum weighted / fromIntegral (max 1 totalN)
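To make the sample-weighting concrete, here is calibrationError (repeated so the snippet runs standalone) on a two-bucket curve: a perfectly calibrated bucket with 10 samples and a bucket off by 0.05 with 20 samples give an error of 0.05 * 20 / 30 ≈ 0.033.

```haskell
-- Same definition as in the text, repeated so this example is self-contained.
calibrationError :: [(Double, Double, Int)] -> Double
calibrationError curve =
  let weighted = [abs (p - a) * fromIntegral n | (p, a, n) <- curve, n > 0]
      totalN = sum [n | (_, _, n) <- curve]
  in sum weighted / fromIntegral (max 1 totalN)

-- calibrationError [(0.55, 0.55, 10), (0.75, 0.70, 20)]
--   = (0 * 10 + 0.05 * 20) / 30 ≈ 0.033
```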
6.3 Consistency (Multi-Run Variance)
-- | Run the agent N times on the same input. Signal values should be stable.
evalConsistency :: Int -> Op () -> Context -> IO Double
evalConsistency n agent ctx = do
results <- replicateM n (runOp agent ctx)
let signalSets = map extractSignals results
-- For each asset, compute variance of signal values across runs.
-- Assumes each run emits signals in the same asset order; sort by asset first if not.
variances = map (variance . map sigValue) (transpose signalSets)
pure (mean variances)
6.4 Value-Add Test
-- | Does the LLM agent add value beyond deterministic signals?
-- Compare IC(deterministic + agent) vs IC(deterministic only).
evalAgentValueAdd
:: EvalConfig
-> (Map Text [DailyBar] -> Map Text Signal) -- deterministic signals only
-> (Map Text [DailyBar] -> IO (Map Text Signal)) -- deterministic + agent
-> Map Text [DailyBar] -- price history
-> IO (SignalEval, SignalEval, Double) -- (without, with, IC delta)
evalAgentValueAdd config deterministicFn agentFn prices = do
let evalDet = evalSignalIC config deterministicFn prices
evalAgent <- evalSignalICWithIO config agentFn prices
pure (evalDet, evalAgent, seIC evalAgent - seIC evalDet)
7. Eval Runner
-- | Run the full eval suite and print results.
runEvalSuite :: EvalConfig -> IO ()
runEvalSuite config = do
putStrLn "=== Signal Agent Eval Suite ==="
-- 1. Load historical data
putStrLn "Loading historical price data..."
prices <- loadHistoricalPrices config
-- 2. Layer 1: Unit tests
putStrLn "\n--- Layer 1: Extraction Accuracy ---"
runUnitTests
-- 3. Layer 2: IC per signal type
putStrLn "\n--- Layer 2: Signal Predictive Power ---"
let signals = [("momentum", momentumSignal 63), ("mean_rev", meanReversionSignal 200),
("vol_regime", volRegimeSignal)]
forM_ signals $ \(name, fn) -> do
let eval = evalSignalIC config fn prices
putStrLn $ name ++ ": IC=" ++ show (seIC eval)
++ " ICIR=" ++ show (seICIR eval)
++ " hit=" ++ show (seHitRate eval)
++ " t=" ++ show (seTStat eval)
++ (if signalPasses config eval then " ✓" else " ✗ FAIL")
-- 4. Layer 3: Portfolio backtest
putStrLn "\n--- Layer 3: Portfolio Performance ---"
let bestSignal = momentumSignal 63 -- or whichever passed
let (strat, bench) = evalPortfolio config bestSignal prices
putStrLn $ "Strategy: Sharpe=" ++ show (peSharpeRatio strat)
++ " Return=" ++ show (peAnnualizedReturn strat)
++ " MaxDD=" ++ show (peMaxDrawdown strat)
putStrLn $ "Benchmark: Sharpe=" ++ show (peSharpeRatio bench)
++ " Return=" ++ show (peAnnualizedReturn bench)
++ " MaxDD=" ++ show (peMaxDrawdown bench)
putStrLn $ "Alpha: " ++ show (peAnnualizedReturn strat - peAnnualizedReturn bench)
-- 5. Summary
putStrLn "\n=== Summary ==="
putStrLn $ if peSharpeRatio strat > peSharpeRatio bench
then "PASS: Strategy outperforms benchmark"
else "FAIL: Strategy underperforms benchmark"
8. Minimum Viable Eval (do at least these 4)
- Walk-forward IC for momentum signal on sector ETFs, 2020-2025 data, 20-day horizon. Must achieve IC > 0.02, t-stat > 2.0.
- Null input test for every signal function. Empty data → zero signals.
- Portfolio A/B sim on held-out data (2024-2025). Signal-driven Kelly vs equal-weight. Strategy must beat benchmark on Sharpe ratio.
- Property tests (QuickCheck): Kelly weights sum ≤ 1, correlation matrix is symmetric with unit diagonal, BL posterior is between prior and views.
If all four pass, Phase 1 signals are validated and we proceed to Phase 2. If any fail, the signal is rejected and we debug before continuing.
9. Anti-Overfitting
- Strict temporal separation: Train on 2020-2023. Test on 2024-2025. Never cross.
- Multiple testing correction: If testing N signal variants, require t-stat > 2.0 + log₁₀(N). For 5 signal types: t > 2.7.
- Regime robustness: IC must be positive in both 2020-2021 (bull/crash) and 2022-2023 (bear/recovery). A signal that only works in one regime is fragile.
- No parameter tuning on test set. All hyperparameters (lookback windows, signal weights) are fixed before the test period. We don’t peek.
- Paper trading period: After backtests pass, run live for 1-3 months before committing capital. Compare live predictions to realized returns.
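The multiple-testing rule reduces to a one-liner (assuming the base-10 reading of the log, which matches the worked number in the bullet above):

```haskell
-- | t-stat threshold that grows with the number of signal variants tested.
-- Assumption: base-10 logarithm.
requiredTStat :: Int -> Double
requiredTStat nVariants = 2.0 + logBase 10 (fromIntegral nVariants)
-- requiredTStat 5 ≈ 2.7
```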
10. Live Monitoring (Phase 4)
-- | Weekly monitoring job. Runs every Monday.
weeklyMonitor :: IO ()
weeklyMonitor = do
-- Load last 90 days of signals and returns
signals <- loadRecentSignals 90
returns <- loadRecentReturns 90
-- Compute rolling IC per signal type
let rollingIC = Map.map (\sigs ->
let aligned = alignWithReturns sigs returns
in spearmanCorrelation (map (sigValue . fst) aligned) (map snd aligned)
) (groupByType signals)
-- Alert on degraded signals
forM_ (Map.toList rollingIC) $ \(sigType, ic) ->
when (ic < 0.01) $
alert $ "WARNING: " ++ show sigType ++ " IC dropped to " ++ show ic
-- Parameter drift check
paramHistory <- loadParamHistory 30
when (anyLargeJumps paramHistory) $
alert "Large parameter jump detected — review signal inputs"