Signal Agent Evaluation Framework
Date: 2026-03-11
Context: Ben wants evals to validate signal agent quality before trusting them in the portfolio pipeline.
Evaluation Layers
We need evals at three distinct levels, each answering a different question:
Layer 1: Agent Extraction Accuracy
Question: Did the agent correctly extract/interpret the raw data?
This is the cheapest eval and catches the most dangerous failures (hallucinated data).
| Eval | Method | Metric |
|---|---|---|
| Factual extraction | Give agent real EDGAR filings, verify it extracts correct transaction amounts, dates, persons | Precision/recall on structured fields |
| Numerical faithfulness | Compare agent-output signal values to deterministic Python computation on same data | Exact match rate (numbers should be identical) |
| Hallucination detection | Feed agent empty/null data, check it doesn’t invent signals | False positive rate on null inputs |
| Format compliance | Validate output JSON against SignalBundle schema | Parse success rate |
Implementation:
# eval_extraction.py
# Golden set: 50 real EDGAR filings with hand-labeled fields
# Agent processes each, we compare output to ground truth

def eval_insider_extraction():
    golden = load_golden_set("edgar_form4_golden.json")
    for case in golden:
        agent_output = run_agent("insider_signal", case["filing"])
        assert_fields_match(agent_output, case["expected"],
                            fields=["person", "transaction_type", "shares", "price"])

    # Null input test: an empty filing must yield zero signals
    agent_output = run_agent("insider_signal", EMPTY_FILING)
    assert len(agent_output["signals"]) == 0, "Hallucinated signal from empty input"
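The "Format compliance" row needs a parser that fails loudly so the harness can count failures. A minimal sketch, assuming a SignalBundle shape with a top-level `signals` list; the required field names below are placeholders for the real schema:

```python
import json

# Placeholder field set -- substitute the real SignalBundle schema fields.
REQUIRED_SIGNAL_FIELDS = {"asset", "signal_type", "score", "confidence"}

def parse_signal_bundle(raw: str) -> dict:
    """Parse agent output and enforce the bundle contract.

    Raises ValueError on malformed JSON or missing fields, so the eval
    harness can count failures instead of crashing on them.
    """
    bundle = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(bundle, dict) or not isinstance(bundle.get("signals"), list):
        raise ValueError("missing 'signals' list")
    for sig in bundle["signals"]:
        missing = REQUIRED_SIGNAL_FIELDS - sig.keys()
        if missing:
            raise ValueError(f"signal missing fields: {sorted(missing)}")
        if not 0.0 <= sig["confidence"] <= 1.0:
            raise ValueError("confidence outside [0, 1]")
    return bundle

def parse_success_rate(outputs) -> float:
    """Fraction of raw agent outputs that satisfy the schema."""
    def ok(raw):
        try:
            parse_signal_bundle(raw)
            return True
        except ValueError:
            return False
    return sum(1 for raw in outputs if ok(raw)) / len(outputs)
```

Counting failures (rather than asserting) lets one malformed output lower the parse-success metric without aborting the whole eval run.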
Layer 2: Signal Predictive Power
Question: Do the signals actually predict future returns?
This is the core quant eval. Uses historical data with forward-looking returns.
| Metric | Definition | Good | Bad |
|---|---|---|---|
| Information Coefficient (IC) | Rank correlation between signal score and subsequent N-day return | >0.03 | <0.01 |
| IC Information Ratio (ICIR) | mean(IC) / std(IC) across time periods | >0.5 | <0.2 |
| Hit Rate | % of periods where IC > 0 | >55% | <50% |
| Signal Decay Curve | IC as function of horizon (1d, 5d, 20d, 60d) | Slow decay | Instant decay |
| Turnover-adjusted IC | IC net of transaction costs from signal changes | Still positive | Negative |
Implementation:
# eval_predictive.py
# Walk-forward backtest: at each date t, generate signals,
# measure correlation with returns at t+horizon

import numpy as np
from scipy.stats import spearmanr

def eval_information_coefficient(signal_type, universe, start, end, horizon=20):
    """
    Walk-forward IC measurement.
    At each rebalancing date:
      1. Run signal agent on data available as of that date
      2. Record signal scores per asset
      3. Measure rank correlation with forward returns (t to t+horizon)
      4. Aggregate IC across all dates
    """
    ics = []
    for date in rebalance_dates(start, end, freq="weekly"):
        signals = run_agent_at_date(signal_type, universe, date)
        forward_returns = get_returns(universe, date, date + horizon)
        ic, _ = spearmanr(signals, forward_returns)  # spearmanr returns (statistic, p-value)
        ics.append(ic)
    return {
        "mean_ic": np.mean(ics),
        "ic_std": np.std(ics),
        "icir": np.mean(ics) / np.std(ics),
        "hit_rate": np.mean([ic > 0 for ic in ics]),
        "n_periods": len(ics),
        "t_stat": np.mean(ics) / (np.std(ics) / np.sqrt(len(ics))),
    }

def eval_decay_curve(signal_type, universe, start, end):
    """IC at different horizons — reveals signal's natural timescale."""
    horizons = [1, 5, 10, 20, 40, 60]
    return {h: eval_information_coefficient(signal_type, universe, start, end, h)
            for h in horizons}
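The "Turnover-adjusted IC" row in the table has no implementation above. One simple sketch penalizes mean IC by signal churn between rebalances; the rank normalization and the `penalty` weight are illustrative assumptions, not a calibrated transaction-cost model:

```python
import numpy as np

def rank_normalize(scores):
    """Map raw scores to cross-sectional ranks scaled into [0, 1]."""
    ranks = np.argsort(np.argsort(scores))
    return ranks / (len(scores) - 1)

def period_turnover(curr_scores, prev_scores):
    """Mean absolute change in rank-normalized scores between rebalances."""
    delta = rank_normalize(curr_scores) - rank_normalize(prev_scores)
    return float(np.mean(np.abs(delta)))

def turnover_penalized_ic(ics, turnovers, penalty=0.05):
    """mean(IC) minus a hand-set penalty per unit of average turnover.

    `penalty` stands in for a real cost model; calibrate it against
    actual execution costs before trusting the sign of the result.
    """
    return float(np.mean(ics) - penalty * np.mean(turnovers))
```

A signal whose raw IC is positive but whose penalized IC is negative fails the table's "Still positive" criterion.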
Key design choice: walk-forward, not in-sample. We only ever measure signal quality on data the agent hasn’t seen. This is critical — backtesting with lookahead bias is the #1 way quant evals lie.
Layer 3: Portfolio-Level Performance
Question: Do the signal-driven parameter updates improve portfolio outcomes?
This evaluates the full pipeline end-to-end: signals → Bayesian update → Kelly → portfolio.
| Metric | Definition | Good |
|---|---|---|
| Sharpe Ratio improvement | Sharpe(signal-adjusted portfolio) - Sharpe(static portfolio) | >0 |
| Max drawdown change | Does signal integration reduce or increase worst drawdown? | Reduce |
| Turnover | How much does the portfolio trade due to signal changes? | <50% annual |
| Parameter stability | How much do μ/σ estimates jump between updates? | Smooth |
Implementation:
# eval_portfolio.py
# Compare: (A) static Config.hs parameters vs (B) signal-adjusted parameters

import numpy as np

def eval_portfolio_improvement(start, end):
    """
    Walk-forward portfolio simulation:
      A = baseline (static μ=0.15, σ=0.65 for BTC, etc.)
      B = signal-adjusted (Bayesian-updated μ, σ from agent)
    Both use same Kelly optimizer + same rebalancing frequency.
    Compare risk-adjusted returns.
    """
    portfolio_a = simulate_portfolio(static_params(), start, end)
    portfolio_b = simulate_portfolio(signal_params(), start, end)
    return {
        "sharpe_delta": portfolio_b.sharpe - portfolio_a.sharpe,
        "max_dd_delta": portfolio_b.max_drawdown - portfolio_a.max_drawdown,
        "return_delta": portfolio_b.total_return - portfolio_a.total_return,
        "turnover_b": portfolio_b.annual_turnover,
        "param_stability": np.std(portfolio_b.param_history),
    }
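The `max_dd_delta` comparison assumes a drawdown helper on the simulated equity curve. A standard implementation (the equity-curve input is an assumption about what `simulate_portfolio` exposes):

```python
import numpy as np

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a positive fraction."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)  # best value seen so far
    drawdowns = 1.0 - equity / running_peak       # fractional drop from that peak
    return float(drawdowns.max())
```

On an equity path of 100 → 120 → 60 → 90 this reports 0.5, the 50% fall from the 120 peak, regardless of the later recovery.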
Eval Infrastructure Design
Golden Set Construction
We need curated test cases. Two approaches:
1. Historical replay (cheap, scalable):
- Download historical EDGAR, FRED, price data for 2020-2025
- At each date t, freeze “available data” = everything before t
- Run agent, compare signal output to known forward returns
- Pro: large sample size. Con: no ground truth for “correct interpretation”
2. Hand-labeled cases (expensive, precise):
- Expert labels 50 EDGAR filings as “bullish signal” / “no signal” / “bearish”
- Expert labels 20 macro snapshots as “expansion” / “contraction” / “transition”
- Agent must match expert labels
- Pro: tests interpretation quality. Con: small sample, labor-intensive
Recommendation: both. Historical replay for IC (layer 2). Hand-labeled for extraction accuracy (layer 1).
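The "freeze available data" step in historical replay is just a strict publication-timestamp filter, but getting it wrong is the usual source of lookahead bias. A sketch, where the `published` field name is an assumption about the record format:

```python
from datetime import date

def as_of(records, cutoff):
    """Keep only records published strictly before `cutoff`.

    Filtering on publication/filing date rather than event date matters:
    a Form 4 is often filed days after the trade it reports, and using
    the trade date would leak future information into the replay.
    """
    return [r for r in records if r["published"] < cutoff]
```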
Anti-Overfitting Measures
- Temporal separation: Train period (2020-2023) vs test period (2024-2025). Never evaluate on training data.
- Multiple testing correction: If we test 20 signal types, apply Bonferroni or FDR correction. A t-stat of 2.0 on the best of 20 tested signals is not significant.
- Out-of-distribution robustness: Test on 2020 (COVID crash), 2022 (rate hikes), 2024 (recovery). Signal must work across regimes, not just one market condition.
- Paper trading period: After backtests pass, run the signal pipeline live for 1-3 months alongside static parameters. Compare predictions vs realized returns in real-time before committing capital.
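The FDR correction mentioned above can be implemented directly (Benjamini-Hochberg step-up procedure), avoiding a statsmodels dependency; a minimal sketch:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR control: return which hypotheses survive.

    Sort p-values ascending; find the largest rank k such that
    p_(k) <= (k / m) * alpha; reject exactly the k smallest p-values.
    Returns a boolean list aligned with the input order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            max_k = rank
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            passed[i] = True
    return passed
```

With 20 signal types, a lone raw p-value near 0.04 will generally not survive this correction, which is exactly the point of the bullet above.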
Monitoring (Post-Deployment)
Once signals are live, continuous monitoring:
# monitor.py — runs weekly alongside signal pipeline

def weekly_report():
    """
    Track live signal quality.
    Alert if any metric degrades below threshold.
    """
    for signal_type in active_signals:
        recent_ic = compute_ic(signal_type, lookback="90d")
        if recent_ic["mean_ic"] < IC_THRESHOLD:
            alert(f"{signal_type} IC dropped to {recent_ic['mean_ic']:.3f}")
        if recent_ic["hit_rate"] < 0.50:
            alert(f"{signal_type} hit rate below 50% — signal may be dead")

    # Parameter drift check
    param_changes = load_param_history(lookback="30d")
    if any_large_jumps(param_changes):
        alert("Large parameter jump detected — review signal inputs")
Signal-Specific Eval Criteria
Insider Signal (EDGAR Form 4)
- Extraction accuracy: Transaction type, shares, price, person role → 95%+ match
- Known alpha: Literature shows insider purchases predict 3-5% excess returns over 60 days (Jeng et al. 2003, Lakonishok & Lee 2001)
- Eval baseline: Does our agent’s insider signal achieve IC > 0.02 at 60-day horizon?
- Null test: Companies with no insider activity should produce null signal
Macro Regime (FRED)
- Extraction accuracy: Agent must correctly identify yield curve inversion, M2 trend direction, unemployment trajectory → compare to NBER recession dates
- Known alpha: Yield curve inversion predicts recession with ~12-month lead (Estrella & Mishkin 1998)
- Eval baseline: Regime classification accuracy vs NBER business cycle dating
- Stress test: Present agent with ambiguous data (conflicting indicators) — should output low confidence, not fabricate a view
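The regime-accuracy baseline reduces to comparing agent labels to NBER dating month by month. A minimal sketch, assuming both sides are keyed by month string and use the same label vocabulary:

```python
def regime_accuracy(agent_labels, nber_labels):
    """Fraction of overlapping months where the agent's regime call
    matches NBER business-cycle dating.

    Both inputs are {month: label} dicts, e.g. {"2020-03": "contraction"}.
    Months present on only one side are ignored.
    """
    months = agent_labels.keys() & nber_labels.keys()
    if not months:
        return 0.0
    hits = sum(agent_labels[m] == nber_labels[m] for m in months)
    return hits / len(months)
```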
Price Momentum/Vol
- Extraction accuracy: N/A — this is deterministic Python, not LLM
- Known alpha: Cross-sectional momentum (Jegadeesh & Titman 1993): top decile outperforms bottom by ~12%/year
- Eval baseline: IC > 0.03 at 20-day horizon for momentum signal
- The real test: Does the LLM coherence assessment (cross-signal synthesis) add value beyond the raw deterministic signals? Measure IC with and without the coherence step.
Coherence/Synthesis Layer (the LLM-specific part)
This is the most important eval because it’s the only part where the LLM does something that can’t be done deterministically.
- Value-add test: IC(combined_with_LLM) vs IC(combined_without_LLM). If the LLM’s cross-signal assessment doesn’t improve IC, remove it.
- Confidence calibration: When the agent says confidence=0.8, it should be right ~80% of the time. Plot calibration curve.
- Consistency: Run the same inputs through the agent 10 times. Signal scores should be stable (low variance). High variance = unreliable.
- Reasoning quality: For each coherence assessment, the agent must explain its reasoning. Human reviews a sample for logical validity.
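The calibration bullet above can be checked with a simple reliability table: bin predictions by stated confidence and compare each bin's mean confidence to its realized hit rate. A sketch (bin count is an arbitrary choice):

```python
import numpy as np

def calibration_curve(confidences, outcomes, n_bins=5):
    """Return (mean_confidence, hit_rate) pairs for each non-empty bin.

    `outcomes` is 1 where the agent's call was right, 0 where it was wrong.
    A well-calibrated agent has the two numbers close in every pair.
    """
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pairs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins, except the last bin which includes confidence == 1.0
        mask = (confidences >= lo) & ((confidences < hi) if hi < 1.0 else (confidences <= hi))
        if mask.any():
            pairs.append((float(confidences[mask].mean()), float(outcomes[mask].mean())))
    return pairs
```

An agent that says confidence=0.9 but lands in a bin with a 0.6 hit rate is the "overconfident garbage" case the minimum-viable-eval list warns about.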
Phased Eval Plan
Phase 1 Evals (with Phase 1 code — no LLM)
- Deterministic signal computation: compare Python outputs to reference values
- Basic IC on price signals (momentum, vol, mean-reversion) using historical data
- Validate JSON schema compliance
- This is pure software testing — unit tests + integration tests
Phase 2 Evals (with Bayesian integration)
- Black-Litterman math verification: known inputs → expected outputs
- Portfolio sim: signal-adjusted vs static, walk-forward on 2020-2025 data
- Parameter stability checks
Phase 3 Evals (agent enters)
- Layer 1: extraction accuracy on golden set
- Layer 2: IC measurement per agent signal type
- Coherence layer value-add test
- Confidence calibration
- Consistency (multi-run variance)
Phase 4 Evals (live monitoring)
- Weekly IC tracking with alert thresholds
- Signal decay curve updates
- Portfolio attribution (which signals drove returns?)
- Quarterly full re-eval on expanding test set
Key Insight from AlphaEval Paper
The AlphaEval framework proposes five dimensions for evaluating alpha signals without full backtesting:
- Predictive power (IC, RankIC)
- Temporal stability (IC consistency over time)
- Robustness (performance under market perturbations)
- Financial logic (interpretability — does the signal make economic sense?)
- Diversity (is this signal different from existing signals?)
Dimensions 4-5 are especially relevant for LLM-generated signals because LLMs tend to converge on the same factors (the “homogenization” problem from AlphaAgent). Our eval should check that each signal the agent produces is (a) economically motivated and (b) not just a restatement of another signal.
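The diversity check (dimension 5) can be approximated by correlating a candidate signal's score series against every signal already in the library; the 0.7 threshold below is an illustrative choice, not a standard:

```python
import numpy as np

def is_diverse(candidate, existing_signals, max_abs_corr=0.7):
    """True if the candidate's scores are not too correlated with any
    existing signal; a crude screen for the homogenization problem."""
    candidate = np.asarray(candidate, dtype=float)
    for scores in existing_signals:
        corr = np.corrcoef(candidate, np.asarray(scores, dtype=float))[0, 1]
        if abs(corr) >= max_abs_corr:
            return False
    return True
```

A restated signal (e.g. the same momentum factor with a different prompt) correlates near ±1 with its original and is rejected; a genuinely new signal passes.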
Minimum Viable Eval
If we do nothing else, do this:
- Historical replay IC test for each signal type, 2020-2025, 20-day horizon
- Null input test — agent produces no signals from empty/null data
- Confidence calibration — plot predicted confidence vs realized accuracy
- A/B portfolio sim — signal-adjusted vs static params, walk-forward
These four tests catch ~90% of potential failures:
- IC test catches useless signals
- Null test catches hallucinations
- Calibration catches overconfident garbage
- A/B sim catches negative portfolio impact even from “good” individual signals