Signal Agent Evaluation Framework
Date: 2026-03-11
Context: Ben wants evals to validate signal agent quality before trusting them in the portfolio pipeline.
Evaluation Layers
We need evals at three distinct levels, each answering a different question:
Layer 1: Agent Extraction Accuracy
Question: Did the agent correctly extract/interpret the raw data?
This is the cheapest eval and catches the most dangerous failures (hallucinated data).
| Eval | Method | Metric |
|---|---|---|
| Factual extraction | Give agent real EDGAR filings, verify it extracts correct transaction amounts, dates, persons | Precision/recall on structured fields |
| Numerical faithfulness | Compare agent-output signal values to deterministic Python computation on same data | Exact match rate (numbers should be identical) |
| Hallucination detection | Feed agent empty/null data, check it doesn’t invent signals | False positive rate on null inputs |
| Format compliance | Validate output JSON against SignalBundle schema | Parse success rate |
Implementation:
# eval_extraction.py
# Golden set: 50 real EDGAR filings with hand-labeled fields
# Agent processes each, we compare output to ground truth

def eval_insider_extraction():
    golden = load_golden_set("edgar_form4_golden.json")
    for case in golden:
        agent_output = run_agent("insider_signal", case["filing"])
        assert_fields_match(agent_output, case["expected"],
                            fields=["person", "transaction_type", "shares", "price"])

    # Null input test: an empty filing must yield zero signals
    agent_output = run_agent("insider_signal", EMPTY_FILING)
    assert len(agent_output["signals"]) == 0, "Hallucinated signal from empty input"
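The "Format compliance" row needs a parser that fails loudly so the harness can count failures. A minimal sketch, assuming a SignalBundle shape with a top-level `signals` list; the required field names below are placeholders for the real schema:

```python
import json

# Placeholder field set -- substitute the real SignalBundle schema fields.
REQUIRED_SIGNAL_FIELDS = {"asset", "signal_type", "score", "confidence"}

def parse_signal_bundle(raw: str) -> dict:
    """Parse agent output and enforce the bundle contract.

    Raises ValueError on malformed JSON or missing fields, so the eval
    harness can count failures instead of crashing on them.
    """
    bundle = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(bundle, dict) or not isinstance(bundle.get("signals"), list):
        raise ValueError("missing 'signals' list")
    for sig in bundle["signals"]:
        missing = REQUIRED_SIGNAL_FIELDS - sig.keys()
        if missing:
            raise ValueError(f"signal missing fields: {sorted(missing)}")
        if not 0.0 <= sig["confidence"] <= 1.0:
            raise ValueError("confidence outside [0, 1]")
    return bundle

def parse_success_rate(outputs) -> float:
    """Fraction of raw agent outputs that satisfy the schema."""
    def ok(raw):
        try:
            parse_signal_bundle(raw)
            return True
        except ValueError:
            return False
    return sum(1 for raw in outputs if ok(raw)) / len(outputs)
```

Counting failures (rather than asserting) lets one malformed output lower the parse-success metric without aborting the whole eval run.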
Layer 2: Signal Predictive Power
Question: Do the signals actually predict future returns?
This is the core quant eval. Uses historical data with forward-looking returns.
| Metric | Definition | Good | Bad |
|---|---|---|---|
| Information Coefficient (IC) | Rank correlation between signal score and subsequent N-day return | >0.03 | <0.01 |
| IC Information Ratio (ICIR) | mean(IC) / std(IC) across time periods | >0.5 | <0.2 |
| Hit Rate | % of periods where IC > 0 | >55% | <50% |
| Signal Decay Curve | IC as function of horizon (1d, 5d, 20d, 60d) | Slow decay | Instant decay |
| Turnover-adjusted IC | IC net of transaction costs from signal changes | Still positive | Negative |
Implementation:
# eval_predictive.py
# Walk-forward backtest: at each date t, generate signals,
# measure correlation with returns at t+horizon

import numpy as np
from scipy.stats import spearmanr

def eval_information_coefficient(signal_type, universe, start, end, horizon=20):
    """
    Walk-forward IC measurement.
    At each rebalancing date:
      1. Run signal agent on data available as of that date
      2. Record signal scores per asset
      3. Measure rank correlation with forward returns (t to t+horizon)
      4. Aggregate IC across all dates
    """
    ics = []
    for date in rebalance_dates(start, end, freq="weekly"):
        signals = run_agent_at_date(signal_type, universe, date)
        forward_returns = get_returns(universe, date, date + horizon)
        ic, _ = spearmanr(signals, forward_returns)  # spearmanr returns (statistic, p-value)
        ics.append(ic)
    return {
        "mean_ic": np.mean(ics),
        "ic_std": np.std(ics),
        "icir": np.mean(ics) / np.std(ics),
        "hit_rate": np.mean([ic > 0 for ic in ics]),
        "n_periods": len(ics),
        "t_stat": np.mean(ics) / (np.std(ics) / np.sqrt(len(ics))),
    }

def eval_decay_curve(signal_type, universe, start, end):
    """IC at different horizons — reveals signal's natural timescale."""
    horizons = [1, 5, 10, 20, 40, 60]
    return {h: eval_information_coefficient(signal_type, universe, start, end, h)
            for h in horizons}
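The "Turnover-adjusted IC" row in the table has no implementation above. One simple sketch penalizes mean IC by signal churn between rebalances; the rank normalization and the `penalty` weight are illustrative assumptions, not a calibrated transaction-cost model:

```python
import numpy as np

def rank_normalize(scores):
    """Map raw scores to cross-sectional ranks scaled into [0, 1]."""
    ranks = np.argsort(np.argsort(scores))
    return ranks / (len(scores) - 1)

def period_turnover(curr_scores, prev_scores):
    """Mean absolute change in rank-normalized scores between rebalances."""
    delta = rank_normalize(curr_scores) - rank_normalize(prev_scores)
    return float(np.mean(np.abs(delta)))

def turnover_penalized_ic(ics, turnovers, penalty=0.05):
    """mean(IC) minus a hand-set penalty per unit of average turnover.

    `penalty` stands in for a real cost model; calibrate it against
    actual execution costs before trusting the sign of the result.
    """
    return float(np.mean(ics) - penalty * np.mean(turnovers))
```

A signal whose raw IC is positive but whose penalized IC is negative fails the table's "Still positive" criterion.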
Key design choice: walk-forward, not in-sample. We only ever measure signal quality on data the agent hasn’t seen. This is critical — backtesting with lookahead bias is the #1 way quant evals lie.
Layer 3: Portfolio-Level Performance
Question: Do the signal-driven parameter updates improve portfolio outcomes?
This evaluates the full pipeline end-to-end: signals → Bayesian update → Kelly → portfolio.
| Metric | Definition | Good |
|---|---|---|
| Sharpe Ratio improvement | Sharpe(signal-adjusted portfolio) - Sharpe(static portfolio) | >0 |
| Max drawdown change | Does signal integration reduce or increase worst drawdown? | Reduce |
| Turnover | How much does the portfolio trade due to signal changes? | <50% annual |
| Parameter stability | How much do μ/σ estimates jump between updates? | Smooth |
Implementation:
# eval_portfolio.py
# Compare: (A) static Config.hs parameters vs (B) signal-adjusted parameters

import numpy as np

def eval_portfolio_improvement(start, end):
    """
    Walk-forward portfolio simulation:
      A = baseline (static μ=0.15, σ=0.65 for BTC, etc.)
      B = signal-adjusted (Bayesian-updated μ, σ from agent)
    Both use same Kelly optimizer + same rebalancing frequency.
    Compare risk-adjusted returns.
    """
    portfolio_a = simulate_portfolio(static_params(), start, end)
    portfolio_b = simulate_portfolio(signal_params(), start, end)
    return {
        "sharpe_delta": portfolio_b.sharpe - portfolio_a.sharpe,
        "max_dd_delta": portfolio_b.max_drawdown - portfolio_a.max_drawdown,
        "return_delta": portfolio_b.total_return - portfolio_a.total_return,
        "turnover_b": portfolio_b.annual_turnover,
        "param_stability": np.std(portfolio_b.param_history),
    }
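The `max_dd_delta` comparison assumes a drawdown helper on the simulated equity curve. A standard implementation (the equity-curve input is an assumption about what `simulate_portfolio` exposes):

```python
import numpy as np

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a positive fraction."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)  # best value seen so far
    drawdowns = 1.0 - equity / running_peak       # fractional drop from that peak
    return float(drawdowns.max())
```

On an equity path of 100 → 120 → 60 → 90 this reports 0.5, the 50% fall from the 120 peak, regardless of the later recovery.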
Eval Infrastructure Design
Golden Set Construction
We need curated test cases. Two approaches:
1. Historical replay (cheap, scalable):
- Download historical EDGAR, FRED, price data for 2020-2025
- At each date t, freeze “available data” = everything before t
- Run agent, compare signal output to known forward returns
- Pro: large sample size. Con: no ground truth for “correct interpretation”
2. Hand-labeled cases (expensive, precise):
- Expert labels 50 EDGAR filings as “bullish signal” / “no signal” / “bearish”
- Expert labels 20 macro snapshots as “expansion” / “contraction” / “transition”
- Agent must match expert labels
- Pro: tests interpretation quality. Con: small sample, labor-intensive
Recommendation: both. Historical replay for IC (layer 2). Hand-labeled for extraction accuracy (layer 1).
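The "freeze available data" step in historical replay is just a strict publication-timestamp filter, but getting it wrong is the usual source of lookahead bias. A sketch, where the `published` field name is an assumption about the record format:

```python
from datetime import date

def as_of(records, cutoff):
    """Keep only records published strictly before `cutoff`.

    Filtering on publication/filing date rather than event date matters:
    a Form 4 is often filed days after the trade it reports, and using
    the trade date would leak future information into the replay.
    """
    return [r for r in records if r["published"] < cutoff]
```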
Anti-Overfitting Measures
- Temporal separation: Train period (2020-2023) vs test period (2024-2025). Never evaluate on training data.
- Multiple testing correction: If we test 20 signal types, apply Bonferroni or FDR correction. A t-stat of 2.0 on the best of 20 tested signals is not significant.
- Out-of-distribution robustness: Test on 2020 (COVID crash), 2022 (rate hikes), 2024 (recovery). Signal must work across regimes, not just one market condition.
- Paper trading period: After backtests pass, run the signal pipeline live for 1-3 months alongside static parameters. Compare predictions vs realized returns in real-time before committing capital.
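The FDR correction mentioned above can be implemented directly (Benjamini-Hochberg step-up procedure), avoiding a statsmodels dependency; a minimal sketch:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR control: return which hypotheses survive.

    Sort p-values ascending; find the largest rank k such that
    p_(k) <= (k / m) * alpha; reject exactly the k smallest p-values.
    Returns a boolean list aligned with the input order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            max_k = rank
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            passed[i] = True
    return passed
```

With 20 signal types, a lone raw p-value near 0.04 will generally not survive this correction, which is exactly the point of the bullet above.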
Monitoring (Post-Deployment)
Once signals are live, continuous monitoring:
# monitor.py — runs weekly alongside signal pipeline

def weekly_report():
    """
    Track live signal quality.
    Alert if any metric degrades below threshold.
    """
    for signal_type in active_signals:
        recent_ic = compute_ic(signal_type, lookback="90d")
        if recent_ic["mean_ic"] < IC_THRESHOLD:
            alert(f"{signal_type} IC dropped to {recent_ic['mean_ic']:.3f}")
        if recent_ic["hit_rate"] < 0.50:
            alert(f"{signal_type} hit rate below 50% — signal may be dead")

    # Parameter drift check
    param_changes = load_param_history(lookback="30d")
    if any_large_jumps(param_changes):
        alert("Large parameter jump detected — review signal inputs")
Signal-Specific Eval Criteria
Insider Signal (EDGAR Form 4)
- Extraction accuracy: Transaction type, shares, price, person role → 95%+ match
- Known alpha: Literature shows insider purchases predict 3-5% excess returns over 60 days (Jeng et al. 2003, Lakonishok & Lee 2001)
- Eval baseline: Does our agent’s insider signal achieve IC > 0.02 at 60-day horizon?
- Null test: Companies with no insider activity should produce null signal
Macro Regime (FRED)
- Extraction accuracy: Agent must correctly identify yield curve inversion, M2 trend direction, unemployment trajectory → compare to NBER recession dates
- Known alpha: Yield curve inversion predicts recession with ~12-month lead (Estrella & Mishkin 1998)
- Eval baseline: Regime classification accuracy vs NBER business cycle dating
- Stress test: Present agent with ambiguous data (conflicting indicators) — should output low confidence, not fabricate a view
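The regime-accuracy baseline reduces to comparing agent labels to NBER dating month by month. A minimal sketch, assuming both sides are keyed by month string and use the same label vocabulary:

```python
def regime_accuracy(agent_labels, nber_labels):
    """Fraction of overlapping months where the agent's regime call
    matches NBER business-cycle dating.

    Both inputs are {month: label} dicts, e.g. {"2020-03": "contraction"}.
    Months present on only one side are ignored.
    """
    months = agent_labels.keys() & nber_labels.keys()
    if not months:
        return 0.0
    hits = sum(agent_labels[m] == nber_labels[m] for m in months)
    return hits / len(months)
```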
Price Momentum/Vol
- Extraction accuracy: N/A — this is deterministic Python, not LLM
- Known alpha: Cross-sectional momentum (Jegadeesh & Titman 1993): top decile outperforms bottom by ~12%/year
- Eval baseline: IC > 0.03 at 20-day horizon for momentum signal
- The real test: Does the LLM coherence assessment (cross-signal synthesis) add value beyond the raw deterministic signals? Measure IC with and without the coherence step.
Coherence/Synthesis Layer (the LLM-specific part)
This is the most important eval because it’s the only part where the LLM does something that can’t be done deterministically.
- Value-add test: IC(combined_with_LLM) vs IC(combined_without_LLM). If the LLM’s cross-signal assessment doesn’t improve IC, remove it.
- Confidence calibration: When the agent says confidence=0.8, it should be right ~80% of the time. Plot calibration curve.
- Consistency: Run the same inputs through the agent 10 times. Signal scores should be stable (low variance). High variance = unreliable.
- Reasoning quality: For each coherence assessment, the agent must explain its reasoning. Human reviews a sample for logical validity.
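The calibration bullet above can be checked with a simple reliability table: bin predictions by stated confidence and compare each bin's mean confidence to its realized hit rate. A sketch (bin count is an arbitrary choice):

```python
import numpy as np

def calibration_curve(confidences, outcomes, n_bins=5):
    """Return (mean_confidence, hit_rate) pairs for each non-empty bin.

    `outcomes` is 1 where the agent's call was right, 0 where it was wrong.
    A well-calibrated agent has the two numbers close in every pair.
    """
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pairs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins, except the last bin which includes confidence == 1.0
        mask = (confidences >= lo) & ((confidences < hi) if hi < 1.0 else (confidences <= hi))
        if mask.any():
            pairs.append((float(confidences[mask].mean()), float(outcomes[mask].mean())))
    return pairs
```

An agent that says confidence=0.9 but lands in a bin with a 0.6 hit rate is the "overconfident garbage" case the minimum-viable-eval list warns about.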
Phased Eval Plan
Phase 1 Evals (with Phase 1 code — no LLM)
- Deterministic signal computation: compare Python outputs to reference values
- Basic IC on price signals (momentum, vol, mean-reversion) using historical data
- Validate JSON schema compliance
- This is pure software testing — unit tests + integration tests
Phase 2 Evals (with Bayesian integration)
- Black-Litterman math verification: known inputs → expected outputs
- Portfolio sim: signal-adjusted vs static, walk-forward on 2020-2025 data
- Parameter stability checks
Phase 3 Evals (agent enters)
- Layer 1: extraction accuracy on golden set
- Layer 2: IC measurement per agent signal type
- Coherence layer value-add test
- Confidence calibration
- Consistency (multi-run variance)
Phase 4 Evals (live monitoring)
- Weekly IC tracking with alert thresholds
- Signal decay curve updates
- Portfolio attribution (which signals drove returns?)
- Quarterly full re-eval on expanding test set
Key Insight from AlphaEval Paper
The AlphaEval framework proposes five dimensions for evaluating alpha signals without full backtesting:
- Predictive power (IC, RankIC)
- Temporal stability (IC consistency over time)
- Robustness (performance under market perturbations)
- Financial logic (interpretability — does the signal make economic sense?)
- Diversity (is this signal different from existing signals?)
Dimensions 4-5 are especially relevant for LLM-generated signals because LLMs tend to converge on the same factors (the “homogenization” problem from AlphaAgent). Our eval should check that each signal the agent produces is (a) economically motivated and (b) not just a restatement of another signal.
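The diversity check (dimension 5) can be approximated by correlating a candidate signal's score series against every signal already in the library; the 0.7 threshold below is an illustrative choice, not a standard:

```python
import numpy as np

def is_diverse(candidate, existing_signals, max_abs_corr=0.7):
    """True if the candidate's scores are not too correlated with any
    existing signal; a crude screen for the homogenization problem."""
    candidate = np.asarray(candidate, dtype=float)
    for scores in existing_signals:
        corr = np.corrcoef(candidate, np.asarray(scores, dtype=float))[0, 1]
        if abs(corr) >= max_abs_corr:
            return False
    return True
```

A restated signal (e.g. the same momentum factor with a different prompt) correlates near ±1 with its original and is rejected; a genuinely new signal passes.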
Minimum Viable Eval
If we do nothing else, do this:
- Historical replay IC test for each signal type, 2020-2025, 20-day horizon
- Null input test — agent produces no signals from empty/null data
- Confidence calibration — plot predicted confidence vs realized accuracy
- A/B portfolio sim — signal-adjusted vs static params, walk-forward
These four tests catch ~90% of potential failures:
- IC test catches useless signals
- Null test catches hallucinations
- Calibration catches overconfident garbage
- A/B sim catches negative portfolio impact even from “good” individual signals