Signal Agent Evaluation Framework

Date: 2026-03-11
Context: Ben wants evals to validate signal-agent quality before trusting the agents in the portfolio pipeline.


Evaluation Layers

We need evals at three distinct levels, each answering a different question:

Layer 1: Agent Extraction Accuracy

Question: Did the agent correctly extract/interpret the raw data?

This is the cheapest eval and catches the most dangerous failures (hallucinated data).

| Eval | Method | Metric |
|---|---|---|
| Factual extraction | Give agent real EDGAR filings; verify it extracts correct transaction amounts, dates, persons | Precision/recall on structured fields |
| Numerical faithfulness | Compare agent-output signal values to a deterministic Python computation on the same data | Exact match rate (numbers should be identical) |
| Hallucination detection | Feed agent empty/null data; check it doesn't invent signals | False positive rate on null inputs |
| Format compliance | Validate output JSON against the SignalBundle schema | Parse success rate |

Implementation:

# eval_extraction.py
# Golden set: 50 real EDGAR filings with hand-labeled fields
# Agent processes each, we compare output to ground truth

def eval_insider_extraction():
    golden = load_golden_set("edgar_form4_golden.json")
    for case in golden:
        agent_output = run_agent("insider_signal", case["filing"])
        assert_fields_match(agent_output, case["expected"],
                            fields=["person", "transaction_type", "shares", "price"])
    
    # Null input test
    agent_output = run_agent("insider_signal", EMPTY_FILING)
    assert len(agent_output["signals"]) == 0, "Hallucinated signal from empty input"
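The format-compliance row of the table has no code above; here is a minimal stdlib-only sketch. The SignalBundle field names and types below are illustrative assumptions, not the real schema:

```python
# eval_format.py -- sketch of the format-compliance check.
# REQUIRED_FIELDS is an assumed shape for SignalBundle entries, not the real schema.
import json

REQUIRED_FIELDS = {"signal_type": str, "score": float, "confidence": float, "as_of": str}

def parses_as_signal_bundle(raw: str) -> bool:
    """True iff `raw` is valid JSON and every signal carries the required typed fields."""
    try:
        bundle = json.loads(raw)
    except json.JSONDecodeError:
        return False
    signals = bundle.get("signals")
    if not isinstance(signals, list):
        return False
    return all(
        isinstance(sig.get(field), typ)
        for sig in signals
        for field, typ in REQUIRED_FIELDS.items()
    )

def parse_success_rate(outputs: list[str]) -> float:
    """The table's metric: fraction of agent outputs that validate."""
    return sum(parses_as_signal_bundle(o) for o in outputs) / len(outputs)
```

In practice a JSON Schema validator over the real SignalBundle schema would replace the hand-rolled type check; the metric stays the same.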

Layer 2: Signal Predictive Power

Question: Do the signals actually predict future returns?

This is the core quant eval. Uses historical data with forward-looking returns.

| Metric | Definition | Good | Bad |
|---|---|---|---|
| Information Coefficient (IC) | Rank correlation between signal score and subsequent N-day return | >0.03 | <0.01 |
| IC Information Ratio (ICIR) | mean(IC) / std(IC) across time periods | >0.5 | <0.2 |
| Hit Rate | % of periods where IC > 0 | >55% | <50% |
| Signal Decay Curve | IC as a function of horizon (1d, 5d, 20d, 60d) | Slow decay | Instant decay |
| Turnover-adjusted IC | IC net of transaction costs incurred by signal changes | Still positive | Negative |

Implementation:

# eval_predictive.py
# Walk-forward backtest: at each date t, generate signals,
# measure correlation with returns at t+horizon

import numpy as np
from scipy.stats import spearmanr

def eval_information_coefficient(signal_type, universe, start, end, horizon=20):
    """
    Walk-forward IC measurement.
    
    At each rebalancing date:
    1. Run signal agent on data available as of that date
    2. Record signal scores per asset
    3. Measure rank correlation with forward returns (t to t+horizon)
    4. Aggregate IC across all dates
    """
    ics = []
    for date in rebalance_dates(start, end, freq="weekly"):
        signals = run_agent_at_date(signal_type, universe, date)
        forward_returns = get_returns(universe, date, date + horizon)
        ic, _ = spearmanr(signals, forward_returns)  # spearmanr returns (rho, p-value)
        ics.append(ic)
    
    return {
        "mean_ic": np.mean(ics),
        "ic_std": np.std(ics),
        "icir": np.mean(ics) / np.std(ics),
        "hit_rate": np.mean([ic > 0 for ic in ics]),
        "n_periods": len(ics),
        "t_stat": np.mean(ics) / (np.std(ics) / np.sqrt(len(ics)))
    }

def eval_decay_curve(signal_type, universe, start, end):
    """IC at different horizons — reveals signal's natural timescale."""
    horizons = [1, 5, 10, 20, 40, 60]
    return {h: eval_information_coefficient(signal_type, universe, start, end, h) 
            for h in horizons}

Key design choice: walk-forward, not in-sample. We only ever measure signal quality on data the agent hasn’t seen. This is critical — backtesting with lookahead bias is the #1 way quant evals lie.
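One cheap way to enforce that discipline structurally rather than by convention is to gate all data access through a point-in-time filter. A minimal sketch, assuming each raw record carries a `published_at` timestamp (the field name and wrapper are hypothetical, not existing code):

```python
# point_in_time.py -- sketch of a lookahead guard for walk-forward evals.
# Assumes each record has a `published_at` date; that field is an assumed data model.
from datetime import date

def as_of(records, cutoff: date):
    """Return only records that were publicly available on or before `cutoff`."""
    return [r for r in records if r["published_at"] <= cutoff]

def run_agent_at_date_safe(signal_type, universe, cutoff, all_records, run_agent):
    """Wrap the agent call so it can only ever see point-in-time data."""
    visible = as_of(all_records, cutoff)
    assert all(r["published_at"] <= cutoff for r in visible), "lookahead leak"
    return run_agent(signal_type, universe, visible)
```

If every backtest path goes through a wrapper like this, lookahead bias becomes a code-review property instead of something each eval author must remember.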

Layer 3: Portfolio-Level Performance

Question: Do the signal-driven parameter updates improve portfolio outcomes?

This evaluates the full pipeline end-to-end: signals → Bayesian update → Kelly → portfolio.
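The signals → Bayesian update → Kelly chain can be sketched with a conjugate normal-mean update. The prior below mirrors the static BTC parameters used in this section (μ=0.15, σ=0.65); the prior and signal variances are illustrative assumptions:

```python
# pipeline_sketch.py -- the signals -> Bayesian update -> Kelly chain, sketched.
# Prior/signal variances (0.04, 0.08) are assumed values for illustration only.

def posterior_mu(prior_mu, prior_var, signal_mu, signal_var):
    """Precision-weighted average of prior belief and signal evidence
    (standard conjugate update for a normal mean with known variances)."""
    w = (1 / prior_var) / (1 / prior_var + 1 / signal_var)
    return w * prior_mu + (1 - w) * signal_mu

def kelly_fraction(mu, sigma, risk_free=0.0):
    """Continuous Kelly fraction: f* = (mu - r) / sigma**2."""
    return (mu - risk_free) / sigma ** 2

# Example: a bullish signal (mu=0.25) nudges the static BTC prior (mu=0.15) up,
# by an amount controlled by the relative precisions.
mu_post = posterior_mu(prior_mu=0.15, prior_var=0.04, signal_mu=0.25, signal_var=0.08)
f_star = kelly_fraction(mu_post, sigma=0.65)
```

The point of evaluating end-to-end is that a signal can have positive IC yet still hurt the portfolio if the update step over- or under-weights it relative to the prior.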

| Metric | Definition | Good |
|---|---|---|
| Sharpe Ratio improvement | Sharpe(signal-adjusted portfolio) - Sharpe(static portfolio) | >0 |
| Max drawdown change | Does signal integration reduce or increase the worst drawdown? | Reduce |
| Turnover | How much the portfolio trades due to signal changes | <50% annual |
| Parameter stability | How much μ/σ estimates jump between updates | Smooth |

Implementation:

# eval_portfolio.py
# Compare: (A) static Config.hs parameters vs (B) signal-adjusted parameters

def eval_portfolio_improvement(start, end):
    """
    Walk-forward portfolio simulation:
    A = baseline (static μ=0.15, σ=0.65 for BTC, etc.)
    B = signal-adjusted (Bayesian-updated μ, σ from agent)
    
    Both use same Kelly optimizer + same rebalancing frequency.
    Compare risk-adjusted returns.
    """
    portfolio_a = simulate_portfolio(static_params(), start, end)
    portfolio_b = simulate_portfolio(signal_params(), start, end)
    
    return {
        "sharpe_delta": portfolio_b.sharpe - portfolio_a.sharpe,
        "max_dd_delta": portfolio_b.max_drawdown - portfolio_a.max_drawdown,
        "return_delta": portfolio_b.total_return - portfolio_a.total_return,
        "turnover_b": portfolio_b.annual_turnover,
        "param_stability": np.std(portfolio_b.param_history),
    }

Eval Infrastructure Design

Golden Set Construction

We need curated test cases. Two approaches:

1. Historical replay (cheap, scalable)

2. Hand-labeled cases (expensive, precise)

Recommendation: use both. Historical replay for IC (layer 2); hand-labeled cases for extraction accuracy (layer 1).

Anti-Overfitting Measures

  1. Temporal separation: Train period (2020-2023) vs test period (2024-2025). Never evaluate on training data.
  2. Multiple testing correction: If we test 20 signal types, apply Bonferroni or FDR correction. A t-stat of 2.0 on one signal out of 20 tested is not significant.
  3. Out-of-distribution robustness: Test on 2020 (COVID crash), 2022 (rate hikes), 2024 (recovery). Signal must work across regimes, not just one market condition.
  4. Paper trading period: After backtests pass, run the signal pipeline live for 1-3 months alongside static parameters. Compare predictions vs realized returns in real-time before committing capital.
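Measure 2 above needs only a few lines; a sketch of both corrections in plain Python (Benjamini-Hochberg as the FDR procedure):

```python
# multiple_testing.py -- Bonferroni and Benjamini-Hochberg over per-signal p-values.

def bonferroni(pvals, alpha=0.05):
    """Reject only p < alpha/m. Conservative: controls family-wise error rate."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: find the largest k with p_(k) <= (k/m) * alpha,
    then reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

With 20 tested signals, Bonferroni demands p < 0.0025 per signal, which is roughly a t-stat of 3, not 2.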

Monitoring (Post-Deployment)

Once signals are live, continuous monitoring:

# monitor.py — runs weekly alongside signal pipeline

def weekly_report():
    """
    Track live signal quality.
    Alert if any metric degrades below threshold.
    """
    for signal_type in active_signals:
        recent_ic = compute_ic(signal_type, lookback="90d")
        if recent_ic["mean_ic"] < IC_THRESHOLD:
            alert(f"{signal_type} IC dropped to {recent_ic['mean_ic']:.3f}")
        if recent_ic["hit_rate"] < 0.50:
            alert(f"{signal_type} hit rate below 50% — signal may be dead")
    
    # Parameter drift check
    param_changes = load_param_history(lookback="30d")
    if any_large_jumps(param_changes):
        alert("Large parameter jump detected — review signal inputs")
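The drift check above calls `any_large_jumps`, which is left undefined. One simple implementation flags the latest parameter change when it sits far outside the recent distribution of changes; the 3-sigma threshold is an assumption to tune, and the function assumes `load_param_history` returns a chronological list of parameter values:

```python
# drift_check.py -- a simple any_large_jumps implementation for monitor.py.
# The 3-sigma threshold is an assumed default; tune against real parameter history.
import statistics

def any_large_jumps(param_history, z_threshold=3.0):
    """Flag if the latest parameter change is > z_threshold sigmas
    away from the mean of the earlier changes."""
    changes = [b - a for a, b in zip(param_history, param_history[1:])]
    if len(changes) < 3:
        return False  # not enough history to estimate a baseline
    baseline, latest = changes[:-1], changes[-1]
    sd = statistics.pstdev(baseline)
    if sd == 0:
        return latest != statistics.mean(baseline)
    return abs(latest - statistics.mean(baseline)) / sd > z_threshold
```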

Signal-Specific Eval Criteria

Insider Signal (EDGAR Form 4)

Macro Regime (FRED)

Price Momentum/Vol

Coherence/Synthesis Layer (the LLM-specific part)

This is the most important eval because it’s the only part where the LLM does something that can’t be done deterministically.


Phased Eval Plan

Phase 1 Evals (with Phase 1 code — no LLM)

Phase 2 Evals (with Bayesian integration)

Phase 3 Evals (agent enters)

Phase 4 Evals (live monitoring)


Key Insight from AlphaEval Paper

The AlphaEval framework proposes five dimensions for evaluating alpha signals without full backtesting:

  1. Predictive power (IC, RankIC)
  2. Temporal stability (IC consistency over time)
  3. Robustness (performance under market perturbations)
  4. Financial logic (interpretability — does the signal make economic sense?)
  5. Diversity (is this signal different from existing signals?)

Dimensions 4-5 are especially relevant for LLM-generated signals because LLMs tend to converge on the same factors (the “homogenization” problem from AlphaAgent). Our eval should check that each signal the agent produces is (a) economically motivated and (b) not just a restatement of another signal.
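The diversity check (b) can be sketched as a pairwise-correlation screen, assuming each signal is a vector of per-asset scores over the same universe; the 0.8 redundancy threshold is an assumption:

```python
# diversity_check.py -- sketch of the AlphaEval "diversity" dimension: flag a
# candidate signal that is mostly a restatement of an existing one.
# The 0.8 correlation threshold is an assumed cutoff, not a calibrated value.
import numpy as np

def max_abs_correlation(candidate, existing_signals):
    """Highest |corr| between the candidate's scores and any existing signal."""
    return max(abs(np.corrcoef(candidate, s)[0, 1]) for s in existing_signals)

def is_redundant(candidate, existing_signals, threshold=0.8):
    return max_abs_correlation(candidate, existing_signals) > threshold
```

A stricter version would regress the candidate on the whole existing signal set and look at incremental R², but a pairwise screen catches the blunt homogenization cases first.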


Minimum Viable Eval

If we do nothing else, do this:

  1. Historical replay IC test for each signal type, 2020-2025, 20-day horizon
  2. Null input test — agent produces no signals from empty/null data
  3. Confidence calibration — plot predicted confidence vs realized accuracy
  4. A/B portfolio sim — signal-adjusted vs static params, walk-forward

Together, these four tests catch the large majority of likely failure modes: hallucinated data, lookahead bias, miscalibrated confidence, and a pipeline that hurts rather than helps the portfolio.
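Test 3 (confidence calibration) can be sketched by bucketing predicted confidence and comparing each bucket's realized hit rate, assuming the agent emits a confidence in [0, 1] per signal:

```python
# calibration.py -- sketch of the confidence-calibration test. The number of
# buckets (5) is an assumed choice; outcomes are 1 if the signal's direction
# was realized, 0 otherwise.

def calibration_table(confidences, outcomes, n_buckets=5):
    """Per confidence bucket, return (mean predicted confidence, realized hit rate).
    A well-calibrated agent has the two numbers roughly equal in every bucket."""
    buckets = {}
    for conf, hit in zip(confidences, outcomes):
        b = min(int(conf * n_buckets), n_buckets - 1)
        buckets.setdefault(b, []).append((conf, hit))
    return {
        b: (sum(c for c, _ in rows) / len(rows), sum(h for _, h in rows) / len(rows))
        for b, rows in sorted(buckets.items())
    }
```

Overconfidence shows up as buckets where predicted confidence sits well above the realized hit rate; that is exactly the failure mode to catch before the Bayesian update starts trusting the agent's confidence weights.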