AI-Augmented Quant Pipeline: Research Notes

Date: 2026-03-11
Context: Ben wants to use Omni/Agent (Op free monad) to automate signal discovery and alpha combination (steps 1 & 2 of the quant pipeline), feeding into his existing Omni/Fund/Invest.hs portfolio model (Kelly optimization, Monte Carlo, rebalancing).


1. Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│  SIGNAL DISCOVERY AGENTS (Op programs)                          │
│                                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │
│  │ EDGAR    │  │ Macro    │  │ Sentiment│  │ Price    │        │
│  │ Agent    │  │ Agent    │  │ Agent    │  │ Agent    │        │
│  │(Form 4, │  │(FRED,    │  │(Earnings │  │(Momentum │        │
│  │ 10-K,   │  │ BLS,     │  │ calls,   │  │ Mean-rev │        │
│  │ 8-K)    │  │ Treasury)│  │ News NLP)│  │ Vol)     │        │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘        │
│       │              │              │              │              │
│       └──────────────┴──────────────┴──────────────┘              │
│                          │                                        │
│                    Op.par [...]                                   │
│                          │                                        │
│                    ┌─────▼──────┐                                 │
│                    │ ALPHA      │                                 │
│                    │ COMBINER   │                                 │
│                    │ (Bayesian  │                                 │
│                    │  update of │                                 │
│                    │  μ, σ, Σ)  │                                 │
│                    └─────┬──────┘                                 │
└──────────────────────────┼───────────────────────────────────────┘
                           │
                     JSON output:
                     updated AssetModel params
                           │
┌──────────────────────────▼───────────────────────────────────────┐
│  EXISTING HASKELL PIPELINE (Invest.hs)                           │
│                                                                   │
│  AssetModel(μ,σ,yield) → kellyOptimalN → runSimulation          │
│       → computeDeltas → rebalancing signals on invest page       │
└──────────────────────────────────────────────────────────────────┘

Key Insight: Agent = Eyes & Ears, Haskell = Brain

The LLM agents do information gathering and structuring (what a human analyst does). The math (Kelly, MC, optimization) stays in deterministic Haskell code. The agent NEVER outputs portfolio weights or expected-return numbers directly; it outputs structured signal data that deterministic code converts into parameter updates.


2. What Already Exists

Invest.hs (the brain)

Op.hs (the agent framework)

Config.hs (the parameters to update)


3. Signal Sources (Free/Public Data)

Tier 1: Easy to implement, well-documented APIs

Source        | Data                                                   | Signal Type                   | API                                    | Update Freq
SEC EDGAR     | Form 4 insider trades                                  | Insider sentiment             | data.sec.gov (free, no key)            | Real-time
FRED          | 840K macro series (M2, yield curve, CPI, unemployment) | Macro regime                  | api.stlouisfed.org (free key)          | Daily-monthly
Yahoo Finance | OHLCV price history                                    | Momentum, vol, mean-reversion | Unofficial REST API (or Alpha Vantage) | Daily
Treasury.gov  | Yield curves                                           | Risk-free rate, term premium  | api.fiscaldata.treasury.gov            | Daily
CFTC COT      | Futures positioning                                    | Sentiment/positioning         | cftc.gov (CSV)                         | Weekly

Tier 2: Moderate effort, high value

Source               | Data                   | Signal Type       | Update Freq
Earnings transcripts | Call text + estimates  | PEAD, sentiment   | Quarterly
Patent filings       | USPTO PAIR             | Innovation signal | Monthly
Job postings         | Indeed/LinkedIn scrape | Growth signal     | Weekly
App store rankings   | Apple/Google           | Revenue proxy     | Daily
FINRA Short Interest | Short % of float       | Crowding/squeeze  | Bi-weekly

Tier 3: Needs more infra but powerful

Source                 | Data                            | Signal Type
Reddit/Twitter         | Sentiment NLP on financial subs | Retail sentiment
Satellite/weather      | NOAA, Sentinel                  | Commodity/agriculture
Government procurement | SAM.gov, USAspending            | Revenue leading indicator
Shipping data          | AIS vessel tracking             | Global trade leading indicator

Recommendation for v1: Start with Tier 1 only. EDGAR + FRED + market data gives you insider signal + macro regime + price-based signals. That’s a complete foundation.


4. The Alpha Combination Problem

Current Approach: Static Priors (What Invest.hs does now)

μ_BTC = 0.15  (hardcoded)
σ_BTC = 0.65  (hardcoded)

Target Approach: Black-Litterman-style Bayesian Updating

The Black-Litterman model is the right framework here. It:

  1. Starts with a prior (your current Config.hs values)
  2. Incorporates views (signals from agents) with confidence levels
  3. Outputs a posterior (updated μ, σ, Σ)

Concretely:

Prior:     μ_prior = [0.15, 0.10, 0.08, 0.10]  (BTC, equities, RE, STRD)
           Σ_prior = diagonal([0.65², 0.18², 0.12², 0.05²])

Agent views (example):
  - "BTC 30-day realized vol is 45%, below historical mean" → confidence 0.7
  - "Insider buying cluster in XYZ (equities)" → confidence 0.5
  - "Yield curve inverted: recession signal" → confidence 0.6
  - "BTC trailing 90-day momentum positive" → confidence 0.4

Bayesian update:
  μ_posterior = μ_prior + τΣP'(PτΣP' + Ω)⁻¹(Q - Pμ_prior)
  where P = picking matrix, Q = view returns, Ω = view uncertainty

Output: updated AssetModel parameters for Invest.hs

The key insight: the agent produces the views (Q) and confidence levels (Ω); the Haskell code does the Bayesian math.
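The posterior-mean formula above can be transcribed almost directly into Haskell. A sketch assuming the hmatrix library (blPosteriorMean is a hypothetical name, not an existing Invest.hs function):

```haskell
import Numeric.LinearAlgebra

-- Posterior mean: mu' = mu + tau*Sigma*P' (P*tau*Sigma*P' + Omega)^-1 (Q - P*mu)
blPosteriorMean
  :: Double          -- tau, uncertainty in the prior (~0.01-0.05)
  -> Vector Double   -- mu, prior returns (n)
  -> Matrix Double   -- sigma, prior covariance (n x n)
  -> Matrix Double   -- p, picking matrix (k x n)
  -> Vector Double   -- q, view returns (k)
  -> Matrix Double   -- omega, view uncertainty (k x k)
  -> Vector Double
blPosteriorMean tau mu sigma p q omega =
  mu + tauSigmaPt #> (inv middle #> (q - p #> mu))
  where
    tauSigmaPt = scale tau sigma <> tr p   -- tau*Sigma*P'         (n x k)
    middle     = p <> tauSigmaPt + omega   -- P*tau*Sigma*P' + Omega  (k x k)
```

With a small τ the correction term barely moves the prior, which matches the conservative stance discussed in section 11.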

What needs to be built in Invest.hs:

  1. Correlation matrix support (currently assumes uncorrelated)

    • Full Kelly: f* = Σ⁻¹(μ - r) (matrix inverse, not scalar division)
    • Need Cholesky decomposition for correlated MC paths
  2. Black-Litterman update function

    blUpdate :: ReturnVector -> CovMatrix -> [View] -> (ReturnVector, CovMatrix)
    
  3. Signal → View converter

    -- Agent output format
    data Signal = Signal
      { sigAsset :: Text
      , sigType :: SignalType  -- Momentum | MeanReversion | Insider | Macro | ...
      , sigStrength :: Double  -- z-score or normalized value
      , sigConfidence :: Double  -- 0-1
      , sigSource :: Text  -- provenance
      , sigTimestamp :: UTCTime
      }
    
    -- Convert to Black-Litterman view
    signalToView :: Signal -> View
    

5. Native Haskell Data Libraries + Op Programs

Architecture Decision

No external Python calls. All data access is native Haskell, using typed wrappers around REST/JSON APIs built on http-conduit (same stack as Omni.Agent.Tools.Http).

Three stepping-stone libraries:

These are standalone, useful libraries independent of the agent pipeline. They become the data foundation that Op programs call via typed Haskell functions rather than shelling out to Python.

5.1 Omni.Fund.Data.Edgar

SEC EDGAR API is free, no auth, JSON at data.sec.gov.

Key endpoints are documented in the SEC's EDGAR API reference: https://www.sec.gov/search-filings/edgar-application-programming-interfaces

module Omni.Fund.Data.Edgar where

-- | Company submissions (filing history)
data Submissions = Submissions
  { subCik        :: Text
  , subName       :: Text
  , subTickers    :: [Text]
  , subFilings    :: [Filing]
  }

data Filing = Filing
  { filingType    :: Text        -- "4", "10-K", "8-K", etc.
  , filingDate    :: Day
  , filingAccNo   :: Text        -- accession number
  , filingUrl     :: Text
  }

-- | Insider transaction from Form 4
data InsiderTransaction = InsiderTransaction
  { itReportingPerson :: Text
  , itRelationship    :: Text    -- "Officer", "Director", "10% Owner"
  , itTransactionType :: Text    -- "P" (purchase), "S" (sale)
  , itShares          :: Double
  , itPricePerShare   :: Double
  , itDate            :: Day
  , itTicker          :: Text
  }

-- Core API functions
getSubmissions :: Text -> IO (Either EdgarError Submissions)
getForm4Filings :: Text -> Int -> IO (Either EdgarError [InsiderTransaction])
getCompanyFacts :: Text -> IO (Either EdgarError CompanyFacts)
lookupCik :: Text -> IO (Either EdgarError Text)  -- ticker -> CIK

Note: EDGAR requires a User-Agent header with contact info per SEC policy.
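A sketch of the underlying fetch using http-conduit's Network.HTTP.Simple (same stack as Omni.Agent.Tools.Http). The contact string is a placeholder; EDGAR expects the CIK zero-padded to 10 digits in the submissions URL:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import Network.HTTP.Simple

-- Left-pad a CIK to the 10 digits the endpoint expects
padCik :: String -> String
padCik c = replicate (10 - length c) '0' <> c

-- Fetch the raw submissions JSON for a CIK, sending the
-- User-Agent contact header the SEC requires.
fetchSubmissionsJson :: String -> IO BS.ByteString
fetchSubmissionsJson cik = do
  req <- parseRequest ("https://data.sec.gov/submissions/CIK" <> padCik cik <> ".json")
  let req' = setRequestHeader "User-Agent" ["omni-fund admin@example.com"] req
  getResponseBody <$> httpBS req'
```

getSubmissions would then decode this bytestring into the Submissions record via aeson.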

5.2 Omni.Fund.Data.Market

For price/volume data. Alpha Vantage has a free tier (25 requests/day) with a clean REST API. Alternatives: Twelve Data or Polygon.io. All are simple JSON APIs.

module Omni.Fund.Data.Market where

data OHLCV = OHLCV
  { oDate   :: Day
  , oOpen   :: Double
  , oHigh   :: Double
  , oLow    :: Double
  , oClose  :: Double
  , oVolume :: Integer
  }

data TimeSeriesInterval = Daily | Weekly | Monthly

-- Core API functions
getDailyPrices :: Text -> Int -> IO (Either MarketError [OHLCV])
  -- ^ ticker, num days

-- Derived computations (pure Haskell, no API call)
trailingReturn :: Int -> [OHLCV] -> Double
  -- ^ window size, price history -> annualized return

realizedVol :: Int -> [OHLCV] -> Double
  -- ^ window size -> annualized volatility

meanReversionZ :: Int -> [OHLCV] -> Double
  -- ^ SMA window -> z-score of current price vs SMA

correlationMatrix :: [[OHLCV]] -> Matrix Double
  -- ^ price histories for N assets -> NxN correlation matrix

The pure computation functions (trailing return, vol, z-score, correlation) are deterministic math that lives in Haskell — no LLM, no external calls. These replace the numpy computations from the original Python design.
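Possible implementations of the pure pieces, written over a plain list of closing prices (the OHLCV versions would map oClose first). The primed names and the 252 trading-day annualization convention are assumptions of this sketch:

```haskell
-- Daily log returns from a chronologically ordered price series
logReturns :: [Double] -> [Double]
logReturns ps = zipWith (\a b -> log (b / a)) ps (drop 1 ps)

-- Last k elements of a list
lastN :: Int -> [a] -> [a]
lastN k xs = drop (length xs - k) xs

-- Annualized return over the trailing window (window in trading days)
trailingReturn' :: Int -> [Double] -> Double
trailingReturn' window closes =
  let ps = lastN (window + 1) closes
  in log (last ps / head ps) * (252 / fromIntegral window)

-- Annualized volatility: sample stddev of daily log returns, scaled by sqrt 252
realizedVol' :: Int -> [Double] -> Double
realizedVol' window closes =
  let rs  = lastN window (logReturns closes)
      n   = fromIntegral (length rs)
      m   = sum rs / n
      var = sum [ (r - m) ^ 2 | r <- rs ] / (n - 1)
  in sqrt (var * 252)
```

Both functions are total on sufficiently long inputs; production versions should validate that the history covers the requested window.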

5.3 Omni.Fund.Data.Fred

FRED API: free with API key, REST/JSON, well-documented. Ref: https://fred.stlouisfed.org/docs/api/fred/

Evaluate gborough/fred on Hackage first — if it works, use it. If stale, write a thin wrapper (the API is ~10 endpoints, mostly series/observations).

module Omni.Fund.Data.Fred where

data FredSeries = FredSeries
  { fsId          :: Text       -- e.g. "T10Y2Y"
  , fsTitle       :: Text
  , fsFrequency   :: Text       -- "Daily", "Monthly"
  , fsUnits       :: Text
  }

data Observation = Observation
  { obsDate  :: Day
  , obsValue :: Maybe Double    -- FRED uses "." for missing
  }

-- Core API functions
getSeriesObservations :: Text -> Day -> Day -> IO (Either FredError [Observation])
  -- ^ series_id, start, end

getLatestValue :: Text -> IO (Either FredError Double)
  -- ^ series_id -> most recent observation

-- Convenience: pull a batch of key macro series
data MacroSnapshot = MacroSnapshot
  { msYieldCurve   :: Double    -- T10Y2Y
  , msM2Growth     :: Double    -- M2SL yoy change
  , msUnemployment :: Double    -- UNRATE
  , msCPI          :: Double    -- CPIAUCSL
  , msHYSpread     :: Double    -- BAMLH0A0HYM2
  , msVIX          :: Double    -- VIXCLS
  , msTimestamp    :: UTCTime
  }

getMacroSnapshot :: IO (Either FredError MacroSnapshot)
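One way getMacroSnapshot might derive msM2Growth from raw monthly observations (the Observation record is repeated so the sketch stands alone; yoyGrowth is a hypothetical helper, assuming observations arrive oldest-first):

```haskell
import Data.Time (Day, fromGregorian)

data Observation = Observation
  { obsDate  :: Day
  , obsValue :: Maybe Double    -- FRED uses "." for missing
  }

-- Year-over-year growth: latest value vs the value 12 monthly
-- observations earlier, skipping missing (Nothing) entries.
yoyGrowth :: [Observation] -> Maybe Double
yoyGrowth obs =
  case reverse [ v | Observation _ (Just v) <- obs ] of
    (latest : rest) | length rest >= 12 ->
      let yearAgo = rest !! 11
      in Just (latest / yearAgo - 1)
    _ -> Nothing
```

The same shape works for any monthly FRED series; for daily series the lookback index would change accordingly.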

Key FRED series for the signal pipeline: T10Y2Y (yield curve), M2SL (money supply), UNRATE (unemployment), CPIAUCSL (CPI), BAMLH0A0HYM2 (high-yield spread), VIXCLS (VIX).

5.4 Op Programs Using Native Libraries

With the data libraries in place, the Op programs become clean compositions:

signalScan :: [Text] -> Op SignalState [Signal]
signalScan assets = do
  Op.checkpoint "init"

  results <- Op.par
    [ insiderSignals assets
    , macroSignals
    , priceSignals assets
    ]

  Op.checkpoint "signals-gathered"
  let allSignals = concat results

  -- LLM assesses cross-signal coherence
  coherenceAdjusted <- assessCoherence allSignals
  pure coherenceAdjusted

-- Insider signals: native EDGAR API, LLM interprets
checkInsider :: Text -> Op SignalState [Signal]
checkInsider ticker = do
  -- Direct Haskell call, no Python
  filings <- Op.io $ Edgar.getForm4Filings ticker 10

  case filings of
    Left err -> do
      Op.log ("EDGAR error for " <> ticker <> ": " <> show err)
      pure []
    Right txns -> do
      -- Filter significant purchases
      let significant = filter isSignificantPurchase txns
      -- LLM interprets patterns (optional — could be pure rules)
      if null significant
        then pure []
        else do
          response <- Op.infer (Op.Model "claude-sonnet-4-20250514")
            defaultContextRequest
              { crObservation = "Analyze these insider transactions for " <> ticker
                             <> ":\n" <> formatTransactions significant
              , crGoal = Just "Extract insider trading signals"
              }
          pure (parseInsiderSignals response)

-- Macro signals: native FRED API, LLM interprets regime
macroSignals :: Op SignalState [Signal]
macroSignals = do
  snapshot <- Op.io Fred.getMacroSnapshot
  case snapshot of
    Left err -> do
      Op.log ("FRED error: " <> show err)
      pure []
    Right ms -> do
      response <- Op.infer (Op.Model "claude-sonnet-4-20250514")
        defaultContextRequest
          { crObservation = formatMacroSnapshot ms
          , crGoal = Just "Assess macro regime and generate signals"
          }
      pure (parseMacroSignals response)

-- Price signals: native Market API, pure Haskell math, no LLM needed
priceSignals :: [Text] -> Op SignalState [Signal]
priceSignals tickers = do
  now <- Op.io getCurrentTime
  histories <- Op.io $ mapM (\t -> (t,) <$> Market.getDailyPrices t 252) tickers
  pure $ concatMap (mkPriceSignals now) histories
  where
    mkPriceSignals now (ticker, Right prices) =
      [ Signal ticker Momentum (Market.trailingReturn 63 prices) 0.7 "market_90d" now
      , Signal ticker Volatility (negate $ Market.realizedVol 63 prices) 0.8 "market_vol" now
      , Signal ticker MeanReversion (Market.meanReversionZ 200 prices) 0.5 "market_zscore" now
      ]
    mkPriceSignals _ (_, Left _) = []

Note: priceSignals is entirely deterministic — pure Haskell math on market data. No LLM involvement. The agent framework is used for orchestration (Op.par, Op.io) but the computation is typed, testable, and reproducible.

5.5 Implementation Order

  1. Omni.Fund.Data.Edgar — most valuable signal source, clean API, no auth needed
  2. Omni.Fund.Data.Market — needed for price signals and correlation matrix
  3. Omni.Fund.Data.Fred — macro context, evaluate gborough/fred first
  4. Wire into Invest.hs — replace static μ/σ with data-driven estimates
  5. Op programs — orchestrate the above with agent framework

Steps 1-3 are independently useful even without the agent pipeline.

6. Integration with Invest.hs

6.1 Signal Output Format (JSON)

{
  "timestamp": "2026-03-11T03:00:00Z",
  "signals": [
    {
      "asset": "BTC",
      "type": "momentum",
      "strength": 1.2,
      "confidence": 0.6,
      "source": "market_90d_trailing",
      "detail": "90-day trailing return annualized: 42%"
    },
    {
      "asset": "BTC",
      "type": "volatility",
      "strength": -0.8,
      "confidence": 0.8,
      "source": "market_realized_vol",
      "detail": "63-day realized vol: 45% vs historical 65%"
    },
    {
      "asset": "equities",
      "type": "insider",
      "strength": 0.5,
      "confidence": 0.4,
      "source": "edgar_form4",
      "detail": "3 C-suite purchases >$100K in SPY components this week"
    },
    {
      "asset": "ALL",
      "type": "macro_regime",
      "strength": -0.3,
      "confidence": 0.5,
      "source": "fred_composite",
      "detail": "Yield curve flat, M2 growth decelerating, VIX elevated"
    }
  ],
  "correlation_matrix": {
    "assets": ["BTC", "equities", "real_estate", "STRD"],
    "matrix": [[1.0, 0.45, 0.1, 0.05],
               [0.45, 1.0, 0.3, 0.1],
               [0.1, 0.3, 1.0, 0.05],
               [0.05, 0.1, 0.05, 1.0]]
  },
  "updated_params": {
    "BTC": {"mu": 0.18, "sigma": 0.55},
    "equities": {"mu": 0.11, "sigma": 0.18},
    "real_estate": {"mu": 0.08, "sigma": 0.12},
    "STRD": {"mu": 0.10, "sigma": 0.05}
  }
}
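The Haskell side of readSignals could decode this format with aeson. A sketch covering a subset of the fields ("detail" and others are ignored by withObject; SignalJson is a hypothetical intermediate type):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Text (Text)

data SignalJson = SignalJson
  { sjAsset      :: Text
  , sjType       :: Text
  , sjStrength   :: Double
  , sjConfidence :: Double
  , sjSource     :: Text
  } deriving (Show)

instance FromJSON SignalJson where
  parseJSON = withObject "Signal" $ \o ->
    SignalJson
      <$> o .: "asset"
      <*> o .: "type"
      <*> o .: "strength"
      <*> o .: "confidence"
      <*> o .: "source"
```

A SignalBundle decoder would wrap this with parsers for "timestamp", "signals", "correlation_matrix", and "updated_params".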

6.2 New Haskell Code Needed

-- In Invest.hs or a new SignalIntegration.hs module:

-- | Read signal file produced by the agent pipeline
readSignals :: FilePath -> IO (Either Text SignalBundle)

-- | Update AssetModel parameters using Bayesian update
applySignals :: PortfolioModel -> SignalBundle -> PortfolioModel

-- | Black-Litterman update (the core math)
blUpdate 
  :: Vector Double        -- prior returns (μ)
  -> Matrix Double        -- prior covariance (Σ)  
  -> Double               -- tau (confidence scalar, ~0.05)
  -> Matrix Double        -- picking matrix (P)
  -> Vector Double        -- views (Q)
  -> Matrix Double        -- view uncertainty (Ω)
  -> (Vector Double, Matrix Double)  -- posterior (μ', Σ')

-- | Full Kelly with correlation matrix
kellyOptimalCorrelated
  :: Double               -- risk-free rate
  -> Vector Double        -- expected returns
  -> Matrix Double        -- covariance matrix
  -> Vector Double        -- optimal fractions (f* = Σ⁻¹(μ - r))

-- | Correlated GBM paths (Cholesky decomposition)
simulateCorrelated 
  :: Matrix Double        -- Cholesky factor of Σ
  -> ...                  -- same args as current simulateOnePath

6.3 Integration Flow

1. Agent pipeline runs (daily cron or on-demand):
   - Op program executes signalScan
   - Outputs JSON to /var/fund/signals.json

2. fund-data daemon picks up signals.json on next refresh cycle (every 15 min)

3. Invest.hs reads signals.json:
   - Applies Bayesian update to prior μ/σ
   - Computes correlated Kelly weights
   - Runs MC simulation with updated parameters
   - Outputs deltas for invest page

4. Invest page shows:
   - Current signal readings with confidence
   - How signals changed the expected returns
   - Updated Kelly weights vs current allocation
   - MC fan chart with signal-adjusted parameters

7. Relevant Literature

AlphaAgent (Feb 2025)

Black-Litterman Model

Post-Earnings Announcement Drift (PEAD)


8. Implementation Plan

Phase 1: Market Data Foundation (1-2 weeks)

This phase involves NO LLM. Just native Haskell data access + math. No Python in the loop — all three data libraries are thin typed wrappers around REST/JSON APIs, built on the same http-conduit stack as Omni.Agent.Tools.Http.

Phase 2: Bayesian Integration (1-2 weeks)

This phase is pure Haskell math. Still no LLM.

Phase 3: Agent-Augmented Signals (2-3 weeks)

This is where the LLM enters. The agent interprets data, not computes numbers.

Phase 4: Feedback & Decay Tracking (ongoing)


9. Missing Pieces / Gaps

In Invest.hs:

  1. Correlation matrix — currently assumes uncorrelated. Need hmatrix or similar for linear algebra (matrix inverse, Cholesky).
  2. Dynamic parameters — currently reads static Config.hs. Need to read from signals.json and fall back to Config.hs defaults if no signals available.
  3. Signal display — invest page needs a “signals” section showing current readings and how they’re affecting the model.

In Op infrastructure:

  1. Scheduled execution — need a way to run Op programs on a cron schedule. Could use systemd timer + op-runner CLI, or integrate with agentd.
  2. Signal persistence — signals should be stored with timestamps so we can track decay over time. SQLite or just JSONL append?

Risks:

  1. Garbage in, garbage out — if the LLM misinterprets a signal, the Bayesian update will propagate the error. Mitigate with conservative τ (low confidence in views) and confidence clamping.
  2. Overfitting — backtesting on the same data used to develop signals. Need out-of-sample validation period.
  3. Latency — EDGAR filings are public instantly but our pipeline runs daily. For insider trading signals, same-day is fine. For price momentum, daily is fine. HFT signals are out of scope.

10. Quick Win: Dynamic μ/σ from Price Data (No LLM)

Before building the full agent pipeline, the single highest-value change:

-- Omni/Fund/UpdateParams.hs, run as a daily systemd timer
-- Uses native Haskell data libraries, no external dependencies

module Omni.Fund.UpdateParams where

import Control.Monad (forM)
import Data.Aeson (encode)
import qualified Data.ByteString.Lazy as BL
import Data.Text (Text)
import Data.Time (getCurrentTime)
import Omni.Fund.Data.Market
  (getDailyPrices, trailingReturn, realizedVol, correlationMatrix)

assets :: [(Text, Text)]  -- (API ticker, internal name)
assets = [("BTC-USD", "BTC"), ("SPY", "equities")]

updateParams :: IO ()
updateParams = do
  now <- getCurrentTime

  -- Pull ~2 years (504 trading days) of daily prices per asset
  priceHistories <- forM assets $ \(ticker, name) -> do
    result <- getDailyPrices ticker 504
    case result of
      Left err -> error ("price fetch failed for " <> show ticker <> ": " <> show err)
      Right prices -> do
        let mu    = trailingReturn 252 prices   -- 1-year trailing return
            sigma = realizedVol 252 prices      -- 1-year realized vol
        pure (name, mu, sigma, prices)

  -- Compute correlation matrix across all assets
  let allPrices  = [ ps | (_, _, _, ps) <- priceHistories ]
      corrMatrix = correlationMatrix allPrices

  -- Write signal bundle (SignalBundle as read by section 6.2's readSignals)
  let bundle = SignalBundle
        { sbTimestamp   = now
        , sbParams      = [ (name, mu, sigma) | (name, mu, sigma, _) <- priceHistories ]
        , sbCorrelation = corrMatrix
        }
  BL.writeFile "/var/fund/signals.json" (encode bundle)

This alone would make the invest page responsive to actual market conditions instead of using hardcoded assumptions. It’s the minimum viable signal pipeline. No Python, no external processes — just a Haskell executable on a timer.
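The timer wiring could look like this (unit names and the binary path are placeholders):

```ini
# /etc/systemd/system/fund-update-params.service
[Unit]
Description=Refresh fund signal parameters from market data

[Service]
Type=oneshot
ExecStart=/usr/local/bin/fund-update-params

# /etc/systemd/system/fund-update-params.timer
[Unit]
Description=Daily fund parameter refresh

[Timer]
OnCalendar=*-*-* 06:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now fund-update-params.timer`; Persistent=true catches up on a missed run after downtime.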


11. Summary of Decisions Needed

  1. Haskell matrix library: hmatrix (BLAS/LAPACK bindings, fast) vs pure Haskell (matrix, linear)? hmatrix is standard but adds a native C library dependency.

  2. Signal storage format: JSON file (simple) vs SQLite (queryable, historical)? Recommend: JSON file for v1, migrate to SQLite when tracking signal decay.

  3. Scheduling: systemd timer (simple) vs agentd integration (fancy)? Recommend: systemd timer for Phase 1-2, agentd for Phase 3+ when Op programs need budget/checkpoint/steering support.

  4. How much to trust agent views: τ parameter in Black-Litterman controls this. Start very conservative (τ = 0.01, views barely nudge the prior). Increase as we validate signal quality with realized IC measurements.

  5. Scope of asset universe: Current Invest.hs tracks ~5 assets (BTC, STRD, equities, RE, cash). Do we want to expand to individual stocks? Recommend: No for v1. Keep the asset universe small and focus on getting the pipeline working. Individual stock signals can come later.

