Speculative context pre-fetching (JIT optimization)

t-678 · WorkTask · Created 1 month ago · Updated 1 month ago

Description

Summary

Extend the JIT context rehydration (t-677) with speculative pre-fetching: instead of waiting for the model to call request_context, the compiler predicts what the model will likely need mid-turn and pre-hydrates those sections in the background.

Motivation

In PL JIT history, the interesting compilers (V8 TurboFan, LuaJIT trace compiler) evolved from 'compile hot paths on demand' to 'speculatively compile predicted paths with bailout.' The same evolution applies here.

With basic JIT (t-677), the model calls request_context, waits for hydration, then continues. With speculative pre-fetching, the compiler predicts 2-3 likely-needed sections at compile time and hydrates them in the background. If the model asks for one, it's instant (cache hit). If not, the pre-fetch is discarded (cheap bailout).

Design

Prediction signals

At initial compile time, the compiler has rich signals for prediction:

  • Task type: coding tasks likely need file contents, research tasks likely need web/memory
  • Conversation trajectory: recent topic shifts suggest what context the model will reach for
  • Historical patterns: which request_context calls followed which task types in past runs (trace-based optimization)
  • Section adjacency: if the model asked for section X, it often also needs section Y
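These signals could be combined into a single per-section score. A minimal sketch, assuming a weighted linear combination; the weight values and signal names below are hypothetical, and in practice the weights would be fit from real usage traces as the implementation note says:

```python
from dataclasses import dataclass

# Hypothetical weights; in practice these would be fit from usage traces.
WEIGHTS = {"task_type": 0.4, "trajectory": 0.2, "history": 0.3, "adjacency": 0.1}

@dataclass
class Candidate:
    section_id: str
    signals: dict  # signal name -> strength in [0, 1]

def score(candidate: Candidate) -> float:
    """Weighted combination of the prediction signals for one section."""
    return sum(WEIGHTS[name] * candidate.signals.get(name, 0.0)
               for name in WEIGHTS)

def top_n(candidates: list[Candidate], n: int = 3) -> list[Candidate]:
    """Pick the N sections most likely to be requested mid-turn."""
    return sorted(candidates, key=score, reverse=True)[:n]
```

The top-N cutoff is where the speculation budget lives: a larger N raises the cache hit rate but also the wasted-hydration cost tracked under Metrics.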

Architecture

1. Prediction pass: after AOT compilation, run a lightweight prediction pass that scores candidate sections by likelihood of being requested.
2. Background hydration: the top-N candidates are hydrated in parallel (async) and stored in a warm cache keyed by section ID / query signature.
3. Cache integration: when request_context fires in the interpreter, check the warm cache first. A cache hit is an instant splice with no hydration latency; a cache miss falls back to on-demand hydration (t-677 behavior).
4. Bailout: unused pre-fetched sections are discarded at end of turn. No wasted prompt budget: they only enter the prompt if explicitly requested.
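The steps above can be sketched as a warm cache wrapped around the on-demand path. This is an illustrative sketch, not the real API: hydrate_section stands in for whatever expensive hydration t-677 performs, and WarmCache is a hypothetical name:

```python
import asyncio

async def hydrate_section(section_id: str) -> str:
    # Stand-in for real (expensive) hydration, e.g. a file read or web fetch.
    await asyncio.sleep(0)
    return f"<hydrated {section_id}>"

class WarmCache:
    """Speculatively pre-fetched sections, keyed by section ID."""

    def __init__(self):
        self._tasks: dict[str, asyncio.Task] = {}

    def prefetch(self, section_ids: list[str]) -> None:
        # Step 2: kick off background hydration for the top-N predictions.
        for sid in section_ids:
            self._tasks[sid] = asyncio.ensure_future(hydrate_section(sid))

    async def request_context(self, section_id: str) -> str:
        # Step 3: cache hit = await the already-running task (usually done);
        # cache miss = fall back to on-demand hydration (t-677 behavior).
        task = self._tasks.pop(section_id, None)
        if task is not None:
            return await task
        return await hydrate_section(section_id)

    def bailout(self) -> None:
        # Step 4: discard unused pre-fetches at end of turn. Cancelling is
        # cheap because nothing entered the prompt unless it was requested.
        for task in self._tasks.values():
            task.cancel()
        self._tasks.clear()
```

Because pre-fetched sections live as pending tasks rather than prompt text, the bailout really is cheap: discarding them costs nothing in prompt budget, only the hydration work already spent.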

Why this works better in compiler approach than LCM

LCM's runtime approach can't predict what the model will need because it doesn't have a global view of the task. Our compiler sees the full ContextRequest, task metadata, and historical patterns at compile time — enough to make useful predictions before the model even starts reasoning.

Metrics

  • Cache hit rate on speculative pre-fetches
  • Latency reduction on request_context calls (cache hit vs miss)
  • Wasted hydration cost (pre-fetches that were never used)
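A sketch of how these three metrics could be computed from per-turn counters; the TurnStats field names are illustrative, not an existing schema:

```python
from dataclasses import dataclass

@dataclass
class TurnStats:
    prefetched: int         # sections speculatively hydrated this turn
    prefetch_hits: int      # request_context calls served from the warm cache
    hit_latency_ms: float   # mean latency of cache-hit requests
    miss_latency_ms: float  # mean latency of on-demand (cache-miss) requests

def cache_hit_rate(s: TurnStats) -> float:
    """Fraction of speculative pre-fetches the model actually requested."""
    return s.prefetch_hits / s.prefetched if s.prefetched else 0.0

def latency_reduction_ms(s: TurnStats) -> float:
    """How much a cache hit saves versus on-demand hydration."""
    return s.miss_latency_ms - s.hit_latency_ms

def wasted_hydration(s: TurnStats) -> int:
    """Pre-fetches that were never used (the cheap-bailout cost)."""
    return s.prefetched - s.prefetch_hits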

Dependencies

  • Depends on t-677 (basic JIT context rehydration) — implement and validate that first
  • Related: t-399 (rate-distortion pruning), t-432 (compaction), t-345 (RLM integration)

Implementation note

This is an optimization pass. Only implement after t-677 is working and we have data on what request_context calls the model actually makes in practice. The prediction model should be informed by real usage traces, not guesses.
