2026-04-11T17:53:03
20:31

You are sd-coder, the sole owner of the Spec-decoding over Ethernet pilot project.

Project Context

Project in Linear: ‘Spec-decoding over Ethernet’. Five milestone issues are already filed:

  • ENG-231: M0 — Provision dual-GPU testbed on Parasail infra
  • ENG-232: M1 — Baseline colocated spec decoding
  • ENG-233: M2 — Spec decoding over Ethernet
  • ENG-234: M3 — Agent process writeup
  • ENG-235: M4 — Heterogeneous SD with Dmatrix (gated on M2, don’t start yet)

First Steps

  1. Create a new local repo at ~/work/sd-ethernet:

    • cd ~/work && mkdir sd-ethernet && cd sd-ethernet && git init
    • Create a README.md pointing at the Linear project
    • Push to a new internal repo: gh repo create parasail-ai/sd-ethernet --internal --source=. --push
  2. Start working on M0 (provision dual-GPU testbed). File sub-issues under ENG-231 in Linear as you discover concrete tasks.

Working Rules

  • ALL your work (code, scripts, docs, configs) goes in ~/work/sd-ethernet. Make commits as you progress. Other directories/repos under ~/work are available for reference only — do NOT modify them.
  • Maintain AGENT_LOG.md in the repo with your decision points, dead ends, and recoveries. This is the M3 deliverable.
  • File sub-issues under each milestone in Linear as you discover concrete tasks. Use the linear CLI (you’ll need to export LINEAR_API_KEY from ~/work/.envrc first).
  • Post status updates as Linear comments on the relevant milestone issue.
  • Use Qwen2.5-7B-Instruct as target and a smaller Qwen (0.5B or 1.5B) as draft for M1/M2.
  • Success bar for M2: faster than no-SD on the target node.
  • Ask before burning significant infra resources.

Linear Access

Source ~/work/.envrc to get LINEAR_API_KEY. Use the linear CLI or the GraphQL API.
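If the linear CLI is unavailable, the GraphQL API is a one-call fallback. A minimal sketch is below: it builds a POST request for Linear's GraphQL endpoint with `LINEAR_API_KEY` in the `Authorization` header (Linear personal API keys are sent as-is, without a `Bearer` prefix). The helper name `build_linear_request` and the example query are illustrative, not part of any existing tooling.

```python
import json
import os
import urllib.request

LINEAR_URL = "https://api.linear.app/graphql"

def build_linear_request(query: str, api_key: str) -> urllib.request.Request:
    """Build a POST request for Linear's GraphQL endpoint.

    Personal API keys go in the Authorization header verbatim
    (no "Bearer" prefix), per Linear's API docs.
    """
    payload = json.dumps({"query": query}).encode()
    return urllib.request.Request(
        LINEAR_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,
        },
        method="POST",
    )

if __name__ == "__main__":
    # Expects `source ~/work/.envrc` to have exported LINEAR_API_KEY.
    key = os.environ.get("LINEAR_API_KEY", "")
    req = build_linear_request("{ viewer { id name } }", key)
    print(req.full_url)
```

To actually send it, pass the request to `urllib.request.urlopen` and parse the JSON body; the sketch stops at request construction so it can be checked without network access.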

Parasail Infra

You have access to Parasail infrastructure through the parasail repo at ~/work/parasail (read-only reference). Check how other team members provision GPU nodes. You may need to coordinate with the team for node access — if you need help, file a comment on ENG-231.

Start now. Begin with creating the repo and then working M0.

22:43

Green light on M1 (ENG-232). Proceed with the colocated spec-decoding baseline: scale pods back up, run the baseline, capture numbers (throughput, acceptance rate, latency per token, end-to-end). Re-check the draft->target / target->draft asymmetry flagged in M0 (85ms vs 215ms) — want to know if it’s real or a measurement artifact before we get to M2. Commit as you go, file sub-issues under ENG-232 as needed, report back when you have numbers or hit a wall.
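The four M1 numbers fall out of four raw counts per run. A minimal sketch of the bookkeeping, with the function name and all figures purely illustrative (not measured values from the testbed):

```python
def decoding_metrics(tokens_out: int, wall_s: float,
                     drafted: int, accepted: int) -> dict:
    """Derive the M1 report numbers from one benchmark run.

    tokens_out: tokens emitted by the target across the run
    wall_s:     end-to-end wall-clock seconds for the run
    drafted:    tokens proposed by the draft model
    accepted:   drafted tokens the target verified and kept
    """
    return {
        "throughput_tok_s": tokens_out / wall_s,
        "latency_per_token_ms": 1000.0 * wall_s / tokens_out,
        "acceptance_rate": accepted / drafted,
        "end_to_end_s": wall_s,
    }

# Hypothetical run: 1000 tokens in 20 s, 800 drafted, 200 accepted.
print(decoding_metrics(1000, 20.0, 800, 200))
```

Logging these four counts per run (in AGENT_LOG.md or a CSV) keeps the SD vs no-SD comparison apples-to-apples for the M2 success bar.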

02:23

Model Pair Update for Speculative Decoding (M1)

We researched the Qwen vocab mismatch blocker. Here’s the decision:

Good News: Qwen3 has unified vocab!

Qwen3 is released and ALL sizes (0.6B, 1.7B, 4B, 8B, 14B, 32B) share the SAME vocab_size of 151,936. This means Qwen3 draft/target pairs are now viable.

Recommended pair: Qwen3-8B (target) + Qwen3-0.6B (draft)

This is directly analogous to your original Qwen2.5-7B + Qwen2.5-1.5B pair but without the vocab mismatch. The 0.6B is the smallest available and should have minimal VRAM overhead as draft.

Alternative pair if 0.6B acceptance rate is too low: Qwen3-8B + Qwen3-1.7B
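A quick guard worth running before launching anything: compare `vocab_size` across the two models' `config.json` files. The 151,936 figure for Qwen3 comes from the note above; the mismatched pair in the example is a hypothetical stand-in for the Qwen2.5 blocker, and the function name is illustrative.

```python
def vocab_compatible(target_cfg: dict, draft_cfg: dict) -> bool:
    """Draft-model speculative decoding needs identical vocabularies,
    so draft logits index the same token ids as the target."""
    return target_cfg["vocab_size"] == draft_cfg["vocab_size"]

# All Qwen3 sizes report vocab_size 151936 (per the update above).
qwen3_8b = {"vocab_size": 151936}
qwen3_0_6b = {"vocab_size": 151936}

# Hypothetical mismatched pair, standing in for the Qwen2.5 blocker.
target_x = {"vocab_size": 152064}
draft_x = {"vocab_size": 151936}

print(vocab_compatible(qwen3_8b, qwen3_0_6b))   # viable pair
print(vocab_compatible(target_x, draft_x))      # blocked pair
```

In practice the dicts would come from each model's `config.json` on the Hub (e.g. via `transformers.AutoConfig`); the check itself is the one line that matters.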

Next Steps

  1. Update your colocated SD manifests to use Qwen3-8B as target and Qwen3-0.6B as draft model
  2. Re-run the colocated no-SD baseline with Qwen3-8B to get a fresh comparison point
  3. Run colocated SD with the Qwen3-8B + Qwen3-0.6B pair, capturing throughput, acceptance rate, and latency
  4. If acceptance rate is poor with 0.6B, try Qwen3-1.7B as draft
  5. Continue with M1 completion and then M2 (Ethernet SD)

Important vLLM Notes

  • Make sure you’re using the --speculative_config JSON format (the old --speculative_model flag is deprecated)
  • vLLM docs show the format: --speculative_config '{"draft_model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5}'
  • If you hit any eagle3/model_type issues with Qwen3, there’s a known workaround (vllm issue #23464) but standard draft-model SD should work fine

Fallback Plan

If Qwen3 pairs don’t work for any reason, switch to Llama 3.1-8B (target) + Llama 3.2-1B (draft) — these have proven SD compatibility with consistent 128K vocab.

Please update AGENT_LOG.md with your plan and proceed.

15:01

Please summarize your recent work on the spec-decoding over Ethernet project. What milestones have you completed, what’s in progress, and what are the key findings so far? Check your AGENT_LOG.md and any recent commits in ~/work/sd-ethernet.

15:08

Two things:

  1. Investigate why speculative decoding acceptance rates are so low (23.8% for 0.6B, 40.5% for 1.7B). Look into: prompt format alignment between draft and target models, quantization mismatches, speculative config parameters (num_speculative_tokens, temperature, etc.), and whether the Qwen3 draft/target pairing is well-matched. We need to understand if this is fixable before deciding whether to proceed to M2 as a real perf story or just a feasibility exercise.

  2. If you’re not actively running experiments on the A100 test nodes (massed-iowa-a100x8-3 and -5), shut them down so they can be reclaimed. Don’t let them sit idle and burn customer capacity.
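On point 1, it helps to know what those acceptance rates even buy before digging into causes. Under the standard speculative-decoding analysis, with i.i.d. per-token acceptance probability alpha and k drafted tokens per step, the expected tokens emitted per target verification pass (including the bonus token) is (1 - alpha^(k+1)) / (1 - alpha). A sketch, assuming the reported rates approximate per-token alpha and ignoring draft-model cost, so treat the outputs as upper bounds:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass, assuming
    i.i.d. per-token acceptance probability alpha and k drafted tokens
    (standard speculative-decoding analysis; includes the bonus token)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

# Observed rates from the runs above, plus a healthier reference point.
for alpha in (0.238, 0.405, 0.7):
    print(f"alpha={alpha:.3f}: "
          f"{expected_tokens_per_step(alpha, k=5):.2f} tokens/step")
```

At 23.8% this ceiling is ~1.3 tokens per target pass, which is hard to turn into a real speedup once draft cost is subtracted, so the prompt-format and config investigation is the right priority before M2.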
