Add a second large model (GPT-5.3) as devil's advocate in the spec-revision loop. This builds on MVP 2 by adding diverse model perspectives to catch blind spots.
Flow:
1. Task spec is drafted (by a human, or derived from the initial task description)
2. Opus writes/revises the spec (as in MVP 2)
3. GPT-5.3 reviews the spec as devil's advocate: 'What is ambiguous, underspecified, or likely to cause implementation failures in this spec?'
4. If GPT-5.3 identifies issues, feed them back to Opus for revision
5. Amended spec goes to executor gate (Sonnet/Haiku)
6. Loop until convergence
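The flow above can be sketched as a single revision loop. This is a minimal sketch under stated assumptions: `call_opus`, `call_gpt_review`, and `executor_gate` are hypothetical stand-ins for the real provider calls, and `MAX_CYCLES` mirrors the two-cycle oscillation cap from the acceptance criteria.

```python
MAX_CYCLES = 2  # oscillation guard: max Opus<->GPT cycles per gate pass


def call_opus(spec: str, feedback: list[str]) -> str:
    """Opus authors/amends the spec, addressing reviewer feedback (stub)."""
    return spec if not feedback else spec + "\n# revised: " + "; ".join(feedback)


def call_gpt_review(spec: str) -> list[str]:
    """GPT-5.3 devil's-advocate review: returns a list of issues (stub, empty = clean)."""
    return []


def executor_gate(spec: str) -> bool:
    """Sonnet/Haiku GO/NO-GO readiness check (stub)."""
    return True


def revise_spec(draft: str) -> str:
    spec = call_opus(draft, feedback=[])           # step 2: Opus writes/revises
    for _ in range(MAX_CYCLES):
        issues = call_gpt_review(spec)             # step 3: devil's advocate review
        if not issues:
            break
        spec = call_opus(spec, feedback=issues)    # step 4: feed issues back to Opus
    if executor_gate(spec):                        # step 5: GO/NO-GO gate
        return spec
    raise RuntimeError("NO-GO from executor gate; escalate to human")
```

The stubs keep the control flow visible: the reviewer only produces issues, the author is the only function that returns a spec, and the gate is a terminal check rather than another revision step.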
Role separation (NOT adversarial):
- Opus: spec author/amender — writes and revises the spec
- GPT-5.3: devil's advocate/reviewer — identifies ambiguity and risk
- Sonnet/Haiku: executor gate — GO/NO-GO readiness check
Implementation:
- Add model routing to the spec loop (the loop must be able to call different LLM providers)
- GPT-5.3 review step happens between Opus revision and executor gate
- Gate this on task complexity — simple tasks (complexity 1-2) skip multi-model review and use MVP 2 flow only
- Track how often GPT-5.3 catches issues Opus missed, and whether those catches correlate with task success
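The routing and complexity gating could look like the following sketch. The role-to-model mapping and the threshold of 3 come from this plan; the `ROLES` dict and `needs_multimodel_review` name are illustrative, not a fixed API.

```python
# Role-to-model routing for the spec loop (names illustrative).
ROLES = {
    "author": "opus",        # writes and revises the spec
    "reviewer": "gpt-5.3",   # devil's advocate, complex tasks only
    "gate": "sonnet",        # GO/NO-GO executor readiness check
}


def needs_multimodel_review(complexity: int) -> bool:
    """Complexity 1-2 stays on the MVP 2 single-model flow; 3+ gets GPT-5.3 review."""
    return complexity >= 3
```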
Oscillation prevention:
- Models play different roles (author vs reviewer), not both authoring
- GPT-5.3 identifies issues but does NOT rewrite the spec — Opus does all spec writing
- If the same issue recurs after being addressed, flag it and escalate to a human
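Repeat-issue detection could be sketched as below. This assumes issues can be fingerprinted by normalized text; a real system might use embeddings or reviewer-assigned issue IDs instead.

```python
def detect_repeats(seen: set[str], issues: list[str]) -> list[str]:
    """Return issues that have cycled back after being raised before.

    `seen` accumulates normalized fingerprints across cycles; any nonempty
    return value is the signal to stop looping and escalate to a human.
    """
    repeats = [i for i in issues if i.strip().lower() in seen]
    seen.update(i.strip().lower() for i in issues)
    return repeats
```

Usage: call once per GPT-5.3 review pass with the same `seen` set; if `detect_repeats` returns anything, the loop bounces the task to a human instead of running another Opus revision.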
Acceptance criteria:
- Complex tasks (complexity 3+) go through multi-model review
- Simple tasks (complexity 1-2) use MVP 2 single-model flow
- GPT-5.3 review comments are logged on the task
- No oscillation: max 2 Opus-GPT cycles per gate pass
- Metrics show whether multi-model review improves task success rate vs MVP 2 alone