Phase 0 eval spec for prompt ops

t-403 · WorkTask · Created 1 month ago · Updated 1 month ago

Description

Draft the Phase 0 evaluation spec for the Factor/Compress/Compose experiments and create the initial spec doc.

Goal

Define a minimal, concrete spec format and a set of metrics so that t-394, t-395, and t-396 can be tested before theory work.

Deliverables

1. Spec document (Markdown) describing:

  • Task suite format (YAML/JSON)
  • Validator types (command, regex, json schema, file diff)
  • Metrics (fidelity, trace drift, cost, latency, variance)
  • Run protocol (baseline vs variants, repeat count)

2. Example spec file with 3-5 test cases.
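
To make the deliverables above concrete, here is a minimal sketch of how a suite, its validators, and the run protocol might fit together. It is not the content of the actual spec: every field name (`suite`, `repeat`, `cases`, `validator`, `json_keys`, `expected_path`) and every function name is a hypothetical chosen for illustration. JSON is used instead of YAML only to stay within the Python standard library, and `json_keys` is a simplified stand-in for real JSON Schema validation. Trace drift and cost are left out because the ticket does not define them.

```python
import json
import re
import statistics
import subprocess
import time

# Hypothetical suite layout; field names are illustrative, not the real spec.
SUITE_JSON = """
{
  "suite": "phase0-example",
  "repeat": 3,
  "cases": [
    {"id": "regex-ok", "validator": {"type": "regex", "pattern": "^OK"}},
    {"id": "json-keys", "validator": {"type": "json_keys", "required": ["answer"]}}
  ]
}
"""

def validate(validator, output):
    """Return True if `output` satisfies the validator description."""
    kind = validator["type"]
    if kind == "regex":
        return re.search(validator["pattern"], output) is not None
    if kind == "json_keys":
        # Simplified stand-in for JSON Schema: only checks required keys.
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in validator["required"])
    if kind == "command":
        # Passes when the command exits with status 0; output goes to stdin.
        proc = subprocess.run(validator["argv"], input=output, text=True,
                              capture_output=True)
        return proc.returncode == 0
    if kind == "file_diff":
        # Passes when the output matches the expected file exactly.
        with open(validator["expected_path"], encoding="ascii") as f:
            return f.read() == output
    raise ValueError("unknown validator type: " + kind)

def summarize(runs):
    """runs: list of runs, each a list of (passed, latency_seconds) pairs."""
    rates = [sum(passed for passed, _ in run) / len(run) for run in runs]
    latencies = [lat for run in runs for _, lat in run]
    return {
        "fidelity": statistics.mean(rates),       # mean pass rate over runs
        "variance": statistics.pvariance(rates),  # spread of pass rate across runs
        "mean_latency_s": statistics.mean(latencies),
    }

def run_suite(suite, produce):
    """Run every case `repeat` times; `produce(case)` stands in for the system under test."""
    runs = []
    for _ in range(suite["repeat"]):
        run = []
        for case in suite["cases"]:
            start = time.perf_counter()
            output = produce(case)
            latency = time.perf_counter() - start
            run.append((validate(case["validator"], output), latency))
        runs.append(run)
    return summarize(runs)

def baseline(case):
    # Hypothetical baseline: always returns a well-formed answer.
    return "OK" if case["id"] == "regex-ok" else '{"answer": 1}'

def variant(case):
    # Hypothetical variant: breaks the JSON case to show a fidelity drop.
    return "OK" if case["id"] == "regex-ok" else "not json"

suite = json.loads(SUITE_JSON)
baseline_report = run_suite(suite, baseline)
variant_report = run_suite(suite, variant)
```

Comparing `baseline_report["fidelity"]` with `variant_report["fidelity"]` gives the baseline-versus-variant reading that the run protocol bullet describes. A real harness would replace `baseline` and `variant` with calls into the agent runner.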

Notes

  • Prefer to align with existing eval harness conventions in Omni/Agent/Eval.
  • Keep it minimal and executable via agentd or Op runner later.
  • Use ASCII only.

Timeline (4)

🔄 [human] Open → InProgress · 1 month ago
💬 [human] · 1 month ago

Drafted Phase 0 spec and example suite: Omni/Agent/Eval/Phase0.md and Omni/Agent/Eval/phase0-suite.yaml

🔄 [human] InProgress → Done · 1 month ago