Manual test: orchestrator workflow with pi-code + pi-review

t-280.1 · WorkTask · Omni/Ava.hs
Parent: t-280 · Created 1 month ago · Updated 1 month ago

Description


Manually test the full orchestrator workflow before automating in Ava.

Workflow to Test

1. Pick a task from task ready
2. Run pi-code <task-id>
3. Run pi-review <task-id>
4. If REQUEST_CHANGES, run pi-code again with the review feedback
5. Repeat until APPROVE or REJECT
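The loop above can be sketched in shell. pi-code and pi-review are the scripts named in this task, but the assumption that pi-review prints its verdict (APPROVE / REQUEST_CHANGES / REJECT) as its last output line is mine, as is the iteration cap; this is a sketch of the manual process, not the eventual automation.

```shell
# Sketch of the manual test loop. Assumes pi-code and pi-review are on PATH
# and that pi-review's last output line is its verdict (an assumption).
# Passing the review feedback back into pi-code is omitted in this sketch.
run_cycle() {
  task_id=$1
  max_iters=${2:-3}
  for i in $(seq 1 "$max_iters"); do
    pi-code "$task_id" || return 1
    verdict=$(pi-review "$task_id" | tail -n 1)
    case "$verdict" in
      APPROVE) echo "approved after $i iteration(s)"; return 0 ;;
      REJECT)  echo "rejected"; return 2 ;;
    esac
    # REQUEST_CHANGES: fall through and run pi-code again
  done
  echo "no approval after $max_iters iterations"
  return 3
}
```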

Questions to Answer

  • How many iterations typically needed?
  • What kinds of failures occur?
  • Is the feedback loop effective?
  • What information would Ava need to report?
  • What intervention points make sense?

Test Cases

  • Simple task (like adding --help)
  • Medium task (small feature)
  • Task that fails review first time
  • Task that gets rejected

Document Findings

Add notes to this task about what worked, what didn't, and what Ava's interface should handle.

Timeline (5)

🔄 [human] Open → In Progress · 1 month ago
💬 [human] · 1 month ago

Manual Testing Results

Test Cases Run

1. t-282 (Simple): Add --version flag

  • Coder: ✅ completed (minor stderr noise at end)
  • Reviewer: ✅ approved first try
  • Iterations: 1

2. t-283 (Vague requirements): Add color output

  • Coder: ✅ interpreted requirements reasonably
  • Reviewer: ✅ approved (perhaps too lenient - didn't note missing pi-review.sh changes)
  • Iterations: 1

3. t-284 (Complex feature): Add --dry-run flag

  • Coder: ✅ comprehensive implementation
  • Reviewer: ✅ thorough review, approved
  • Iterations: 1

Observations

What worked well:

  • Coder produces working code on simple-to-medium tasks
  • Reviewer gives detailed analysis
  • End-to-end flow is smooth
  • Task status updates correctly

Issues noted:

  • Coder sometimes has stderr noise ("command not found") but changes are fine
  • Reviewer may be too lenient (approved t-283 without noting inconsistency)
  • All 3 tests passed first try - need a failure case to test feedback loop

What Ava needs to handle:

  • Spawn subprocess, capture exit code
  • Parse final status (done vs needs-help)
  • Report summary to user (not full output)
  • Handle timeout for long-running tasks
  • Retry loop if REQUEST_CHANGES
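The subprocess-handling items in the list above can be sketched in shell (the real integration is planned in Haskell). pi-orchestrate and the done/needs-help statuses come from this task; the 10-minute timeout value and the exact status-line format are assumptions.

```shell
# Sketch of the spawn/capture/timeout handling Ava would need.
# `timeout` is from GNU coreutils; exit code 124 means the deadline hit.
run_task() {
  task_id=$1
  out=$(timeout 600 pi-orchestrate "$task_id" 2>&1)
  code=$?
  if [ "$code" -eq 124 ]; then
    echo "$task_id: timed out"
  elif echo "$out" | grep -q "needs-help"; then
    echo "$task_id: needs help"
  else
    # Report a short summary to the user, not the full captured output
    echo "$task_id: done (exit $code)"
  fi
}
```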

Still need to test:

  • Task that gets REJECT
  • Task that gets REQUEST_CHANGES and requires retry
  • Complex multi-file refactoring
💬 [human] · 1 month ago

Additional Finding: Untracked Files Bug

During t-285, discovered that pi-review.sh doesn't show untracked (new) files in the diff presented to the reviewer. The script does check for untracked files, but plain git diff only covers tracked files, so new files never appear in the diff.

The reviewer correctly returned REQUEST_CHANGES, but for the wrong reason: it assumed the file had never been added, rather than recognizing that the diff it was shown was incomplete.

Created t-286 to fix this.
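One possible direction for that fix (not necessarily what t-286 ends up doing): mark untracked files as intent-to-add before diffing, so git diff includes their full content as additions. The helper name show_full_diff is hypothetical.

```shell
# Hypothetical helper: make untracked files visible to `git diff`.
# `git add --intent-to-add` records the paths without staging content,
# which is enough for git diff to show the new files as additions.
show_full_diff() {
  git ls-files --others --exclude-standard -z \
    | xargs -0 -r git add --intent-to-add
  git diff
}
```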

💬 [human] · 1 month ago

Final Testing Complete

Successfully tested pi-orchestrate.sh on t-286 (the untracked files bug).

Results

  • Orchestrator ran full cycle: coder → reviewer → commit
  • Completed in 1 iteration
  • Proper status tracking and exit
  • Clean output with progress indicators

What We Now Have

pi-code t-123       # Coder agent (claims task, makes changes)
pi-review t-123     # Reviewer agent (verify, approve/reject)
pi-orchestrate t-123 # Full loop with retries

Ready for Ava Integration

The bash prototypes work. Next steps:

1. t-280.2: Design Telegram interface
2. t-278.2/3: Convert to Haskell for Ava integration
3. t-280.3: Implement in Ava

🔄 [human] In Progress → Done · 1 month ago