CUAD Contract Review - Analysis and Conclusions

t-369.25 · WorkTask · Omni/Agent.hs
Parent: t-369 · Created 1 month ago · Updated 1 month ago

Description

Analyze results from contract review experiments and draw conclusions.

Context

After running the baseline (t-369.23) and swarm (t-369.24) experiments, analyze the data to determine:

  1. Does the STM-based swarm actually help?
  2. If so, why? If not, why not?
  3. What are the implications for the cognitive compute vision?
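
For orientation, a minimal sketch of the coordination mechanism under test, assuming the swarm shares detected patterns through a single `TVar` (all names here are illustrative; the real implementation lives in Omni/Agent.hs):

```haskell
-- Minimal sketch of STM-based pattern sharing between reviewers.
-- Hypothetical names; not taken from Omni/Agent.hs.
import Control.Concurrent.STM

type Pattern = String

-- One shared board of patterns detected so far across all contracts.
newHintBoard :: IO (TVar [Pattern])
newHintBoard = newTVarIO []

-- A reviewer records a pattern found in the current contract.
sharePattern :: TVar [Pattern] -> Pattern -> IO ()
sharePattern board p = atomically (modifyTVar' board (p :))

-- A reviewer reads the accumulated hints before its next contract.
readHints :: TVar [Pattern] -> IO [Pattern]
readHints = readTVarIO
```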

Deliverables

1. Quantitative Analysis

## Results Summary

### Scale Comparison
| Contracts | Single F1 | Swarm F1 | Single Time | Swarm Time |
|-----------|-----------|----------|-------------|------------|
| 5         | ?         | ?        | ?           | ?          |
| 10        | ?         | ?        | ?           | ?          |
| 20        | ?         | ?        | ?           | ?          |
| 50        | ?         | ?        | ?           | ?          |

### Per-Clause-Type Performance
| Clause Type | Single P/R/F1 | Swarm P/R/F1 | Delta |
|-------------|---------------|--------------|-------|
| Indemnification | ? | ? | ? |
| Liability | ? | ? | ? |
| ... | ... | ... | ... |

### Cost Analysis
| Mode | Tokens (N=50) | Cost (N=50) | Cost/Contract |
|------|---------------|-------------|---------------|
| Single | ? | ? | ? |
| Swarm | ? | ? | ? |
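
A small scoring sketch for filling in the cells above, assuming span-level exact match so each mode and clause type reduces to true-positive / false-positive / false-negative counts (`f1Score` is an illustrative helper, not an existing function):

```haskell
-- Precision / recall / F1 from per-clause-type match counts
-- (tp = correct extractions, fp = spurious, fn = missed gold clauses).
f1Score :: Int -> Int -> Int -> (Double, Double, Double)
f1Score tp fp fn = (p, r, f)
  where
    p = safeDiv tp (tp + fp)
    r = safeDiv tp (tp + fn)
    f = if p + r == 0 then 0 else 2 * p * r / (p + r)
    safeDiv _ 0 = 0
    safeDiv a b = fromIntegral a / fromIntegral b
```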

2. Qualitative Analysis

Did hints help?

  • Compare mean F1 over the first 5 contracts vs the last 5, in swarm review order (sketch below)
  • If hints help, later contracts should score higher
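
A minimal sketch of that check, assuming per-contract F1 scores are available in review order (`learningDelta` is hypothetical):

```haskell
-- Learning-effect check: mean F1 of the last five contracts minus the
-- mean of the first five, in swarm review order. Positive => hints helped.
learningDelta :: [Double] -> Maybe Double
learningDelta f1s
  | length f1s < 10 = Nothing               -- need two disjoint windows
  | otherwise       = Just (mean lastFive - mean firstFive)
  where
    firstFive = take 5 f1s
    lastFive  = drop (length f1s - 5) f1s
    mean xs   = sum xs / fromIntegral (length xs)
```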

What patterns were detected?

  • List patterns found
  • Are they accurate? Useful?

Where did single agent fail?

  • At what N did quality drop?
  • What kinds of clauses were missed?
  • Was it context exhaustion or something else?

Where did swarm fail?

  • Any coordination issues?
  • Did agents step on each other?
  • Were hints misleading?

3. Ablation Studies (if time permits)

  • Swarm WITHOUT pattern sharing (just parallel)
  • Swarm WITHOUT hints (just parallel + patterns)
  • Different numbers of concurrent reviewers (configurations sketched below)
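
One way to parameterize these runs, as a hypothetical configuration record (field names are illustrative, not taken from Agent.hs):

```haskell
-- Hypothetical ablation configuration toggling the two sharing
-- mechanisms independently of raw parallelism.
data SwarmConfig = SwarmConfig
  { sharePatterns :: Bool  -- broadcast detected patterns via STM
  , shareHints    :: Bool  -- feed accumulated hints to later reviewers
  , numReviewers  :: Int   -- concurrent reviewer count
  }

ablations :: [SwarmConfig]
ablations =
  [ SwarmConfig False False 4  -- parallel only, no sharing
  , SwarmConfig True  False 4  -- patterns shared, no hints
  , SwarmConfig True  True  2  -- full swarm, fewer reviewers
  , SwarmConfig True  True  8  -- full swarm, more reviewers
  ]
```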

4. Conclusions Document

# CUAD Experiment Conclusions

## Key Finding
[One sentence: Did swarm help?]

## Evidence
[The data that supports the finding]

## Why This Happened
[Explanation of mechanism]

## Implications for Cognitive Compute

### If Swarm Helped:
- STM coordination is valuable for document review tasks
- Pattern learning across documents is a real advantage
- This validates the swarm approach for unstructured data

### If Swarm Didn't Help:
- Parallelism alone is sufficient
- STM overhead isn't worth it for this task type
- Need to find different task types where sharing matters

## Recommendations
[What to build/test next based on findings]

5. Update DESIGN.md

Add CUAD experiment results to Omni/Agent/DESIGN.md:

  • What we tested
  • What we learned
  • How it affects the architecture

Questions to Answer

1. Is there a "crossing point"?

  • The N at which swarm starts beating the single agent (detection sketch below)
  • Or does swarm always win, or always lose?
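
A trivial way to read the crossing point off the scale table, assuming one (N, single F1, swarm F1) row per scale (`crossingPoint` is illustrative):

```haskell
-- Smallest contract count N at which swarm F1 exceeds single-agent F1,
-- or Nothing if swarm never wins in the measured range.
crossingPoint :: [(Int, Double, Double)] -> Maybe Int
crossingPoint rows =
  case [n | (n, singleF1, swarmF1) <- rows, swarmF1 > singleF1] of
    []      -> Nothing
    (n : _) -> Just n
```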

2. What's the sharing value?

  • Swarm vs parallel-without-sharing on the same task
  • Isolate the benefit of STM coordination

3. What's the cost efficiency?

  • Quality per dollar for each approach (metric sketch below)
  • Which is more economical?
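
A minimal definition of that metric, assuming cost is total dollars for the full N=50 run (`qualityPerDollar` is a hypothetical helper):

```haskell
-- F1 points per dollar: higher is more economical. Guards against a
-- zero-cost denominator for degenerate runs.
qualityPerDollar :: Double -> Double -> Maybe Double
qualityPerDollar f1 costUsd
  | costUsd <= 0 = Nothing
  | otherwise    = Just (f1 / costUsd)
```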

4. Is this task representative?

  • Would results generalize to other document types?
  • What makes a task "swarm-suitable"?

Files

  • Omni/Agent/Experiments/CUAD_ANALYSIS.md (detailed analysis)
  • Omni/Agent/DESIGN.md (updated with findings)

Timeline (2)

🔄 [human] Open → InProgress · 1 month ago
🔄 [human] InProgress → Done · 1 month ago