Analyze results from contract review experiments and draw conclusions.
After running the baseline (t-369.23) and swarm (t-369.24) experiments, analyze the data to determine:

1. Does the STM-based swarm actually help?
2. If so, why? If not, why not?
3. What are the implications for the cognitive compute vision?
## Results Summary
### Scale Comparison
| Contracts | Single F1 | Swarm F1 | Single Time | Swarm Time |
|-----------|-----------|----------|-------------|------------|
| 5 | ? | ? | ? | ? |
| 10 | ? | ? | ? | ? |
| 20 | ? | ? | ? | ? |
| 50 | ? | ? | ? | ? |
### Per-Clause-Type Performance
| Clause Type | Single P/R/F1 | Swarm P/R/F1 | Delta (Swarm - Single) |
|-------------|---------------|--------------|-------|
| Indemnification | ? | ? | ? |
| Liability | ? | ? | ? |
| ... | ... | ... | ... |
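The per-clause cells above can be filled from the raw extraction outputs. Below is a minimal sketch of the scoring arithmetic, assuming predictions and gold annotations are available as sets of clause spans keyed by (contract_id, clause_type) and matched exactly; the actual experiment harness and its matching criterion may differ.

```python
from collections import defaultdict

def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive/false-positive/false-negative counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def per_clause_scores(predictions: dict, gold: dict) -> dict:
    """predictions / gold map (contract_id, clause_type) -> set of extracted clause spans."""
    counts = defaultdict(lambda: [0, 0, 0])  # clause_type -> [tp, fp, fn]
    for key in set(predictions) | set(gold):
        _, clause_type = key
        pred, ref = predictions.get(key, set()), gold.get(key, set())
        counts[clause_type][0] += len(pred & ref)   # true positives
        counts[clause_type][1] += len(pred - ref)   # false positives
        counts[clause_type][2] += len(ref - pred)   # false negatives
    return {ct: prf1(tp, fp, fn) for ct, (tp, fp, fn) in counts.items()}
```

The Delta column is then the swarm F1 minus the single-agent F1 for each clause type.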
### Cost Analysis
| Mode | Tokens (N=50) | Cost (N=50) | Cost/Contract |
|------|---------------|-------------|---------------|
| Single | ? | ? | ? |
| Swarm | ? | ? | ? |
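The Cost/Contract column is simple arithmetic once token counts are known. A sketch, with per-1K-token prices passed in as parameters (the example prices are illustrative, not the actual model pricing):

```python
def run_cost(input_tokens: int, output_tokens: int, n_contracts: int,
             price_per_1k_input: float, price_per_1k_output: float) -> tuple[float, float]:
    """Return (total cost, amortized cost per contract) for one run."""
    total = (input_tokens / 1000) * price_per_1k_input \
          + (output_tokens / 1000) * price_per_1k_output
    return total, total / n_contracts

# Example with made-up numbers only:
# total, per_contract = run_cost(1_200_000, 150_000, n_contracts=50,
#                                price_per_1k_input=0.003, price_per_1k_output=0.015)
```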
### Qualitative Analysis
- Did the hints help?
- What patterns were detected across contracts?
- Where did the single agent fail?
- Where did the swarm fail?
# CUAD Experiment Conclusions
## Key Finding
[One sentence: Did swarm help?]
## Evidence
[The data that supports the finding]
## Why This Happened
[Explanation of mechanism]
## Implications for Cognitive Compute
### If Swarm Helped:
- STM coordination is valuable for document review tasks
- Pattern learning across documents is a real advantage
- This validates the swarm approach for unstructured data
### If Swarm Didn't Help:
- Parallelism alone is sufficient
- STM overhead isn't worth it for this task type
- Need to find different task types where sharing matters
## Recommendations
[What to build/test next based on findings]
Finally, add the CUAD experiment results to Omni/Agent/DESIGN.md, answering:
1. Is there a "crossing point", i.e., a contract count beyond which the swarm outperforms the single agent (see the sketch after this list)?
2. What is the value of sharing, i.e., how much of any swarm gain comes from STM hints rather than parallelism alone?
3. What is the cost efficiency of each mode (e.g., cost per contract)?
4. Is contract review representative of the task types where swarm coordination should matter?
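For question 1, one way to operationalize the "crossing point" is the smallest contract count at which swarm F1 exceeds single-agent F1 by some margin, scanned from the scale-comparison rows. This is a sketch only; the names below (ScalePoint, crossing_point) are illustrative and not part of the existing codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScalePoint:
    n_contracts: int
    single_f1: float
    swarm_f1: float

def crossing_point(points: list[ScalePoint], margin: float = 0.0) -> Optional[int]:
    """Smallest contract count where the swarm beats the single agent by more than `margin` F1."""
    for p in sorted(points, key=lambda p: p.n_contracts):
        if p.swarm_f1 - p.single_f1 > margin:
            return p.n_contracts
    return None  # no crossing point observed at the scales tested
```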