EXP-02 — Three Engineers, One Standard
EXP-01 proved the guide changes the outcome in a single run. EXP-02 proves that change is stable. It repeats the same task 3 times per condition and measures whether the guide produces consistent output or just gets lucky once.
- 22.2× variance reduction
- 3/3 runs passed with guide
- 0/3 runs passed without
What the agent produced
Violations across 3 runs

| | Without guide | With guide |
|---|---|---|
| Run 1 | 2 violations ✗ | 0 violations ✓ pass |
| Run 2 | 1 violation ✗ | 0 violations ✓ pass |
| Run 3 | 1 violation ✗ | 0 violations ✓ pass |
| Variance | 0.22 | 0.00 |

Capabilities wired

| Without guide | With guide |
|---|---|
| (varies by run) | http-api ✓ consistent |
| | mysql ✓ consistent |
| | platform ✓ consistent |
| | bootstrap ✓ consistent |

Per-run breakdown (without guide)

| | Dep | Fn | Anti-pattern | Arch | Total | Result |
|---|---|---|---|---|---|---|
| Run 1 | 0 | 0 | 1 | 1 | 2 | FAIL |
| Run 2 | 0 | 0 | 1 | 0 | 1 | FAIL |
| Run 3 | 0 | 0 | 1 | 0 | 1 | FAIL |

The `uuid_v4_as_key` anti-pattern appeared on every control run: the agent reaches for `uuid.New()` non-deterministically. The architecture violation appeared once (run 1) and not again, creating the variance.
Metrics
| | Without guide | With guide |
|---|---|---|
| Violations | [2, 1, 1] | [0, 0, 0] |
| Variance | 0.22 | 0.00 |
| Pass rate | 0/3 | 3/3 |
The guide doesn’t just make agents better on average; it makes them consistently correct.
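The variance and pass-rate figures in the table can be reproduced from the violation counts. The sketch below assumes the reported 0.22 is the population variance (dividing by n = 3), which matches the numbers; the sample variance of [2, 1, 1] would be 0.33 instead. It also assumes a run passes only when it has zero violations. The `summarize` helper is a name introduced here for illustration.

```python
from statistics import pvariance

# Violation counts per run, copied from the metrics table.
without_guide = [2, 1, 1]
with_guide = [0, 0, 0]


def summarize(violations):
    """Return (population variance, passed runs, total runs) for one condition."""
    # Assumption: a run passes only when it has zero violations.
    passed = sum(1 for v in violations if v == 0)
    return pvariance(violations), passed, len(violations)


for label, counts in (("without guide", without_guide), ("with guide", with_guide)):
    variance, passed, total = summarize(counts)
    print(f"{label}: variance={variance:.2f}, pass rate={passed}/{total}")
# prints:
# without guide: variance=0.22, pass rate=0/3
# with guide: variance=0.00, pass rate=3/3
```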
- Task: Same as EXP-01 — build an order management service, free-form
- Agent: claude-sonnet-4-6 via `claude -p --output-format json`
- Runs: 3 per condition
- Primary metric: variance in violation count across runs
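One way the setup above could be driven is sketched here. Only the command line comes from the list; sending the prompt on stdin, the `run_condition` and `call_agent` helpers, and the `count_violations` callback are all assumptions made for illustration, not the harness actually used in the experiment.

```python
import json
import subprocess

# Command taken from the setup list; everything else here is illustrative.
CLAUDE_CMD = ["claude", "-p", "--output-format", "json"]


def call_agent(prompt):
    """Run the CLI once and parse its JSON output.

    Assumption: the prompt is supplied on stdin.
    """
    proc = subprocess.run(
        CLAUDE_CMD, input=prompt, capture_output=True, text=True, check=True
    )
    return json.loads(proc.stdout)


def run_condition(prompt, count_violations, runs=3, agent=call_agent):
    """Collect one violation count per run for a single condition.

    count_violations is a hypothetical callback that scores one agent output.
    """
    return [count_violations(agent(prompt)) for _ in range(runs)]
```

Passing `agent` as a parameter keeps the loop testable without invoking the CLI: a stub that returns canned outputs can stand in for `call_agent`.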