EXP-02 — Three Engineers, One Standard

EXP-01 proved the guide changes the outcome in a single run. EXP-02 proves that change is stable. Same task, 3 runs each, measuring whether the guide produces consistent output or just gets lucky once.

22.2× variance reduction
3/3 runs passed with guide
0/3 runs passed without guide

Violations across 3 runs

Without guide              With guide
─────────────────────      ──────────────────────────────
Run 1: 2 violations ✗      Run 1: 0 violations ✓ pass
Run 2: 1 violation  ✗      Run 2: 0 violations ✓ pass
Run 3: 1 violation  ✗      Run 3: 0 violations ✓ pass
Variance: 0.22             Variance: 0.00

Capabilities wired

Without guide              With guide
─────────────────────      ──────────────────────────────
(varies by run)            http-api   ✓ consistent
                           mysql      ✓ consistent
                           platform   ✓ consistent
                           bootstrap  ✓ consistent

Per-run breakdown (without guide)

Run     Dep   Fn   Anti-pattern   Arch   Total   Result
Run 1   0     0    1              1      2       FAIL
Run 2   0     0    1              0      1       FAIL
Run 3   0     0    1              0      1       FAIL

The uuid_v4_as_key anti-pattern appeared in every control run: without the guide, the agent reaches for the non-deterministic uuid.New() each time. The architecture violation appeared only in run 1, and that single extra violation accounts for all of the variance.
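The totals and the variance figure can be re-derived from the per-run breakdown. The snippet below is a minimal sketch rather than the experiment's own scoring code: the `CONTROL_RUNS` dictionary and the `summarize` helper are hypothetical names, the counts are copied from the table above, and it assumes that "pass" means zero violations. It also assumes the reported 0.22 is the population variance (dividing by n), since that is the convention that reproduces the published number; the sample variance of the same totals would be about 0.33.

```python
from statistics import pvariance

# Control-condition violation counts, copied from the per-run breakdown.
# Keys mirror the table's column headers; the dict itself is illustrative.
CONTROL_RUNS = {
    "Run 1": {"dep": 0, "fn": 0, "anti_pattern": 1, "arch": 1},
    "Run 2": {"dep": 0, "fn": 0, "anti_pattern": 1, "arch": 0},
    "Run 3": {"dep": 0, "fn": 0, "anti_pattern": 1, "arch": 0},
}


def summarize(runs):
    """Sum violations per run, then report variance and pass count."""
    totals = [sum(counts.values()) for counts in runs.values()]
    return {
        "totals": totals,
        # Population variance (divide by n). Assumed convention.
        "variance": pvariance(totals),
        # Assumption: a run passes only when it has zero violations.
        "passed": sum(1 for total in totals if total == 0),
    }


summary = summarize(CONTROL_RUNS)
print(summary["totals"], round(summary["variance"], 2), summary["passed"])
```

Feeding the with-guide totals (three runs with no violations) through the same helper gives a variance of 0 and three passes.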

Metric       Without guide   With guide
Violations   [2, 1, 1]       [0, 0, 0]
Variance     0.22            0.00
Pass rate    0/3             3/3

The guide doesn’t just make agents better on average; it makes them consistently correct.

  • Task: Same as EXP-01 — build an order management service, free-form
  • Agent: claude-sonnet-4-6 via claude -p --output-format json
  • Runs: 3 per condition
  • Primary metric: variance in violation count across runs
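The setup listed above could be driven by a small loop like the one below. This is a sketch under stated assumptions, not the project's actual harness: `build_command`, `run_once`, and `run_condition` are hypothetical helpers, the only CLI flags used are the two named in the list (`-p` and `--output-format json`), and model selection, guide injection, and violation scoring are deliberately left out because this page does not describe how they are done.

```python
import json
import subprocess


def build_command(prompt):
    """Build the CLI invocation named in the setup list."""
    return ["claude", "-p", prompt, "--output-format", "json"]


def run_once(prompt, runner=subprocess.run):
    """Run the agent once and return its parsed JSON output."""
    completed = runner(
        build_command(prompt),
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return json.loads(completed.stdout)


def run_condition(prompt, runs=3, runner=subprocess.run):
    """Repeat the same prompt `runs` times for one condition."""
    # `runner` is injectable so the loop can be exercised without the CLI.
    return [run_once(prompt, runner) for _ in range(runs)]
```

Calling `run_condition(task_prompt)` once per condition would yield three parsed outputs each, which can then be scored for violations; the task prompt itself is not reproduced here.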

→ Experiment Methodology — reproduction instructions

→ Artifacts on GitHub

→ EXP-03: Does it replicate on a different task?