EXP-10 — Governance Checkpoint
Every previous experiment compared guide vs no guide. This one isolates a single section inside the guide: the governance checkpoint — an instruction that tells the agent to pause and verify architecture rules before modifying files and after receiving tool results.
The pattern comes from Anthropic’s think tool research, which reported a 54% relative improvement in policy adherence when agents pause to verify rules at decision points.
What the agent produced
Greenfield: violations across 3 runs

| | Without checkpoint | With checkpoint |
|---|---|---|
| Run 1 | 2 violations | 1 violation |
| Run 2 | 0 violations | 1 violation |
| Run 3 | 2 violations | 1 violation |
| Mean | 1.33 | 1.00 |
| Var | 0.89 | 0.00 |

Feature addition: violations across 3 runs

| | Without checkpoint | With checkpoint |
|---|---|---|
| Run 1 | 0 violations | 0 violations |
| Run 2 | 0 violations | 0 violations |
| Run 3 | 0 violations | 0 violations |

Metrics
| | Without checkpoint | With checkpoint |
|---|---|---|
| Greenfield violations | [2, 0, 2] | [1, 1, 1] |
| Greenfield mean | 1.33 | 1.00 |
| Greenfield variance | 0.89 | 0.00 |
| Feature addition violations | [0, 0, 0] | [0, 0, 0] |
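
The greenfield figures can be reproduced from the raw counts. The sketch below assumes the reported variance is the population variance (dividing by n), since that is what yields 0.89 for [2, 0, 2]; the sample variance (dividing by n - 1) would be 1.33. The `summarize` helper and the dictionary keys are illustrative names, not part of the experiment's tooling.

```python
from statistics import mean, pvariance, variance

# Violation counts per run, copied from the metrics table.
greenfield = {
    "without_checkpoint": [2, 0, 2],
    "with_checkpoint": [1, 1, 1],
}


def summarize(counts):
    """Return (mean, population variance), each rounded to two decimals."""
    return round(mean(counts), 2), round(pvariance(counts), 2)


for condition, counts in greenfield.items():
    m, v = summarize(counts)
    print(f"{condition}: mean={m:.2f} var={v:.2f}")

# For comparison only: the sample variance of the no-checkpoint runs.
sample_var = round(variance(greenfield["without_checkpoint"]), 2)
print(f"sample variance without checkpoint: {sample_var:.2f}")
```

With only three runs per condition, the choice between population and sample variance changes the reported number noticeably, so it is worth stating which one a table uses.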
Finding
Two things happened, one expected and one not.
Variance dropped to zero. Without the checkpoint, the same agent on the same task produced 0 violations in one run and 2 in each of the other two. With the checkpoint, every run produced exactly 1 violation: a structural issue (missing component directory) that the checkpoint can’t influence. The mean barely moved (1.33 to 1.00); the larger effect was that the checkpoint made the agent's output consistent across all three runs.
Feature addition hit the ceiling. On an existing well-structured codebase, the agent produces 0 violations with or without the checkpoint. The existing code does the teaching — the guide and checkpoint add nothing measurable. This matches EXP-04 and EXP-07: brownfield with a conforming fixture leaves no room for improvement.
The checkpoint
80 tokens, added to every guide output:
## Governance Checkpoint
Before modifying any file, pause and verify:
1. List which architecture rules from this guide apply to the change you are about to make.
2. Check if the change introduces any pattern these rules explicitly prohibit.
3. If multiple rules conflict, state the conflict before proceeding.

After receiving tool results (test output, lint output, build errors),
re-check compliance before taking the next action.
Do not chain corrections without verifying each step against these rules.

The principle: telling agents what the rules are is necessary but not sufficient. Telling agents when to check (before acting and after feedback) is what produces consistency. This maps directly to Anthropic’s finding that the think tool’s effect was largest in “policy-heavy environments” with “sequential decision making where errors compound.”
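
To make the isolated variable concrete, here is a minimal sketch of how the two conditions could be built from one guide: both receive the full guide, and only one gets the checkpoint section appended. The names `CHECKPOINT`, `with_checkpoint`, and `build_conditions` are hypothetical and do not describe the experiment's actual tooling.

```python
# The checkpoint section, quoted from the text above.
CHECKPOINT = """## Governance Checkpoint
Before modifying any file, pause and verify:
1. List which architecture rules from this guide apply to the change you are about to make.
2. Check if the change introduces any pattern these rules explicitly prohibit.
3. If multiple rules conflict, state the conflict before proceeding.

After receiving tool results (test output, lint output, build errors),
re-check compliance before taking the next action.
Do not chain corrections without verifying each step against these rules.
"""


def with_checkpoint(guide: str) -> str:
    """Return the guide with the checkpoint section appended at most once."""
    if "## Governance Checkpoint" in guide:
        return guide  # already present, avoid duplicating the section
    return guide.rstrip("\n") + "\n\n" + CHECKPOINT


def build_conditions(guide: str) -> dict[str, str]:
    """Both conditions share the same guide; only the checkpoint differs."""
    return {
        "without_checkpoint": guide,
        "with_checkpoint": with_checkpoint(guide),
    }
```

Keeping the guide text identical across conditions is what lets any difference in violations be attributed to the checkpoint section alone.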
- Task (greenfield): Build a notification service in Go — email and SMS, HTTP API
- Task (feature addition): Add a discount system to an existing orders service
- Agent: claude-sonnet-4-6 via claude-code CLI
- Runs: 3 per condition per fixture (12 total)
- Variable isolated: Governance checkpoint section (both conditions receive the full guide)
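
The run count follows from the design: 2 fixtures × 2 conditions × 3 runs = 12. A short sketch that enumerates the matrix, with illustrative labels (`FIXTURES`, `CONDITIONS`, `run_matrix`) that are assumptions rather than names from the experiment:

```python
from itertools import product

FIXTURES = ("greenfield", "feature_addition")
CONDITIONS = ("without_checkpoint", "with_checkpoint")
RUNS_PER_CELL = 3


def run_matrix():
    """List every (fixture, condition, run number) combination."""
    return list(product(FIXTURES, CONDITIONS, range(1, RUNS_PER_CELL + 1)))


matrix = run_matrix()
print(len(matrix))  # 12 runs in total
```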