
EXP-10 — Governance Checkpoint

Every previous experiment compared guide vs no guide. This one isolates a single section inside the guide: the governance checkpoint — an instruction that tells the agent to pause and verify architecture rules before modifying files and after receiving tool results.

The pattern comes from Anthropic’s think tool research, which showed 54% improvement in policy adherence when agents pause to verify rules at decision points.

  • 0.89 → 0.00 variance (greenfield)
  • 25% fewer violations
  • 80 tokens added

Greenfield — violations across 3 runs

Without checkpoint        With checkpoint
─────────────────────     ─────────────────────
Run 1: 2 violations       Run 1: 1 violation
Run 2: 0 violations       Run 2: 1 violation
Run 3: 2 violations       Run 3: 1 violation
Mean:  1.33               Mean:  1.00
Var:   0.89               Var:   0.00

Feature addition — violations across 3 runs

Without checkpoint        With checkpoint
─────────────────────     ─────────────────────
Run 1: 0 violations       Run 1: 0 violations
Run 2: 0 violations       Run 2: 0 violations
Run 3: 0 violations       Run 3: 0 violations
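The mean and variance figures in the tables are population statistics (divide by n, not n − 1). A quick check of the greenfield numbers, with helper functions of our own:

```python
# Verify the reported mean/variance for the greenfield run counts.
# Run data ([2, 0, 2] and [1, 1, 1]) comes from the tables above;
# the helper names are ours.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance: divide by n, matching the write-up's figures.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

without = [2, 0, 2]
with_cp = [1, 1, 1]

print(round(mean(without), 2), round(variance(without), 2))  # → 1.33 0.89
print(round(mean(with_cp), 2), round(variance(with_cp), 2))  # → 1.0 0.0
```

Python's `statistics.pvariance` would give the same result; the point is only that 0.89 → 0.00 reflects the spread collapsing, not the mean.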

Two things happened, one expected and one not.

Variance dropped to zero. Without the checkpoint, the same agent on the same task produced 0 violations one run and 2 the next. With the checkpoint, every run produced exactly 1 violation — a structural issue (missing component directory) the checkpoint can’t influence. The checkpoint didn’t just reduce violations; it made the agent deterministic on this task.

Feature addition hit the ceiling. On an existing well-structured codebase, the agent produces 0 violations with or without the checkpoint. The existing code does the teaching — the guide and checkpoint add nothing measurable. This matches EXP-04 and EXP-07: brownfield with a conforming fixture leaves no room for improvement.

80 tokens, added to every guide output:

## Governance Checkpoint
Before modifying any file, pause and verify:
1. List which architecture rules from this guide apply to the change
you are about to make.
2. Check if the change introduces any pattern these rules explicitly prohibit.
3. If multiple rules conflict, state the conflict before proceeding.
After receiving tool results (test output, lint output, build errors),
re-check compliance before taking the next action.
Do not chain corrections without verifying each step against these rules.
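Mechanically, the condition is just string concatenation: the section above is appended to whatever the guide generator produces. A minimal sketch, where `with_checkpoint` is our own name and `CHECKPOINT` abbreviates the text above:

```python
# Append the governance checkpoint section to the base guide before it
# reaches the agent. CHECKPOINT is an abbreviated stand-in for the
# ~80-token section shown above, not the exact experiment artifact.

CHECKPOINT = (
    "## Governance Checkpoint\n"
    "Before modifying any file, pause and verify:\n"
    "1. List which architecture rules from this guide apply.\n"
    "2. Check if the change introduces a prohibited pattern.\n"
    "3. If rules conflict, state the conflict before proceeding.\n"
    "After tool results, re-check compliance before the next action.\n"
)

def with_checkpoint(base_guide: str) -> str:
    # Normalize trailing whitespace so the section lands as its own block.
    return base_guide.rstrip() + "\n\n" + CHECKPOINT

print(with_checkpoint("# Architecture Guide\n...").endswith(CHECKPOINT))  # → True
```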

The principle: telling agents what the rules are is necessary but not sufficient. Telling agents when to check — before acting and after feedback — is what produces consistency. This maps directly to Anthropic’s finding that the think tool’s effect was largest in “policy-heavy environments” with “sequential decision making where errors compound.”
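The two check sites can be pictured as hooks in an agent loop — one before any write, one after tool feedback. A hypothetical sketch; `violates`, `agent_step`, and the rule string are illustrative stubs, not the experiment's harness (in the real setup the model performs the check itself, prompted by the text above):

```python
# Sketch of the two checkpoint call sites in a single agent step.
# All names here are stand-ins for illustration.

RULES = ("no handler may import the storage package directly",)

def violates(artifact: str, rules) -> list:
    # Stub for the model's self-check against the guide's rules.
    return [r for r in rules if "storage import in handler" in artifact]

def agent_step(planned_change: str, tool_output: str) -> list:
    events = []
    # Checkpoint 1: before modifying any file.
    if violates(planned_change, RULES):
        events.append("revise before write")
    events.append("apply change")
    # Checkpoint 2: after tool results, before chaining the next fix.
    if violates(tool_output, RULES):
        events.append("revise before next action")
    return events

print(agent_step("storage import in handler", "tests pass"))
# → ['revise before write', 'apply change']
```

The design point is that verification is tied to fixed positions in the loop, not left to the agent's discretion — which is what the variance collapse suggests matters.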

  • Task (greenfield): Build a notification service in Go — email and SMS, HTTP API
  • Task (feature addition): Add a discount system to an existing orders service
  • Agent: claude-sonnet-4-6 via claude-code CLI
  • Runs: 3 per condition per fixture (12 total)
  • Variable isolated: Governance checkpoint section (both conditions receive the full guide)

→ Experiment Methodology — reproduction instructions

→ Artifacts on GitHub

→ Back to all experiments