# Methodology
Every verikt experiment follows the same methodology. This page describes how experiments are structured, what artifacts they produce, and how to reproduce them.
## Principles

- Reproducible — anyone cloning the repo can run the same experiment and get comparable results
- Multi-agent — the same experiment runs against any LLM, not just Claude
- Honest — null results, failures, and partial results are recorded and published as-is
- Auditable — every input is captured verbatim alongside every output
## Fixture delivery modes

The mode determines how (or whether) the agent sees existing code. This is the most important design decision per experiment.
### Mode A — Greenfield

No existing code. The agent builds from scratch. The only inputs are the system prompt and the task prompt.
`verikt check` runs on the full generated project.
Used for: EXP-01, EXP-02, EXP-03, EXP-05, EXP-08.
### Mode B — Embedded fixture

The fixture files are read from a committed fixture directory and embedded verbatim in the prompt. The agent sees the codebase as text. No tool access.
`verikt check --diff HEAD` runs after overlaying the agent’s output on the fixture. Only violations in the agent’s changed files count.
This mode is reproducible across all agents — every LLM can receive text. The embedded content is deterministic: same git SHA = same prompt.
Used for: EXP-04, EXP-07, EXP-09.
### Mode C — Tool access

The fixture is copied to a temp directory. The agent gets Read/Glob/Grep/Edit/Write tool access and works directly in that copy. Used for large codebases that can’t be embedded in a prompt.
`verikt check --diff HEAD` runs after the agent finishes. Only violations in agent-touched files count.
Currently Claude-only — the `--allowedTools` mechanism is specific to `claude -p`.
Used for: EXP-06c, EXP-06d.
Every experiment runs 3 times per condition. For a standard contrast experiment (control vs test): 6 total runs. For a 2×2 like EXP-05: 12 total runs.
Single-run results may be published with an explicit n=1 note. The harness always targets 3 regardless.
## Agent abstraction

Every stateless experiment (Mode A and B) runs through a single `Agent` interface:
```go
type Agent interface {
	ID() string // "claude-sonnet-4-6", "gpt-4o", "gemini-2.0-flash"
	Call(ctx context.Context, systemPrompt, userPrompt string) (AgentResponse, error)
}
```

Two implementations:
- `claude-code` (default) — calls `claude -p` (Claude Code CLI pipe mode). No API key needed.
- `openAICompatAgent` — hits any OpenAI-compatible chat completions endpoint. One implementation covers Anthropic, OpenAI, Google, and Ollama.
Agent selection via environment variables:
```sh
VERIKT_EXPERIMENT_AGENT=1                   # opt-in guard
VERIKT_EXPERIMENT_VENDOR=claude-code        # claude-code | anthropic | openai | google | ollama
VERIKT_EXPERIMENT_MODEL=claude-sonnet-4-6   # model ID
```

| Vendor | Endpoint |
|---|---|
| `claude-code` | `claude -p` CLI (no API key) |
| `anthropic` | https://api.anthropic.com/v1 |
| `openai` | https://api.openai.com/v1 |
| `google` | https://generativelanguage.googleapis.com/v1beta/openai |
| `ollama` | http://localhost:11434/v1 |
Cost is not tracked. Tokens are recorded per run — cost can be computed offline from token counts and model pricing.
## Artifacts

Every run produces a complete artifact set:
```
experiments/
  EXP-01/
    manifest.yaml          ← hand-written experiment definition
    results/
      claude-sonnet-4-6_control_run1_2026-03-15/
        manifest.json      ← verbatim inputs: prompts, agent, fixture SHA
        response.txt       ← raw agent output, unmodified
        files/             ← generated files parsed from response
        verikt-check.json  ← full verikt check output
        metrics.json       ← violations, tokens, duration, pass/fail
```

### manifest.yaml

Hand-written, one per experiment. Defines the hypothesis, conditions, fixture, and metrics:
```yaml
id: EXP-04
name: "New feature, existing service"
hypothesis: >
  Without guide, agent places cancellation logic in the wrong layer.
  With guide, it follows domain/port/service/adapter boundaries.
type: feature-addition
fixture: orders-service
fixture_delivery: embedded
runs: 3
conditions:
  - id: control
    guide: false
  - id: test
    guide: true
metrics:
  primary: violations_arch
  secondary: [violations_total, passed, hexagonal_shape]
```

### manifest.json

Generated per run. Captures the exact inputs — prompts, model, fixture SHA:
```json
{
  "experiment_id": "EXP-04",
  "condition": "control",
  "run": 1,
  "agent": { "id": "claude-sonnet-4-6" },
  "fixture": { "path": "orders-service", "sha256": "a3f2c1..." },
  "system_prompt": "You are a Go engineer...",
  "task_prompt": "Add order cancellation...",
  "full_prompt_sha256": "b9d4e2..."
}
```

`full_prompt_sha256` is a sha256 of the complete prompt sent to the model. Anyone reproducing the experiment can verify they’re sending the same prompt.
### metrics.json

Generated per run. The measured outcome:
```json
{
  "violations_dep": 0,
  "violations_fn": 0,
  "violations_ap": 0,
  "violations_arch": 2,
  "violations_total": 2,
  "passed": false,
  "hexagonal_shape": false,
  "files_generated": 4,
  "input_tokens": 1247,
  "output_tokens": 3421,
  "duration_ms": 34500
}
```

## Fixture identity

Each fixture directory is identified by a sha256 of its contents (sorted file paths + file content). This is more stable than a git SHA — it doesn’t change when unrelated files change, and it’s branch-independent.
## Experiment status

| Status | Meaning |
|---|---|
| `complete` | Run with correct methodology; results valid |
| `needs-rerun` | Run, but methodology was wrong (see note) |
| `partial` | Some agents run, not all |
| `not-run` | Never executed |
## Experiment map

| Experiment | Type | Mode | Status |
|---|---|---|---|
| EXP-01 | greenfield | A | complete |
| EXP-02 | greenfield | A | complete |
| EXP-03 | greenfield | A | complete |
| EXP-04 | feature-addition | B | complete (null result) |
| EXP-05 | greenfield | A | complete |
| EXP-06a/b | brownfield-stateless | A | complete (intentional null) |
| EXP-06c | brownfield | C | complete |
| EXP-06d | brownfield | C | complete |
| EXP-07 | feature-addition | B | complete (null result) |
| EXP-08 | greenfield | A | complete |
| EXP-09 | feature-addition | B | complete |
## Reproducing a run

```sh
# Build verikt
go build -o ./bin/verikt ./cmd/verikt/
```
```sh
# Run via Claude Code CLI (default — no API key needed)
VERIKT_EXPERIMENT_AGENT=1 \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with Anthropic API directly
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=anthropic \
VERIKT_EXPERIMENT_MODEL=claude-sonnet-4-6 \
ANTHROPIC_API_KEY=your-key \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with GPT-4o
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=openai \
VERIKT_EXPERIMENT_MODEL=gpt-4o \
OPENAI_API_KEY=your-key \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with a local model via Ollama
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=ollama \
VERIKT_EXPERIMENT_MODEL=llama3.1:70b \
go test -run TestEXP01 -v -timeout 600s ./internal/engineclient/experiment/
```

Replace `TestEXP01` with the test function for the experiment you want to run. Results are written to `experiments/EXP-XX/results/` and `experiments/index.json` is updated. Commit both.