
# Methodology

Every verikt experiment follows the same methodology. This page describes how experiments are structured, what artifacts they produce, and how to reproduce them. Four principles govern the design:

  • Reproducible — anyone cloning the repo can run the same experiment and get comparable results
  • Multi-agent — the same experiment runs against any LLM, not just Claude
  • Honest — null results, failures, and partial results are recorded and published as-is
  • Auditable — every input is captured verbatim alongside every output

## Modes

Each experiment runs in one of three modes. The mode determines how (or whether) the agent sees existing code; it is the most important design decision per experiment.

### Mode A

No existing code. The agent builds from scratch. The only inputs are the system prompt and the task prompt.

`verikt check` runs on the full generated project.

Used for: EXP-01, EXP-02, EXP-03, EXP-05, EXP-08.

### Mode B

The fixture files are read from a committed fixture directory and embedded verbatim in the prompt. The agent sees the codebase as text and has no tool access.

`verikt check --diff HEAD` runs after overlaying the agent's output on the fixture. Only violations in the agent's changed files count.

This mode is reproducible across all agents, since every LLM can receive text. The embedded content is deterministic: the same git SHA produces the same prompt.

Used for: EXP-04, EXP-07, EXP-09.

### Mode C

The fixture is copied to a temp directory. The agent gets Read/Glob/Grep/Edit/Write tool access and works directly in that copy. This mode exists for large codebases that can't be embedded in a prompt.

`verikt check --diff HEAD` runs after the agent finishes. Only violations in agent-touched files count.

Currently Claude-only: the `--allowedTools` mechanism is specific to `claude -p`.

Used for: EXP-06c, EXP-06d.

## Runs

Every experiment runs 3 times per condition. A standard contrast experiment (control vs. test) therefore totals 6 runs; a 2×2 design like EXP-05 totals 12.

Single-run results may be published with an explicit n=1 note. The harness always targets 3 regardless.

## Agents

Every stateless experiment (Modes A and B) runs through a single `Agent` interface:

```go
type Agent interface {
    ID() string // "claude-sonnet-4-6", "gpt-4o", "gemini-2.0-flash"
    Call(ctx context.Context, systemPrompt, userPrompt string) (AgentResponse, error)
}
```

Two implementations:

  • `claude-code` (default) — calls `claude -p` (Claude Code CLI pipe mode). No API key needed.
  • `openAICompatAgent` — hits any OpenAI-compatible chat completions endpoint. One implementation covers Anthropic, OpenAI, Google, and Ollama.

Agent selection via environment variables:

```sh
VERIKT_EXPERIMENT_AGENT=1                  # opt-in guard
VERIKT_EXPERIMENT_VENDOR=claude-code       # claude-code | anthropic | openai | google | ollama
VERIKT_EXPERIMENT_MODEL=claude-sonnet-4-6  # model ID
```

| Vendor | Endpoint |
| --- | --- |
| `claude-code` | `claude -p` CLI (no API key) |
| `anthropic` | https://api.anthropic.com/v1 |
| `openai` | https://api.openai.com/v1 |
| `google` | https://generativelanguage.googleapis.com/v1beta/openai |
| `ollama` | http://localhost:11434/v1 |

Cost is not tracked. Tokens are recorded per run — cost can be computed offline from token counts and model pricing.

## Artifacts

Every run produces a complete artifact set:

```
experiments/
  EXP-01/
    manifest.yaml            ← hand-written experiment definition
    results/
      claude-sonnet-4-6_control_run1_2026-03-15/
        manifest.json        ← verbatim inputs: prompts, agent, fixture SHA
        response.txt         ← raw agent output, unmodified
        files/               ← generated files parsed from response
        verikt-check.json    ← full verikt check output
        metrics.json         ← violations, tokens, duration, pass/fail
```

### manifest.yaml

Hand-written, one per experiment. It defines the hypothesis, conditions, fixture, and metrics:

```yaml
id: EXP-04
name: "New feature, existing service"
hypothesis: >
  Without guide, agent places cancellation logic in the wrong layer.
  With guide, it follows domain/port/service/adapter boundaries.
type: feature-addition
fixture: orders-service
fixture_delivery: embedded
runs: 3
conditions:
  - id: control
    guide: false
  - id: test
    guide: true
metrics:
  primary: violations_arch
  secondary: [violations_total, passed, hexagonal_shape]
```

### manifest.json

Generated per run. It captures the exact inputs: prompts, model, fixture SHA:

```json
{
  "experiment_id": "EXP-04",
  "condition": "control",
  "run": 1,
  "agent": { "id": "claude-sonnet-4-6" },
  "fixture": { "path": "orders-service", "sha256": "a3f2c1..." },
  "system_prompt": "You are a Go engineer...",
  "task_prompt": "Add order cancellation...",
  "full_prompt_sha256": "b9d4e2..."
}
```

`full_prompt_sha256` is the SHA-256 of the complete prompt sent to the model, so anyone reproducing the experiment can verify they're sending the same prompt.

### metrics.json

Generated per run. The measured outcome:

```json
{
  "violations_dep": 0,
  "violations_fn": 0,
  "violations_ap": 0,
  "violations_arch": 2,
  "violations_total": 2,
  "passed": false,
  "hexagonal_shape": false,
  "files_generated": 4,
  "input_tokens": 1247,
  "output_tokens": 3421,
  "duration_ms": 34500
}
```

### Fixture hashing

Each fixture directory is identified by a SHA-256 of its contents (sorted file paths plus file content). This is more stable than a git SHA: it doesn't change when unrelated files change, and it's branch-independent.

## Status

| Status | Meaning |
| --- | --- |
| complete | Run with correct methodology, results valid |
| needs-rerun | Run but methodology was wrong (see note) |
| partial | Some agents run, not all |
| not-run | Never executed |
| Experiment | Type | Mode | Status |
| --- | --- | --- | --- |
| EXP-01 | greenfield | A | complete |
| EXP-02 | greenfield | A | complete |
| EXP-03 | greenfield | A | complete |
| EXP-04 | feature-addition | B | complete (null result) |
| EXP-05 | greenfield | A | complete |
| EXP-06a/b | brownfield-stateless | A | complete (intentional null) |
| EXP-06c | brownfield | C | complete |
| EXP-06d | brownfield | C | complete |
| EXP-07 | feature-addition | B | complete (null result) |
| EXP-08 | greenfield | A | complete |
| EXP-09 | feature-addition | B | complete |
## Reproducing an experiment

```sh
# Build verikt
go build -o ./bin/verikt ./cmd/verikt/

# Run via Claude Code CLI (default — no API key needed)
VERIKT_EXPERIMENT_AGENT=1 \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with Anthropic API directly
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=anthropic \
VERIKT_EXPERIMENT_MODEL=claude-sonnet-4-6 \
ANTHROPIC_API_KEY=your-key \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with GPT-4o
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=openai \
VERIKT_EXPERIMENT_MODEL=gpt-4o \
OPENAI_API_KEY=your-key \
go test -run TestEXP01 -v -timeout 300s ./internal/engineclient/experiment/

# Run with a local model via Ollama
VERIKT_EXPERIMENT_AGENT=1 \
VERIKT_EXPERIMENT_VENDOR=ollama \
VERIKT_EXPERIMENT_MODEL=llama3.1:70b \
go test -run TestEXP01 -v -timeout 600s ./internal/engineclient/experiment/
```

Replace `TestEXP01` with the test function for the experiment you want to run. Results are written to `experiments/EXP-XX/results/` and `experiments/index.json` is updated. Commit both.