The Pain
The reviewer’s email is two lines long. “We could not reproduce Table 3 from the replication archive; the airport-zone elasticities differ in the third decimal. Please advise.” You open the archive you submitted four months ago, the one you were certain ran clean, and you cannot make it run at all. The script reaches for a warehouse that has been rebuilt twice since, a column that got renamed in the C2 incident, a random seed you set in a notebook cell that no longer exists. Somewhere between the result and the archive, the result stopped being a thing the archive produces and became a thing you remember producing.
The honest fix is the one no project ever budgets for: every few weeks, someone takes the replication package into a clean room — a fresh machine, an empty directory, nothing from your shell history — and rebuilds the whole analysis from scratch, then checks that the numbers that come out are the numbers you published. It is dull, exacting work, the kind a careful lab assigns to a person whose only job is to not trust the lab. You have never had that person. You have had the version of yourself who, at submission, was too relieved to be done to go back and prove that done meant reproducible. The two are not the same, and the gap between them is exactly where a reviewer lives.
Why / When
Everything in this course so far has been interactive: you at the keyboard, the agent answering, judgment passing back and forth. This lesson removes the keyboard. An agent invoked headlessly is a function — given a prompt and a working directory, it runs to completion with no human in the loop and returns a structured result a program can read: a JSON object, an exit code, a verdict.
That single change is what makes the next two lessons possible.
A function can be called — by a continuous-integration runner on every
pull request, by a scheduler at 3 a.m. (E2), by a replication package a
stranger downloads (the whole point of E3). The pipeline stage this
serves is not analysis; it is everything that wraps the analysis —
the gates, the reruns, the proofs that the work holds. The lab role it
absorbs is the reproducibility checker: the person who takes the archive
into a clean room and proves, mechanically, that the published numbers
fall out of it. A function does not get relieved and skip the check. It
returns pass or it returns fail, and it does it the same way every
time.
Contrary winds
Not for: a quick interactive question you'll read and act on yourself — wrapping a one-off lookup in a structured-output harness is ceremony, not reproducibility.
Mechanics
Both tools expose the same two surfaces for non-interactive use: a command-line headless mode that takes a prompt and emits structured output, and an SDK for embedding the agent in your own program. The shared shape first; the dialects below.
A function has a contract
An interactive session is a conversation; a headless run is a function call, and a function without a typed return is a function you cannot build on. Three properties separate a callable agent from a transcript you happen to have automated:
- Structured output — the run emits machine-readable result data (JSON), not prose you re-parse with a regex. The shape is the contract; downstream code reads fields, not sentences.
- A determinate exit — the process exits non-zero on failure, so a caller (a shell, a CI step) can branch on success without reading the output at all.
- Resumability — a long job that dies partway can be continued from where it stopped, by session id, rather than restarted from nothing.
Hold the structured-output property; the reproducibility self-test below is built entirely on it, and the CI gates after that are just callers.
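A minimal Python sketch of a caller that enforces the first two properties, assuming the headless command prints a single JSON object on stdout. The names call_headless and HeadlessRunError are hypothetical and belong to neither tool:

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


class HeadlessRunError(RuntimeError):
    """Raised when a headless run breaks its contract."""


def call_headless(
    argv: List[str], cwd: Optional[str] = None, timeout: float = 600.0
) -> Dict[str, Any]:
    """Run a headless command and return its JSON result as a dict.

    Fails closed: a non-zero exit or unparseable output raises.
    """
    proc = subprocess.run(
        argv, cwd=cwd, capture_output=True, text=True, timeout=timeout
    )
    # Determinate exit: branch on the exit code before reading any output.
    if proc.returncode != 0:
        raise HeadlessRunError(f"exit {proc.returncode}: {proc.stderr.strip()}")
    # Structured output: the result must be a JSON object, not prose.
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise HeadlessRunError(f"output is not JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise HeadlessRunError("expected a JSON object at the top level")
    return result
```

A caller would pass the full command as a list, for example call_headless(["claude", "-p", prompt, "--output-format", "json"], cwd=clone_dir), and read fields from the returned dict rather than sentences from a transcript.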
Claude Code
The headless entry point is the -p (print) flag: a prompt in, a result
out, no session UI. Ask for JSON and you get a typed envelope.
    claude -p "Rebuild the warehouse from scripts/ and report row counts per table" \
      --output-format json
The --output-format json envelope carries the agent’s final message,
the session id, and run metadata (tokens, cost, turn count) as fields;
parse it with jq, not a regex. Two flags make it production-grade:
- --output-format stream-json emits one JSON object per event for long runs you want to watch live (the CI log, a progress bar).
- --resume <session-id> continues a run that timed out or that you deliberately checkpointed; the session id from a prior run’s envelope is the handle.
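A sketch of reading that event stream from Python, assuming each event arrives as one JSON object per line and that the closing event carries "type": "result" (an assumption to check against the events your version actually emits). iter_events and final_result are hypothetical helper names:

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def final_result(lines: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return the last event whose type is "result", or None if none arrived."""
    last = None
    for event in iter_events(lines):
        if event.get("type") == "result":
            last = event
    return last
```

A None from final_result means the stream ended before a result event, which is the case a resume exists for.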
For embedding the agent inside a program — a replication harness, a custom gate — reach for the Agent SDK (Python and TypeScript). The SDK is the same engine the CLI wraps, exposing the run as an async call that yields structured messages, so your harness can inspect each step and enforce its own contract on the result:
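    import json
    from claude_agent_sdk import ClaudeAgentOptions, ResultMessage, query  # the Agent SDK
    async def rebuild(workdir: str) -> dict:
        options = ClaudeAgentOptions(cwd=workdir, permission_mode="acceptEdits")
        prompt = (
            "Rebuild warehouse.duckdb from scripts/, then print the row "
            "count of every table as JSON."
        )
        # query() yields structured messages; the ResultMessage closes the run.
        async for message in query(prompt=prompt, options=options):
            if isinstance(message, ResultMessage):
                if message.is_error or message.result is None:
                    raise RuntimeError("rebuild did not complete")
                # The contract: a dict your harness asserts against.
                # json.loads raises if the agent answered in prose instead.
                return json.loads(message.result)
        raise RuntimeError("run ended without a result message")
Codex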
The headless entry point is the exec subcommand: a prompt in, a result
out, no session UI. Ask for JSON and you get a typed envelope.
    codex exec "Rebuild the warehouse from scripts/ and report row counts per table" \
      --json
The --json envelope carries the agent’s final message, the thread id,
and run metadata (tokens, cost, turn count) as fields; parse it with
jq, not a regex. Two affordances make it production-grade:
- the streamed form emits one JSON object per event for long runs you want to watch live (the CI log, a progress bar).
- exec resume <thread-id> continues a run that timed out or that you deliberately checkpointed; the thread id from a prior run’s envelope is the handle.
For embedding the agent inside a program — a replication harness, a custom gate — reach for the Codex SDK (TypeScript). The SDK is the same engine the CLI wraps, exposing the run as an async call that yields structured events, so your harness can inspect each step and enforce its own contract on the result:
    import { Codex } from "@openai/codex-sdk";
    async function rebuild(workdir: string): Promise<unknown> {
      const codex = new Codex({ cwd: workdir, approvalMode: "auto-edit" });
      const thread = await codex.startThread();
      const result = await thread.run(
        "Rebuild warehouse.duckdb from scripts/, then print the row count " +
          "of every table as JSON.",
      );
      return JSON.parse(result.finalResponse); // the contract: typed result
    }
The reproducibility self-test
Here is the research payoff, and it is the same in both tools because it is a discipline, not a feature. Point a headless run at a fresh clone of the repository — an empty directory, nothing from your machine — and ask it to rebuild the analysis from the replication scripts alone, then compare the numbers it produces against the numbers you published. The run emits one structured verdict:
    {
      "verdict": "pass",
      "rebuilt_from": "scripts/replicate.sh on a fresh clone (no warehouse cached)",
      "checks": [
        { "table": "trips_raw", "expected_rows": 69804771, "got_rows": 69804771, "ok": true },
        { "metric": "precip_elasticity_jfk", "published": 0.0087, "rebuilt": 0.0087, "abs_diff": 0.0, "ok": true },
        { "metric": "precip_elasticity_lga", "published": 0.0091, "rebuilt": 0.0091, "abs_diff": 0.0, "ok": true }
      ]
    }
The verdict is a function of the clean room, not of your memory. If a
script reaches for a warehouse that only exists on your laptop, the
fresh clone has no warehouse and the rebuild fails loudly, here, weeks
before a reviewer finds it. This is precisely F1’s make replicate —
the reproducibility checker, made mechanical: a clean-room rebuild that
returns pass or fail, run as a function, trusting nothing it cannot
regenerate from the committed scripts.
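One way the comparison step might look in Python. This is a sketch under stated assumptions, not the course’s check_replication.py: build_verdict and its tol parameter are hypothetical, and the output shape copies the metric entries of the example verdict above.

```python
import math
from typing import Any, Dict, List


def build_verdict(
    published: Dict[str, float], rebuilt: Dict[str, float], tol: float = 0.0
) -> Dict[str, Any]:
    """Compare rebuilt metrics against published ones and return a verdict.

    A metric that is missing or NaN in the rebuild counts as a failure,
    and an empty set of checks never passes.
    """
    checks: List[Dict[str, Any]] = []
    for metric, pub in sorted(published.items()):
        got = rebuilt.get(metric)
        if got is None or math.isnan(got):
            checks.append({"metric": metric, "published": pub,
                           "rebuilt": None, "abs_diff": None, "ok": False})
            continue
        diff = abs(got - pub)
        checks.append({"metric": metric, "published": pub,
                       "rebuilt": got, "abs_diff": diff, "ok": diff <= tol})
    verdict = "pass" if checks and all(c["ok"] for c in checks) else "fail"
    return {"verdict": verdict, "checks": checks}
```

The default tolerance of zero demands exact agreement. Loosening it is a methodological choice worth writing down, since the reviewer in the opening was reading the third decimal.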
The shell that wraps it is language-neutral — the same script whether the analysis inside is Python or R — so it belongs to the project, not to a language:
    set -euo pipefail
    work="$(mktemp -d)"
    git clone --depth 1 "file://$PWD/.git" "$work"   # a true fresh clone
    trap 'rm -rf "$work"' EXIT
    # the agent rebuilds in the clean room and writes the structured verdict
    cd "$work"
    bash scripts/rebuild_warehouse.sh                # no cached warehouse exists here
    python scripts/check_replication.py \
      --published "$OLDPWD/results/published_estimates.json" \
      --out results/replication_verdict.json
    test "$(jq -r .verdict results/replication_verdict.json)" = pass
The test on the last line is the determinate exit: the script
succeeds only when the verdict is pass, so any caller — a CI runner,
a scheduler — branches on it without reading a word.
CI gates: the function as a merge gate
A reproducibility verdict is only worth producing if something acts on
it. The caller that matters most is continuous integration: every pull
request that touches src/ runs the project’s gates headlessly and
blocks the merge if they fail. This is C2’s lesson at the scale of
the whole repository — a warning trains contributors to ignore it; a
gate that blocks the merge cannot be ignored. Two gates run on every
such PR, both as headless functions:
- the C2 contract suite — does every transform still honor its data contract;
- the D4 referee — the adversarial review pass, run as an isolated subagent over the PR’s diff, flagging the leakage and metric-gaming that no contract can name.
Evaluation becomes a merge gate. The dialects differ only in where the workflow lives and who triggers it.
Claude Code
The gate is a GitHub Actions workflow built on claude-code-action,
which runs the agent headlessly inside the runner. It triggers on any PR
touching src/, runs the two gates as headless calls, and fails the
check (blocking the merge) on a non-zero verdict:
    on:
      pull_request:
        paths: ["src/**"]
    jobs:
      contracts-and-referee:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: anthropics/claude-code-action@v1
            with:
              prompt: |
                Run scripts/validate_contracts.py over the warehouse and
                the D4 referee over this PR's diff. Emit
                results/gate_verdict.json with
                {"contracts": "...", "referee": "..."} and exit non-zero
                if either fails.
The action checks out the PR, runs the agent headlessly with that prompt, and the job’s exit status is the merge gate. The referee’s findings post back to the PR as a review comment, so a flagged spec is visible in the diff, not buried in a log.
Codex
The gate is a GitHub Action that runs the agent headlessly inside the
runner. It triggers on any PR touching src/, runs the two gates as
headless calls, and fails the check (blocking the merge) on a non-zero
verdict:
    on:
      pull_request:
        paths: ["src/**"]
    jobs:
      contracts-and-referee:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: openai/codex-action@v1
            with:
              prompt: |
                Run scripts/validate_contracts.py over the warehouse and
                the D4 referee over this PR's diff. Emit
                results/gate_verdict.json with
                {"contracts": "...", "referee": "..."} and exit non-zero
                if either fails.
There is a second, lighter trigger: mention the reviewer on a pull request and it performs the referee pass in the review thread, a human-initiated gate rather than a scheduled one. The two compose: the workflow blocks the merge mechanically; the on-demand PR review puts the referee’s findings inline where a maintainer reads them.
The gates are the same suite
What runs inside both workflows is identical — validate_contracts.py
from C2 and the referee skill from D4, called as functions. The CI
integration adds no new judgment; it adds a caller that runs the
existing judgment on every PR, headlessly, with the authority to block.
That is the whole move of this lesson: the work you already wrote
becomes a function, and a function can be called by something other than
you.
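As a sketch of how the two gate results might fold into one blocking exit code: the workflow prompts elide the values written to results/gate_verdict.json, so the strings pass and fail here are an assumption, and gate_verdict is a hypothetical helper, not part of validate_contracts.py or the referee skill.

```python
from typing import Dict, Mapping, Tuple

# The two gates named in this lesson; a gate that never reported is not a pass.
GATES = ("contracts", "referee")


def gate_verdict(results: Mapping[str, str]) -> Tuple[Dict[str, str], int]:
    """Return the gate_verdict.json payload and the process exit code.

    The exit code is 0 only when every named gate reported "pass".
    """
    payload = {name: results.get(name, "missing") for name in GATES}
    exit_code = 0 if all(v == "pass" for v in payload.values()) else 1
    return payload, exit_code
```

The calling script would write the payload to results/gate_verdict.json and pass the code to sys.exit, so the CI job fails whenever either gate does.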
Guided Run — The Clean Room
    claude -p "run scripts/replicate.sh and report the verdict" --output-format json
Field Assignment
Artifact make check-e1 passes — headless replication verdict is `pass`, both CI gates wired
Turn the analysis into a function and prove it reproduces, then make the
proof a merge gate. The deliverable is a pass verdict from a clean
room and two gates that block.
Claude Code
- Write scripts/replicate.sh: clone the repo into a temp dir, rebuild the warehouse from scripts alone (no cached warehouse), and run the replication check that emits results/replication_verdict.json.
- Invoke the rebuild headlessly with claude -p … --output-format json and confirm the envelope parses (final message, session id, cost) and that the verdict field reads pass.
- Add .github/workflows/gates.yml on claude-code-action: on any PR touching src/, run the C2 contract suite and the D4 referee headlessly; fail the check if either gate fails.
- Open a throwaway PR that breaks a contract on purpose and confirm the gate goes red; a gate you have not watched fail is a gate you do not have.
- File the run in journal/: the verdict, the cost, and one thing the fresh clone needed that your laptop had been quietly supplying. Then make check-e1.
Codex
- Write scripts/replicate.sh: clone the repo into a temp dir, rebuild the warehouse from scripts alone (no cached warehouse), and run the replication check that emits results/replication_verdict.json.
- Invoke the rebuild headlessly with codex exec … --json and confirm the envelope parses (final message, thread id, cost) and that the verdict field reads pass. Resume it once with exec resume to prove the run is a continuable function.
- Add .github/workflows/gates.yml on the Codex Action: on any PR touching src/, run the C2 contract suite and the D4 referee headlessly; fail the check if either gate fails.
- Open a throwaway PR that breaks a contract on purpose and confirm the gate goes red; a gate you have not watched fail is a gate you do not have.
- File the run in journal/: the verdict, the cost, and one thing the fresh clone needed that your laptop had been quietly supplying. Then make check-e1.
make check-e1 verifies three artifacts: replicate.sh produces a
pass verdict from a genuinely fresh clone, the gate workflow exists and
references both the contract suite and the referee, and the deliberate
contract break turns the gate red. This is what E2 schedules and what F1
ships as make replicate.
make check-e1 advances E1
The clone is the clean room; it must trust nothing the laptop was supplying by hand.
Pin the fields your harness reads — structured output without a schema is parsing roulette.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
Headless runs resolve config differently than interactive ones — different working directory, no shell profile, none of your environment’s quiet conveniences. The clean-room rebuild that passes on your laptop and fails in CI has usually found a real dependency you were supplying by hand. Test the headless path explicitly, in a fresh clone, before you trust the verdict — the divergence is the finding, not a nuisance.
- [both]
Structured output without a schema is parsing roulette. “Return JSON” with no contract returns a JSON shape — a different one when the run goes sideways and apologizes in prose instead. Pin the fields your harness reads, validate them, and fail closed when the shape is wrong. A replication verdict you cannot parse is a replication you did not run.
- [both]
CI gates that only warn train contributors to ignore them — C2’s lesson, again, one level up. A referee comment that never blocks a merge is read once and then never; a check that goes red and stops the button is read every time. Integrity gates block: the exit code is the verdict, the same as it was for hooks.
- [CX]
A resumed run inherits the prior run’s state, not a clean slate — resuming is for continuing an interrupted job, never for re-asking a question you want answered fresh. If the replication self-test must trust nothing, it starts a new run in a new clone, not a resume of the one that already saw the answer.
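The structured-output pitfall above can be made concrete with a small validator that fails closed. This is a sketch assuming the verdict shape used earlier in this lesson (a verdict string plus a list of checks, each with a boolean ok); parse_verdict and VerdictShapeError are hypothetical names.

```python
import json
from typing import Any, Dict


class VerdictShapeError(ValueError):
    """Raised when a replication verdict does not match the pinned shape."""


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Parse and validate a replication verdict, raising on any surprise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise VerdictShapeError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise VerdictShapeError("top level must be an object")
    if data.get("verdict") not in ("pass", "fail"):
        raise VerdictShapeError("verdict must be 'pass' or 'fail'")
    checks = data.get("checks")
    if not isinstance(checks, list) or not checks:
        raise VerdictShapeError("checks must be a non-empty list")
    for i, check in enumerate(checks):
        if not isinstance(check, dict) or not isinstance(check.get("ok"), bool):
            raise VerdictShapeError(f"checks[{i}] needs a boolean 'ok'")
    # A pass verdict that contains a failed check contradicts itself.
    if data["verdict"] == "pass" and not all(c["ok"] for c in checks):
        raise VerdictShapeError("verdict is 'pass' but a check failed")
    return data
```

Treating an unparseable or self-contradictory verdict as an error, not as a fail, keeps "the run broke" distinguishable from "the numbers moved" in the journal.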
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
Parity note
Headless invocation is genuine parity: both tools expose a print/exec mode that takes a prompt and emits a structured JSON envelope, both offer a resumable session by id, and both ship a GitHub Action that runs the agent inside a CI runner. The SDKs are where the surfaces diverge — Claude Code’s Agent SDK ships for both Python and TypeScript, while the Codex SDK is TypeScript-only as of this date, which matters if your replication harness wants to live next to your Python analysis code. The on-demand PR-review trigger (mention the reviewer on a pull request) is a Codex convenience with no exact Claude Code analogue; the scheduled workflow gate is symmetric. See the parity matrix for the dated detail.