E1 advanced ~45 min

Agents as Functions: Headless & CI

Absorbs: the reproducibility checker

Advances E1

The Pain

The reviewer’s email is two lines long: “We could not reproduce Table 3 from the replication archive; the airport-zone elasticities differ in the third decimal. Please advise.” You open the archive you submitted four months ago, the one you were certain ran clean, and you cannot make it run at all. The script reaches for a warehouse that was rebuilt twice since, a column that got renamed in the C2 incident, a random seed you set in a notebook cell that no longer exists. Somewhere between the result and the archive, the result stopped being a thing the archive produces and became a thing you remember producing.

The honest fix is the one no project ever budgets for: every few weeks, someone takes the replication package into a clean room — a fresh machine, an empty directory, nothing from your shell history — and rebuilds the whole analysis from scratch, then checks that the numbers that come out are the numbers you published. It is dull, exacting work, the kind a careful lab assigns to a person whose only job is to not trust the lab. You have never had that person. You have had the version of yourself who, at submission, was too relieved to be done to go back and prove that done meant reproducible. The two are not the same, and the gap between them is exactly where a reviewer lives.

Why / When

Everything in this course so far has been interactive: you at the keyboard, the agent answering, judgment passing back and forth. This lesson removes the keyboard. An agent invoked headlessly is a function — given a prompt and a working directory, it runs to completion with no human in the loop and returns a structured result a program can read: a JSON object, an exit code, a verdict.

That single change is what makes the next two lessons possible. A function can be called — by a continuous-integration runner on every pull request, by a scheduler at 3 a.m. (E2), by a replication package a stranger downloads (the whole point of E3). The pipeline stage this serves is not analysis; it is everything that wraps the analysis — the gates, the reruns, the proofs that the work holds. The lab role it absorbs is the reproducibility checker: the person who takes the archive into a clean room and proves, mechanically, that the published numbers fall out of it. A function does not get relieved and skip the check. It returns pass or it returns fail, and it does it the same way every time.

Contrary winds

Not for: a quick interactive question you'll read and act on yourself — wrapping a one-off lookup in a structured-output harness is ceremony, not reproducibility.

Mechanics

Both tools expose the same two surfaces for non-interactive use: a command-line headless mode that takes a prompt and emits structured output, and an SDK for embedding the agent in your own program. The shared shape first; the dialects below.

A function has a contract

An interactive session is a conversation; a headless run is a function call, and a function without a typed return is a function you cannot build on. Three properties separate a callable agent from a transcript you happen to have automated:

Structured output — the run emits machine-readable result data (JSON), not prose you re-parse with a regex. The shape is the contract; downstream code reads fields, not sentences.
A determinate exit — the process exits non-zero on failure, so a caller (a shell, a CI step) can branch on success without reading the output at all.
Resumability — a long job that dies partway can be continued from where it stopped, by session id, rather than restarted from nothing.

Hold the structured-output property; the reproducibility self-test below is built entirely on it, and the CI gates after that are just callers.
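As a rough illustration of the first two properties, here is a minimal Python sketch of a caller that treats a headless run as a function. The names `run_headless` and `HeadlessRunError` are hypothetical, and the command is passed in as a list so the sketch assumes nothing about a particular CLI's flags. It assumes only that the process prints one JSON object on stdout and exits non-zero on failure.

```python
import json
import subprocess
import sys
from typing import List, Optional


class HeadlessRunError(RuntimeError):
    """Raised when a headless run exits non-zero or returns unparseable output."""


def run_headless(cmd: List[str], cwd: Optional[str] = None, timeout: float = 600.0) -> dict:
    """Run a headless command and return its JSON result as a dict."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout)
    # Determinate exit: branch on the exit code before reading any output.
    if proc.returncode != 0:
        raise HeadlessRunError(f"exit {proc.returncode}: {proc.stderr.strip()}")
    # Structured output: stdout must be a single JSON object, not prose.
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise HeadlessRunError(f"stdout was not JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise HeadlessRunError("expected a JSON object at the top level")
    return result


if __name__ == "__main__":
    # A stand-in process so the sketch runs without any agent installed.
    fake = [sys.executable, "-c", "import json; print(json.dumps({'verdict': 'pass'}))"]
    print(run_headless(fake)["verdict"])
```

In real use the list would hold the tool's own headless command; the point of the design is that the caller never inspects prose, only the exit code and named fields.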

Claude Code

The headless entry point is the -p (print) flag: a prompt in, a result out, no session UI. Ask for JSON and you get a typed envelope.

claude -p "Rebuild the warehouse from scripts/ and report row counts per table" \
  --output-format json

The --output-format json envelope carries the agent’s final message, the session id, and run metadata (tokens, cost, turn count) as fields — parse it with jq, not a regex. Two flags make it production-grade:

--output-format stream-json emits one JSON object per event for long runs you want to watch live (the CI log, a progress bar).
--resume <session-id> continues a run that timed out or that you deliberately checkpointed — the session id from a prior run’s envelope is the handle.
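The resume handle can be pulled out of a saved envelope programmatically. The sketch below assumes the envelope stores the id under a `session_id` key and that `--resume` can be combined with `-p` in this argument order; check both against the envelope and CLI version you actually run. The helper name `resume_command` is hypothetical.

```python
import json
from typing import List


def resume_command(envelope_json: str, prompt: str) -> List[str]:
    """Build the argv for continuing a prior headless run from its saved envelope."""
    envelope = json.loads(envelope_json)
    # Assumption: the session id lives under "session_id" in the envelope.
    session_id = envelope.get("session_id") if isinstance(envelope, dict) else None
    if not isinstance(session_id, str) or not session_id:
        raise ValueError("envelope has no session_id; cannot resume")
    return ["claude", "--resume", session_id, "-p", prompt, "--output-format", "json"]


if __name__ == "__main__":
    saved = '{"session_id": "abc-123", "result": "timed out during rebuild"}'
    print(resume_command(saved, "Continue the rebuild from where it stopped"))
```

Returning an argument list rather than a shell string keeps the prompt from being re-split or re-quoted by a shell on the way to the process.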

For embedding the agent inside a program — a replication harness, a custom gate — reach for the Agent SDK (Python and TypeScript). The SDK is the same engine the CLI wraps, exposing the run as an async call that yields structured messages, so your harness can inspect each step and enforce its own contract on the result:

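import json

from claude_agent_sdk import ClaudeAgentOptions, ResultMessage, query   # the Agent SDK


async def rebuild(workdir: str) -> dict:
    options = ClaudeAgentOptions(cwd=workdir, permission_mode="acceptEdits")
    prompt = (
        "Rebuild warehouse.duckdb from scripts/, then print the row "
        "count of every table as a single JSON object and nothing else."
    )
    result = None
    # query() yields structured messages; the final ResultMessage carries the answer
    async for message in query(prompt=prompt, options=options):
        if isinstance(message, ResultMessage):
            result = message.result
    if result is None:
        raise RuntimeError("run ended without a result")
    return json.loads(result)   # the contract: fails loudly if the answer is not JSON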

Codex

The headless entry point is the exec subcommand: a prompt in, a result out, no session UI. Ask for JSON and you get a typed envelope.

codex exec "Rebuild the warehouse from scripts/ and report row counts per table" \
  --json

The --json envelope carries the agent’s final message, the thread id, and run metadata (tokens, cost, turn count) as fields — parse it with jq, not a regex. Two affordances make it production-grade:

the streamed form emits one JSON object per event for long runs you want to watch live (the CI log, a progress bar).
exec resume <thread-id> continues a run that timed out or that you deliberately checkpointed — the thread id from a prior run’s envelope is the handle.

For embedding the agent inside a program — a replication harness, a custom gate — reach for the Codex SDK (TypeScript). The SDK is the same engine the CLI wraps, exposing the run as an async call that yields structured events, so your harness can inspect each step and enforce its own contract on the result:

import { Codex } from "@openai/codex-sdk";

async function rebuild(workdir: string): Promise<unknown> {
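  const codex = new Codex();
  // Working directory and sandbox are set per thread; confirm these option names against your SDK version.
  const thread = codex.startThread({ workingDirectory: workdir, sandboxMode: "workspace-write" });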
  const result = await thread.run(
    "Rebuild warehouse.duckdb from scripts/, then print the row count " +
      "of every table as JSON.",
  );
  return JSON.parse(result.finalResponse); // the contract: typed result
}

The reproducibility self-test

Here is the research payoff, and it is the same in both tools because it is a discipline, not a feature. Point a headless run at a fresh clone of the repository — an empty directory, nothing from your machine — and ask it to rebuild the analysis from the replication scripts alone, then compare the numbers it produces against the numbers you published. The run emits one structured verdict:

{
  "verdict": "pass",
  "rebuilt_from": "scripts/replicate.sh on a fresh clone (no warehouse cached)",
  "checks": [
    { "table": "trips_raw",        "expected_rows": 69804771, "got_rows": 69804771, "ok": true },
    { "metric": "precip_elasticity_jfk", "published": 0.0087, "rebuilt": 0.0087, "abs_diff": 0.0, "ok": true },
    { "metric": "precip_elasticity_lga", "published": 0.0091, "rebuilt": 0.0091, "abs_diff": 0.0, "ok": true }
  ]
}

The verdict is a function of the clean room, not of your memory. If a script reaches for a warehouse that only exists on your laptop, the fresh clone has no warehouse and the rebuild fails loudly, here, weeks before a reviewer finds it. This is precisely F1’s make replicate — the reproducibility checker, made mechanical: a clean-room rebuild that returns pass or fail, run as a function, trusting nothing it cannot regenerate from the committed scripts.
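To make the shape of that verdict concrete, here is a hedged sketch of the comparison step. The function name `build_verdict` and the tolerance default are assumptions for illustration, not the course's actual check script; the metric names and values are the ones from the example verdict above.

```python
import json
import math
from typing import Dict


def build_verdict(published: Dict[str, float], rebuilt: Dict[str, float],
                  tol: float = 5e-5) -> dict:
    """Compare rebuilt estimates to published ones and return a pass/fail verdict."""
    checks = []
    for metric in sorted(published):
        expected = published[metric]
        got = rebuilt.get(metric)
        if got is None or math.isnan(got):
            # A metric the clean room could not regenerate is a failure, not a skip.
            checks.append({"metric": metric, "published": expected,
                           "rebuilt": None, "abs_diff": None, "ok": False})
            continue
        diff = abs(got - expected)
        checks.append({"metric": metric, "published": expected,
                       "rebuilt": got, "abs_diff": diff, "ok": diff <= tol})
    # An empty check list never passes: nothing verified is not the same as verified.
    passed = bool(checks) and all(c["ok"] for c in checks)
    return {"verdict": "pass" if passed else "fail", "checks": checks}


if __name__ == "__main__":
    published = {"precip_elasticity_jfk": 0.0087, "precip_elasticity_lga": 0.0091}
    print(json.dumps(build_verdict(published, dict(published)), indent=2))
```

The tolerance is a project decision: the default here is half a unit in the fourth decimal place, so a drift in the third decimal like the reviewer's would fail.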

The shell that wraps it is language-neutral — the same script whether the analysis inside is Python or R — so it belongs to the project, not to a language:

set -euo pipefail
work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT                       # registered first, so a failed clone still cleans up
git clone --depth 1 "file://$PWD/.git" "$work"   # a true fresh clone

# the agent rebuilds in the clean room and writes the structured verdict
cd "$work"
bash scripts/rebuild_warehouse.sh          # no cached warehouse exists here
python scripts/check_replication.py \
  --published "$OLDPWD/results/published_estimates.json" \
  --out results/replication_verdict.json

test "$(jq -r .verdict results/replication_verdict.json)" = pass

The test on the last line is the determinate exit: the script succeeds only when the verdict is pass, so any caller — a CI runner, a scheduler — branches on it without reading a word.
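A fresh clone empties the directory but not the environment: variables exported in your shell still reach the child process. One way to close that gap, sketched here under the assumption of a POSIX system, is to run the rebuild with a scrubbed environment. The helper `run_in_clean_room` and the `WAREHOUSE_PATH` variable are hypothetical.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def run_in_clean_room(cmd: List[str], workdir: str) -> subprocess.CompletedProcess:
    """Run cmd in workdir with only a minimal environment passed through."""
    clean_env = {
        "PATH": os.environ.get("PATH", os.defpath),  # still needed to find tools
        "HOME": workdir,                             # no dotfiles from the real home
        "LANG": "C.UTF-8",
    }
    return subprocess.run(cmd, cwd=workdir, env=clean_env,
                          capture_output=True, text=True)


if __name__ == "__main__":
    os.environ["WAREHOUSE_PATH"] = "/laptop/only"    # a quiet laptop convenience
    probe = [sys.executable, "-c",
             "import os; print(os.environ.get('WAREHOUSE_PATH', 'missing'))"]
    with tempfile.TemporaryDirectory() as room:
        print(run_in_clean_room(probe, room).stdout.strip())
```

A script that silently depended on the exported variable now fails in the clean room, which is the finding the self-test exists to surface.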

CI gates: the function as a merge gate

A reproducibility verdict is only worth producing if something acts on it. The caller that matters most is continuous integration: every pull request that touches src/ runs the project’s gates headlessly and blocks the merge if they fail. This is C2’s lesson at the scale of the whole repository — a warning trains contributors to ignore it; a gate that blocks the merge cannot be ignored. Two gates run on every such PR, both as headless functions:

the C2 contract suite — does every transform still honor its data contract;
the D4 referee — the adversarial review pass, run as an isolated subagent over the PR’s diff, flagging the leakage and metric-gaming that no contract can name.

Evaluation becomes a merge gate. The dialects differ only in where the workflow lives and who triggers it.

Claude Code

The gate is a GitHub Actions workflow built on claude-code-action, which runs the agent headlessly inside the runner. It triggers on any PR touching src/, runs the two gates as headless calls, and fails the check (blocking the merge) when either gate fails:

on:
  pull_request:
    paths: ["src/**"]
jobs:
  contracts-and-referee:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}   # assumes a repository secret with this name
          prompt: |
            Run scripts/validate_contracts.py over the warehouse and the
            D4 referee over this PR's diff. Emit results/gate_verdict.json
            with {"contracts": "...", "referee": "..."} and exit non-zero
            if either fails.

The action checks out the PR, runs the agent headlessly with that prompt, and the job’s exit status is the merge gate. The referee’s findings post back to the PR as a review comment, so a flagged spec is visible in the diff, not buried in a log.

Codex

The gate is a GitHub Actions workflow built on the Codex GitHub Action, which runs the agent headlessly inside the runner. It triggers on any PR touching src/, runs the two gates as headless calls, and fails the check (blocking the merge) when either gate fails:

on:
  pull_request:
    paths: ["src/**"]
jobs:
  contracts-and-referee:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: openai/codex-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}   # assumes a repository secret with this name
          prompt: |
            Run scripts/validate_contracts.py over the warehouse and the
            D4 referee over this PR's diff. Emit results/gate_verdict.json
            with {"contracts": "...", "referee": "..."} and exit non-zero
            if either fails.

There is a second, lighter trigger: mention the reviewer on a pull request and it performs the referee pass in the review thread — a human-initiated gate rather than a scheduled one. The two compose: the workflow blocks the merge mechanically; the on-demand PR review puts the referee’s findings inline where a maintainer reads them.

The gates are the same suite

What runs inside both workflows is identical — validate_contracts.py from C2 and the referee skill from D4, called as functions. The CI integration adds no new judgment; it adds a caller that runs the existing judgment on every PR, headlessly, with the authority to block. That is the whole move of this lesson: the work you already wrote becomes a function, and a function can be called by something other than you.
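The blocking step itself can be a few lines that turn the gate file into an exit code. This is a sketch under assumptions: the workflows above leave the values in results/gate_verdict.json unspecified, so it supposes each gate reports the string "pass" or "fail", and the function name `gate_exit_code` is hypothetical.

```python
import json
import sys
import tempfile
from pathlib import Path

REQUIRED_GATES = ("contracts", "referee")


def gate_exit_code(verdict_path: Path) -> int:
    """Return 0 if every required gate passed, 1 if any failed, 2 if unreadable."""
    try:
        verdict = json.loads(verdict_path.read_text())
    except (OSError, json.JSONDecodeError):
        return 2  # a missing or unparseable verdict blocks the merge too
    if not isinstance(verdict, dict):
        return 2
    # Assumption: each gate reports the string "pass" or "fail".
    failed = [gate for gate in REQUIRED_GATES if verdict.get(gate) != "pass"]
    return 1 if failed else 0


if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(gate_exit_code(Path(sys.argv[1])))
    # No path given: demonstrate on a temporary file.
    with tempfile.TemporaryDirectory() as d:
        demo = Path(d) / "gate_verdict.json"
        demo.write_text(json.dumps({"contracts": "pass", "referee": "fail"}))
        print(gate_exit_code(demo))
```

Run as a final workflow step with the verdict path as its argument, a non-zero return fails the job; a gate missing from the file counts as failed rather than skipped.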

Guided Run — The Clean Room

Field Terminal — session: e1-headless Claude Code

claude -p "run scripts/replicate.sh and report the verdict" --output-format json


Field Assignment

Artifact: make check-e1 passes (headless replication verdict is `pass`, both CI gates wired)

Turn the analysis into a function and prove it reproduces, then make the proof a merge gate. The deliverable is a pass verdict from a clean room and two gates that block.

Claude Code

Write scripts/replicate.sh: clone the repo into a temp dir, rebuild the warehouse from scripts alone (no cached warehouse), and run the replication check that emits results/replication_verdict.json.
Invoke the rebuild headlessly with claude -p … --output-format json and confirm the envelope parses — final message, session id, cost — and that the verdict field reads pass.
Add .github/workflows/gates.yml built on claude-code-action: on any PR touching src/, run the C2 contract suite and the D4 referee headlessly; fail the check if either gate fails.
Open a throwaway PR that breaks a contract on purpose and confirm the gate goes red — a gate you have not watched fail is a gate you do not have.
File the run in journal/: the verdict, the cost, and one thing the fresh clone needed that your laptop had been quietly supplying. Then make check-e1.

Codex

Write scripts/replicate.sh: clone the repo into a temp dir, rebuild the warehouse from scripts alone (no cached warehouse), and run the replication check that emits results/replication_verdict.json.
Invoke the rebuild headlessly with codex exec … --json and confirm the envelope parses — final message, thread id, cost — and that the verdict field reads pass. Resume it once with exec resume to prove the run is a continuable function.
Add .github/workflows/gates.yml built on the Codex Action: on any PR touching src/, run the C2 contract suite and the D4 referee headlessly; fail the check if either gate fails.
Open a throwaway PR that breaks a contract on purpose and confirm the gate goes red — a gate you have not watched fail is a gate you do not have.
File the run in journal/: the verdict, the cost, and one thing the fresh clone needed that your laptop had been quietly supplying. Then make check-e1.

make check-e1 verifies three artifacts: replicate.sh produces a pass verdict from a genuinely fresh clone, the gate workflow exists and references both the contract suite and the referee, and the deliberate contract break turns the gate red. This is what E2 schedules and what F1 ships as make replicate.

Milestone gate · make check-e1 · advances E1

scripts/replicate.sh rebuilds the analysis in a genuinely fresh clone (no cached warehouse)
The clone is the clean room — it must trust nothing the laptop was supplying by hand.
The headless run emits a parseable structured envelope and a `pass` replication verdict
Pin the fields your harness reads — structured output without a schema is parsing roulette.
.github/workflows/gates.yml runs the C2 contract suite and the D4 referee headlessly on any PR touching src/
A deliberate contract break turns the gate red — a gate you have not watched fail is a gate you do not have

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both]

Headless runs resolve config differently than interactive ones — different working directory, no shell profile, none of your environment’s quiet conveniences. The clean-room rebuild that passes on your laptop and fails in CI has usually found a real dependency you were supplying by hand. Test the headless path explicitly, in a fresh clone, before you trust the verdict — the divergence is the finding, not a nuisance.
[both]

Structured output without a schema is parsing roulette. “Return JSON” with no contract returns a JSON shape — a different one when the run goes sideways and apologizes in prose instead. Pin the fields your harness reads, validate them, and fail closed when the shape is wrong. A replication verdict you cannot parse is a replication you did not run.
[both]

CI gates that only warn train contributors to ignore them — C2’s lesson, again, one level up. A referee comment that never blocks a merge is read once and then never; a check that goes red and stops the button is read every time. Integrity gates block: the exit code is the verdict, the same as it was for hooks.
[CX]

A resumed run inherits the prior run’s state, not a clean slate — resuming is for continuing an interrupted job, never for re-asking a question you want answered fresh. If the replication self-test must trust nothing, it starts a new run in a new clone, not a resume of the one that already saw the answer.
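The "pin the fields and fail closed" advice from the structured-output pitfall can be sketched in a few lines. The field names follow the replication verdict shown earlier in this lesson; the function name `parse_verdict` and the consistency rule in its last check are assumptions about what a harness might enforce.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Parse a replication verdict, raising ValueError on any unexpected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"verdict is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if data.get("verdict") not in ("pass", "fail"):
        raise ValueError("field 'verdict' must be 'pass' or 'fail'")
    checks = data.get("checks")
    if not isinstance(checks, list) or not checks:
        raise ValueError("field 'checks' must be a non-empty list")
    for i, check in enumerate(checks):
        if not isinstance(check, dict) or not isinstance(check.get("ok"), bool):
            raise ValueError(f"checks[{i}] needs a boolean 'ok' field")
    # Fail closed on self-contradiction: a pass with a failed check is not a pass.
    if data["verdict"] == "pass" and not all(c["ok"] for c in checks):
        raise ValueError("verdict says pass but at least one check failed")
    return data


if __name__ == "__main__":
    good = '{"verdict": "pass", "checks": [{"metric": "precip_elasticity_jfk", "ok": true}]}'
    print(parse_verdict(good)["verdict"])
```

Anything the parser rejects should be treated by the caller as a failed replication, never as a skipped one.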

Check Your Bearings

E1 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

Record your first headless replication run: the verdict, the run cost, and the one dependency the fresh clone exposed that your laptop had been supplying silently.

as of June 2026

Headless invocation is genuine parity: both tools expose a print/exec mode that takes a prompt and emits a structured JSON envelope, both offer a resumable session by id, and both ship a GitHub Action that runs the agent inside a CI runner. The SDKs are where the surfaces diverge — Claude Code’s Agent SDK ships for both Python and TypeScript, while the Codex SDK is TypeScript-only as of this date, which matters if your replication harness wants to live next to your Python analysis code. The on-demand PR-review trigger (mention the reviewer on a pull request) is a Codex convenience with no exact Claude Code analogue; the scheduled workflow gate is symmetric. See the parity matrix for the dated detail.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall: the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster, but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes: the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated: the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.
