C2 · intermediate · ~60 min

Rules That Enforce Themselves

Absorbs: the data manager / QA who never sleeps

Advances C2

The Pain

It’s week six. The February file arrived on schedule, the download script verified its checksum, and the ingest ran green: 2.9 million rows, exit code zero. Somewhere inside that file, a column had changed its name — by a single capital letter. Your pipeline, asked for airport_fee and finding nothing under that spelling, did what pipelines do — filled the column with nulls and moved on. Nothing crashed. The dashboard refreshed. The elasticity estimates updated, a little smoother than before, because the panel had quietly lost JFK and LaGuardia — the two places where weather bites hardest.

For eleven days the numbers drifted. You presented on a Tuesday. Your adviser asked, mildly, why the airport zones had gone dark in the heterogeneity table, and you heard yourself say you’d look into it, in the voice of someone whose next three evenings are already spoken for. The post-mortem was short and humiliating: the rename was documented. It sat in a release note the whole time.

A real lab keeps a person for this. The data manager re-checks every load before it counts, trusts no file twice, and reads release notes the morning they appear. The role demands nothing brilliant — only that someone is always looking. Your lab does not have that person. Your lab has you, and that week you were busy being the methodologist, and the week before that, the RA. The vigilance isn’t difficult. It’s permanent. And permanence is exactly what you cannot staff.

Why / When

A hook is a rule that runs automatically at a fixed point in the agent’s working loop — before a tool call, after a write, at the start of a session — and can block what it doesn’t like. This completes the unit’s taxonomy:

Mechanism	Nature	Runs
instruction file	always-on context	every session, as advice
skill	on-demand procedure	when invoked
hook	enforced rule	always, mechanically

The distinction that matters is prompted versus enforced. A prompted guideline — “please validate schemas before writing” — is advice the model weighs against everything else it is juggling, and under pressure advice loses. C1’s cleaning skill carries your validation logic, but a skill runs when someone remembers to invoke it. A hook runs every time, on a trigger you chose, with the authority to reject the result. Advice versus law.

In the research pipeline, hooks guard the ingestion and transformation stages — the places where silent corruption enters. The lab role they absorb is the data manager: permanent vigilance replaced by a few lines of configuration that run on every write, at 2 p.m. or 2 a.m., tired or not.

Contrary winds

Not for: one-off scripts you'll delete tomorrow — a hook outlives the session, so don't legislate throwaway work.

Mechanics

Both tools implement the same hook model with different configuration surfaces. The shared mechanics first; the dialects below.

The hook model

A hook binds three things: an event in the agent’s lifecycle, an optional matcher narrowing which tool calls it applies to, and a command — your script, which receives a JSON description of the tool call on stdin and answers with an exit code. The event vocabulary is the same in both tools:

PreToolUse — fires before a tool call runs; can veto it.
PostToolUse — fires after; can reject the result and report why.
SessionStart — fires when a session opens; its stdout becomes context the agent reads (a briefing, not a gate).
SubagentStop / Stop — fire when a subagent or the session winds down; the natural place for end-of-run audits.

System Player film — Hook Lifecycle


The agent is mid-loop, working on the cleaning pipeline. The rules it works under are not in its head — they are a hooks block committed to .claude/settings.json, so they arrive with the clone.

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
        "command": "python scripts/hooks/on_transform_change.py" }]
    }]
  }
}

Blocking vs advisory

Exit codes are verdicts. Exit 0 is a pass: silence, the work proceeds. Exit 2 is the blocking verdict. In PreToolUse, the call is stopped cold before the tool runs. In PostToolUse, the write has already happened, so the agent is halted on the spot and cannot build on that write until the failure is addressed. In both cases, everything your script printed to stderr is fed back into the agent’s context, so the gate and the engineer talk to each other. Any other non-zero exit is advisory: surfaced to you, ignored by the machinery. The C2 rule of thumb: integrity gates block; everything else is a comment.
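The verdict protocol can be sketched in a few lines. This is a minimal illustration, assuming the tool call arrives as a JSON object on stdin; the names `verdict`, `no_empty_path`, and `main` are hypothetical and belong to neither tool.

```python
import json
import sys

PASS, ADVISORY, BLOCK = 0, 1, 2  # exit 0 passes, exit 2 blocks, other non-zero is advisory


def verdict(payload_text, checks, blocking=True):
    """Run each check on the parsed tool call; return (exit_code, stderr_text)."""
    payload = json.loads(payload_text)
    failures = [msg for check in checks if (msg := check(payload)) is not None]
    if not failures:
        return PASS, ""
    # Integrity gates block; anything else is only a comment for the human.
    return (BLOCK if blocking else ADVISORY), "\n".join(failures)


def no_empty_path(payload):
    """Hypothetical check: complain when a file tool call carries no path."""
    path = payload.get("tool_input", {}).get("file_path")
    return None if path else "tool call has no file_path"


def main(stdin_text):
    """Return the exit code for one hook invocation, reporting failures on stderr."""
    code, report = verdict(stdin_text, [no_empty_path])
    if report:
        print(report, file=sys.stderr)  # on exit 2 this text is fed back to the agent
    return code

# In a real hook file the last line would be: sys.exit(main(sys.stdin.read()))
```

Keeping the decision in a function that returns an exit code, rather than calling `sys.exit` deep inside the logic, makes the gate itself testable without launching an agent.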

Where hooks live

In Claude Code, hooks are a hooks block in .claude/settings.json (user, project, and local layers merge). In Codex, they live in hooks.json at the repo root or the [hooks] table of config.toml, stable since v0.124 (as of June 2026). Same events, same verdicts — only the surface differs.

Four research-native recipes

The same four rules, one per event pattern, in either dialect. No linter examples here — these guard data:

Raw is read-only — PreToolUse blocks any write into data/raw/ (B3’s permission rule, doubled at a different layer; belt and suspenders are a research instrument).
Transforms are contracted — PostToolUse runs the validation suite whenever anything under src/transforms/ changes.
Sessions open with a briefing — SessionStart prints the data version hash and per-table row counts, so every session starts knowing what the warehouse holds.
Row deltas are bounded — PostToolUse checks that a warehouse write changed row counts by a plausible amount; a month of yellow cabs is millions of rows, not forty and not forty million.
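The last recipe reduces to one comparison. The sketch below shows that comparison only, under stated assumptions: `check_row_delta` and `BOUNDS` are hypothetical names, the before and after counts are passed in rather than read from a ledger or warehouse, and the bounds simply reuse the plausible-month range that appears in the validation suite later in this lesson.

```python
def check_row_delta(table, before, after, bounds):
    """Return None if the row-count change is plausible, else a failure message.

    bounds maps a table name to (min_delta, max_delta); tables without an
    entry are not checked.
    """
    if table not in bounds:
        return None
    low, high = bounds[table]
    delta = after - before
    if low <= delta <= high:
        return None
    return f"{table}: row delta {delta:,} outside [{low:,}, {high:,}]"


# Illustrative bounds: one appended yellow-cab month should add millions of rows.
BOUNDS = {"trips_raw": (2_000_000, 4_800_000)}

print(check_row_delta("trips_raw", 0, 2_900_000, BOUNDS))   # plausible month
print(check_row_delta("trips_raw", 0, 40, BOUNDS))          # forty rows: rejected
```

A hook built on this would exit 2 whenever the function returns a message and print that message to stderr.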

Claude Code

All four recipes are one hooks block in the project layer of .claude/settings.json — committed, so the rules arrive with the clone:

{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
        "command": "python scripts/hooks/guard_raw.py" }]
    }],
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
        "command": "python scripts/hooks/on_transform_change.py" }]
    }, {
      "matcher": "Bash",
      "hooks": [{ "type": "command",
        "command": "python scripts/hooks/bound_row_deltas.py" }]
    }],
    "SessionStart": [{
      "hooks": [{ "type": "command",
        "command": "python scripts/hooks/session_brief.py" }]
    }]
  }
}

The walkthrough, recipe by recipe:

guard_raw.py (recipe ①) reads the tool-call JSON from stdin, extracts tool_input.file_path, and exits 2 with a one-line stderr if the path resolves under data/raw/. The matcher "Write|Edit" scopes it to file-editing tools; shell writes are already fenced by B3’s permission rules — two layers, deliberately.
on_transform_change.py (recipe ②) checks whether the edited path is under src/transforms/ and, if so, runs scripts/validate_contracts.py (the suite below) against the warehouse. Anything else exits 0 immediately — the scoping lives in the script, so ls never pays for it.
session_brief.py (recipe ③) prints data/raw/SHA256SUMS’s digest and SELECT month, count(*) FROM trips_raw GROUP BY 1 to stdout. SessionStart stdout becomes context: the agent starts each session knowing exactly which data version it is standing on.
bound_row_deltas.py (recipe ④) compares post-write row counts against a ledger kept in results/row_ledger.json and exits 2 when a delta falls outside per-table bounds.
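As an illustration of the first script in the walkthrough, here is one way guard_raw.py could be written. It is a sketch, not the course's file: the helper names `is_under_raw` and `guard` are hypothetical, and it assumes the payload carries the path at `tool_input.file_path` as described above.

```python
import json
import sys
from pathlib import Path

RAW_DIR = Path("data/raw")


def is_under_raw(file_path, project_root="."):
    """True when file_path resolves to data/raw/ itself or anything beneath it."""
    root = Path(project_root).resolve()
    target = (root / file_path).resolve()   # an absolute file_path replaces root; ".." is collapsed
    raw = (root / RAW_DIR).resolve()
    return target == raw or raw in target.parents


def guard(payload_text, project_root="."):
    """Return the exit code for one tool call: 2 to veto, 0 to allow."""
    payload = json.loads(payload_text)
    file_path = payload.get("tool_input", {}).get("file_path", "")
    if file_path and is_under_raw(file_path, project_root):
        print(f"BLOCKED: {file_path} is under data/raw/, which is read-only",
              file=sys.stderr)
        return 2
    return 0

# In a real hook file the last line would be: sys.exit(guard(sys.stdin.read()))
```

Resolving the path before comparing matters: a plain string-prefix test would miss `src/../data/raw/x.parquet` and would wrongly match a sibling such as `data/raw_notes.md`.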

One realistic extension, and it is Claude Code’s alone: hooks whose verdict comes from a model rather than a script. A prompt-based hook can gate on judgment calls — “does this edit change the sample definition?” — where no regex can. Treat them as a preview of D4’s referee: enforcement for rules, adversarial judgment for everything rules can’t name.

Codex

The same four recipes declare in hooks.json at the repo root, or equivalently in the [hooks] table of config.toml. The TOML form, in the project-layer team config so the rules arrive with the clone:

[hooks]
session_start = [
  { command = "python scripts/hooks/session_brief.py" },
]
pre_tool_use = [
  { matcher = "write", command = "python scripts/hooks/guard_raw.py" },
]
post_tool_use = [
  { matcher = "write", command = "python scripts/hooks/on_transform_change.py" },
  { matcher = "shell", command = "python scripts/hooks/bound_row_deltas.py" },
]

The event vocabulary is identical to Claude Code’s — PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop — written in TOML’s snake_case. The verdict protocol is identical too: the hook command reads the tool-call JSON on stdin, exit 0 passes, exit 2 blocks and returns stderr to the agent. The four scripts are the same files; the dialect difference is confined entirely to this table.

guard_raw.py (recipe ①) vetoes file writes under data/raw/; the matcher = "write" entry scopes it to file-editing tools, while shell stays fenced by the B3 sandbox profile.
on_transform_change.py (recipe ②) self-scopes to src/transforms/ paths and runs the contract suite on a match.
session_brief.py (recipe ③) prints the data-version digest and row counts; session_start stdout is read into the agent’s context.
bound_row_deltas.py (recipe ④) bounds warehouse row deltas after shell writes.
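The self-scoping in recipe ② could look like the sketch below. The function names `in_scope` and `on_change` are hypothetical, and the sketch assumes the edited path is available at `tool_input.file_path` in the stdin JSON; check the payload your CLI version actually sends before relying on that field.

```python
import json
import subprocess
import sys
from pathlib import PurePosixPath

TRANSFORMS = PurePosixPath("src/transforms")
SUITE = [sys.executable, "scripts/validate_contracts.py"]


def in_scope(file_path):
    """Cheap check: is the edited file inside src/transforms/?"""
    parts = PurePosixPath(file_path.replace("\\", "/")).parts
    n = len(TRANSFORMS.parts)
    # Look for the src/transforms pair anywhere in the path, followed by at
    # least one more component, so relative and absolute paths both match.
    return any(parts[i:i + n] == TRANSFORMS.parts for i in range(len(parts) - n))


def on_change(payload_text, run=subprocess.run):
    """Exit code for one edit: 0 when out of scope, else the suite's own code."""
    file_path = json.loads(payload_text).get("tool_input", {}).get("file_path", "")
    if not in_scope(file_path):
        return 0                      # a README typo never pays for the suite
    return run(SUITE).returncode      # the suite exits 2 on a broken contract

# In a real hook file the last line would be: sys.exit(on_change(sys.stdin.read()))
```

Because the child process inherits stderr, the suite's failure report passes straight through the hook. The `run` parameter exists only so the scoping logic can be exercised without a warehouse.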

The extension that matters here is layering: know which config layer owns hooks. Project-layer declarations and your personal ~/.codex/config.toml merge per event — entries from every layer run. A repo-shipped contract gate therefore cannot be silently disabled by a user-level table that never mentions post_tool_use; but the reverse also holds, and the session_start briefing you forgot you declared globally will follow you into every project until you go looking for it.
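A toy model makes the per-event merge concrete. This is not Codex's implementation: `merge_hook_layers` is a hypothetical function, the layers are plain dictionaries, the global command path is invented for the example, and the concatenation order shown (user layer first) is an assumption rather than the documented order.

```python
def merge_hook_layers(*layers):
    """Concatenate hook entries per event across config layers, in the order given."""
    merged = {}
    for layer in layers:
        for event, entries in layer.items():
            merged.setdefault(event, []).extend(entries)
    return merged


# Hypothetical user-level table: a briefing declared globally and then forgotten.
user_layer = {"session_start": [{"command": "python ~/bin/global_brief.py"}]}

# The project-layer table from this lesson (abridged).
project_layer = {
    "session_start": [{"command": "python scripts/hooks/session_brief.py"}],
    "post_tool_use": [{"matcher": "write",
                       "command": "python scripts/hooks/on_transform_change.py"}],
}

effective = merge_hook_layers(user_layer, project_layer)
for event, entries in effective.items():
    print(event, [e["command"] for e in entries])
```

In this model the user layer never mentions `post_tool_use`, so the contract gate survives untouched, while `session_start` ends up with two briefings.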

Watch this space as of 2026-06

Codex hooks stabilized at v0.124; earlier releases shipped them behind a flag with a narrower event list. Pin your CLI version and recheck quarterly.

The validation suite is yours

The hook is nine lines of configuration; the judgment lives in your own validation suite, versioned with the project like any other methodology. The contract checks — expected schema, null-rate ceilings, row-delta bounds — are statistics, so they come in both of the lab’s languages:

Python

import sys
import duckdb

EXPECTED = {"airport_fee": "DOUBLE", "tpep_pickup_datetime": "TIMESTAMP"}
NULL_CEILING = {"airport_fee": 0.05}    # the JFK signal must stay visible
ROW_BOUNDS = (2_000_000, 4_800_000)     # a plausible yellow-cab month

con = duckdb.connect("warehouse.duckdb", read_only=True)
cols = {name: dtype for name, dtype, *_ in con.sql("DESCRIBE trips_raw").fetchall()}
failures = [f"column {c}: expected {t}, found {cols.get(c)}"
            for c, t in EXPECTED.items() if cols.get(c) != t]

LATEST = "month = (SELECT max(month) FROM trips_raw)"
for col, ceiling in NULL_CEILING.items():
    if col in cols:
        rate = con.sql(
            f"SELECT avg(({col} IS NULL)::INT) FROM trips_raw WHERE {LATEST}"
        ).fetchone()[0]
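        # avg() over zero rows returns NULL (None); the row-bounds check below reports an empty month
        if rate is not None and rate > ceiling: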
            failures.append(f"{col}: {rate:.1%} null in latest month "
                            f"(ceiling {ceiling:.0%})")

n = con.sql(f"SELECT count(*) FROM trips_raw WHERE {LATEST}").fetchone()[0]
if not ROW_BOUNDS[0] <= n <= ROW_BOUNDS[1]:
    failures.append(f"latest month has {n:,} rows — outside {ROW_BOUNDS}")

if failures:
    print("FAILED trips_raw:\n  " + "\n  ".join(failures), file=sys.stderr)
    sys.exit(2)  # the blocking verdict: reject the write, report to the agent


R

library(duckdb)

expected <- c(airport_fee = "DOUBLE", tpep_pickup_datetime = "TIMESTAMP")
null_ceiling <- c(airport_fee = 0.05)   # the JFK signal must stay visible
row_bounds <- c(2e6, 4.8e6)             # a plausible yellow-cab month

con <- dbConnect(duckdb(), "warehouse.duckdb", read_only = TRUE)
cols <- dbGetQuery(con, "DESCRIBE trips_raw")
types <- setNames(cols$column_type, cols$column_name)

failures <- character(0)
got <- types[names(expected)]
bad <- names(expected)[is.na(got) | got != expected]
if (length(bad) > 0)
  failures <- c(failures, sprintf("column %s: expected %s, found %s",
                                  bad, expected[bad], got[bad]))

latest <- "month = (SELECT max(month) FROM trips_raw)"
for (col in names(null_ceiling)) {
  if (col %in% names(types)) {
    rate <- dbGetQuery(con, sprintf(
      "SELECT avg((%s IS NULL)::INT) FROM trips_raw WHERE %s", col, latest))[[1]]
    if (!is.na(rate) && rate > null_ceiling[[col]])
      failures <- c(failures,
                    sprintf("%s: %.1f%% null in latest month", col, 100 * rate))
  }
}

n <- dbGetQuery(con, sprintf(
  "SELECT count(*) FROM trips_raw WHERE %s", latest))[[1]]
if (n < row_bounds[1] || n > row_bounds[2])
  failures <- c(failures, sprintf("latest month has %s rows — out of bounds",
                                  format(n, big.mark = ",")))

if (length(failures) > 0) {
  message("FAILED trips_raw:\n  ", paste(failures, collapse = "\n  "))
  quit(status = 2) # the blocking verdict: reject the write, report to the agent
}

Note what the ceilings encode: not “no nulls” (real taxi data has nulls) but “the airport fee cannot vanish.” Contracts are research judgments with teeth — the rationale lines from C1’s skills, promoted to law.

The review bench

Before the machinery does it for you, sit at the data manager’s desk once. Below are the two parquet footers from the Pain vignette — the January file that ran green and the February file that just arrived. Somewhere in that listing is the drift. You get three filings; the bench keeps score honestly, and missing is instructive — you will know exactly what you are asking a hook to never miss.

Review bench: The February footers, side by side

Two parquet footers: January 2023 (loaded last month, ran green) and the February file that just arrived. One of these lines is eleven days of silent drift. Mark the suspect line and file your suspicion.

$ python scripts/probe_schemas.py --months 2023-01 2023-02 (footer range reads, ~64 KB each — no downloads)


Now watch the law fire. The guided run below replays the February incident with the gate installed: the ingest reports success, the hook reads the warehouse, and the rename that cost you eleven days in the Pain vignette is caught in roughly twenty seconds.

Schema drift, caught in the wild — TLC renamed airport_fee → Airport_fee between the 2023-01 and 2023-02 yellow files (dtype unchanged: double). A naïve name-based merge silently NULLs the airport signal for every later row.

Guided Run — Saved by the Hook


Field Assignment

Artifact make check-c2 passes

Put the cleaning pipeline under contract, then prove the contract holds. You will install the four recipes, run C1’s cleaning procedure across all 24 months, and walk into the planted bug on purpose: the real TLC schema drift in data/raw/yellow_2023-02.parquet — the casing flip, one month into the working window.

Claude Code

Commit the four-recipe hooks block to the project layer of .claude/settings.json, alongside scripts/hooks/ and the validation suite.
Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
Run /clean-trips across all 24 months and produce the filter cascade table — row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then make check-c2.

Codex

Commit the four-recipe [hooks] table (or hooks.json) to the project layer, alongside scripts/hooks/ and the validation suite.
Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
Run $clean-trips across all 24 months and produce the filter cascade table — row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then make check-c2.

make check-c2 verifies three artifacts: the contracts pass over all 24 months, the cascade table exists with monotonically non-increasing counts, and the incident log records the drift. This is what C3 inherits — the warehouse panel is only worth plumbing because every write that built it passed this gate.

The filter cascade, with receipts: 10,129,347 raw rows → 9,961,866 clean across the three months, with each documented filter's removal counted and shown.

step	id	label	rationale	rows_remaining	rows_removed
0	raw	All rows in the three monthly files	∅	10,129,347	0
1	in_month	Pickup inside the file's own calendar month	TLC monthly files contain stray rows dated years away (meter clock faults); they would land in the wrong panel cell or outside the panel.	10,129,258	89
2	fare_nonneg	fare_amount >= 0	Negative fares are refunds, disputes, or voided meters -- not demand.	9,968,476	160,782
3	duration_pos	dropoff strictly after pickup	Zero- and negative-duration trips are meter artifacts; they cannot be completed trips and corrupt every speed or duration measure.	9,965,415	3,061
4	speed_le_65	implied speed <= 65 mph	distance/duration above highway speed is physically implausible inside NYC; odometer or timestamp errors.	9,961,866	3,549
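The two properties a cascade table must satisfy, non-increasing counts and removals that account for every lost row, can be checked mechanically. The sketch below does so for the figures in the table above; `cascade_problems` is a hypothetical helper, not necessarily what make check-c2 runs.

```python
CASCADE = [  # (id, rows_remaining, rows_removed), copied from the table above
    ("raw", 10_129_347, 0),
    ("in_month", 10_129_258, 89),
    ("fare_nonneg", 9_968_476, 160_782),
    ("duration_pos", 9_965_415, 3_061),
    ("speed_le_65", 9_961_866, 3_549),
]


def cascade_problems(rows):
    """Return a list of messages for steps whose counts rise or do not add up."""
    problems = []
    for (_, prev_left, _), (step_id, left, removed) in zip(rows, rows[1:]):
        if left > prev_left:
            problems.append(f"{step_id}: count rose from {prev_left:,} to {left:,}")
        elif prev_left - left != removed:
            problems.append(f"{step_id}: rows_removed {removed:,} does not match "
                            f"{prev_left:,} - {left:,}")
    return problems


print(cascade_problems(CASCADE))  # an empty list means the table is internally consistent
```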

Milestone gate · make check-c2 · advances C2

All four hook recipes installed in both dialects and committed with scripts/hooks/
Raw-write guard, transform-change validation, session briefing, row-delta bounds.
The cleaning pipeline ran across all 24 months under the contract gate
The filter cascade table exists with monotonically non-increasing counts
Counts surviving each documented filter — the report's data appendix starts here.
The 2023-02 re-ingest was BLOCKED by the contract hook (exit 2), not warned about
The hook must block on failure; a warning is the bug this lesson exists to kill.
journal/2023-02-drift.md records cause, detection, and time-to-detection

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both] 〜〜

Hooks that warn instead of block get ignored — by you and by the agent. A schema-drift warning scrolled past at 2 a.m. is how airport_fee becomes 100% null and your estimates quietly lose the airport zones. Integrity gates must block: exit 2 is the only verdict that protects the panel.
[both] 〜〜

Validation in the skill is not validation as a hook. C1’s cleaning skill carries the same checks, but a skill runs when invoked and the months you ingest in a hurry are exactly the months nobody invokes it. The skill encodes the procedure; the hook removes the option of skipping it.
[both]

A full test suite on every edit makes the agent glacial — it will pay your contract suite’s runtime on a README typo. Scope tightly: matchers narrow the tool, and the first line of your hook script should be the cheap path-check that exits 0.
[CX]

Know which config layer owns hooks. Project and user layers merge per event, so the gate you think you’re testing may be accompanied by a global entry you declared three projects ago — and the merged order is defined, not intuitive. Audit the effective table, not the file you happen to have open.

Check Your Bearings

C2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

Log the incident: what failed, what caught it, time-to-detection — and what the same bug cost you the last time nothing was watching.

as of June 2026

Hooks are a genuine parity feature: both tools speak the same event vocabulary (PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop) and the same exit-code verdicts, differing only in configuration surface — settings layers on one side, TOML tables on the other, stable in Codex since v0.124. The real asymmetry is at the top end: Claude Code additionally offers prompt- and agent-based hooks, where the verdict comes from a model judging the action rather than a script matching it. Codex has no equivalent yet.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall — the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster — but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes — the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated — the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.

