The Pain
It’s week six. The February file arrived on schedule, the download script
verified its checksum, and the ingest ran green: 2.9 million rows, exit
code zero. Somewhere inside that file, a column had changed its name —
by a single capital letter. Your pipeline, asked for airport_fee and
finding nothing under that spelling, did what pipelines do — filled the
column with nulls and moved on. Nothing crashed.
The dashboard refreshed. The elasticity estimates updated, a little
smoother than before, because the panel had quietly lost JFK and
LaGuardia — the two places where weather bites hardest.
For eleven days the numbers drifted. You presented on a Tuesday. Your adviser asked, mildly, why the airport zones had gone dark in the heterogeneity table, and you heard yourself say you’d look into it, in the voice of someone whose next three evenings are already spoken for. The post-mortem was short and humiliating: the rename was documented. It sat in a release note the whole time.
A real lab keeps a person for this. The data manager re-checks every load before it counts, trusts no file twice, and reads release notes the morning they appear. The role demands nothing brilliant — only that someone is always looking. Your lab does not have that person. Your lab has you, and that week you were busy being the methodologist, and the week before that, the RA. The vigilance isn’t difficult. It’s permanent. And permanence is exactly what you cannot staff.
Why / When
A hook is a rule that runs automatically at a fixed point in the agent’s working loop — before a tool call, after a write, at the start of a session — and can block what it doesn’t like. This completes the unit’s taxonomy:
| Mechanism | Nature | Runs |
|---|---|---|
| instruction file | always-on context | every session, as advice |
| skill | on-demand procedure | when invoked |
| hook | enforced rule | always, mechanically |
The distinction that matters is prompted versus enforced. A prompted guideline — “please validate schemas before writing” — is advice the model weighs against everything else it is juggling, and under pressure advice loses. C1’s cleaning skill carries your validation logic, but a skill runs when someone remembers to invoke it. A hook runs every time, on a trigger you chose, with the authority to reject the result. Advice versus law.
In the research pipeline, hooks guard the ingestion and transformation stages — the places where silent corruption enters. The lab role they absorb is the data manager: permanent vigilance replaced by a few lines of configuration that run on every write, at 2 p.m. or 2 a.m., tired or not.
Contrary winds
Not for: one-off scripts you'll delete tomorrow — a hook outlives the session, so don't legislate throwaway work.
Mechanics
Both tools implement the same hook model with different configuration surfaces. The shared mechanics first; the dialects below.
The hook model
A hook binds three things: an event in the agent’s lifecycle, an optional matcher narrowing which tool calls it applies to, and a command — your script, which receives a JSON description of the tool call on stdin and answers with an exit code. The event vocabulary is the same in both tools:
- PreToolUse — fires before a tool call runs; can veto it.
- PostToolUse — fires after; can reject the result and report why.
- SessionStart — fires when a session opens; its stdout becomes context the agent reads (a briefing, not a gate).
- SubagentStop / Stop — fire when a subagent or the session winds down; the natural place for end-of-run audits.
The agent is mid-loop, working on the cleaning pipeline. The rules it works under are not in its head — they are a hooks block committed to .claude/settings.json, so they arrive with the clone.
    {
      "hooks": {
        "PostToolUse": [
          { "matcher": "Write|Edit",
            "hooks": [{ "type": "command", "command": "python scripts/hooks/on_transform_change.py" }] }
        ]
      }
    }

Blocking vs advisory
Exit codes are verdicts. Exit 0 is a pass: silence, the work proceeds. Exit 2 is the blocking verdict: before the tool runs, the call is stopped cold; after it runs, the agent is halted on the spot — the write happened, but it cannot be built on until the failure is addressed — and everything your script printed to stderr is fed back into the agent’s context, so the gate and the engineer talk to each other. Any other non-zero exit is advisory: surfaced to you, ignored by the machinery. The C2 rule of thumb: integrity gates block; everything else is a comment.
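The verdict protocol can be sketched as a small PreToolUse guard. This is a minimal illustration, not the course's guard_raw.py: the names `verdict`, `run`, and `RAW_DIR` are hypothetical, and the sketch assumes the edited path arrives under `tool_input.file_path` as described in this lesson.

```python
import io
import json
import sys
from pathlib import Path

RAW_DIR = Path("data/raw")  # assumption: the protected directory, relative to the repo root


def verdict(event: dict, root: Path) -> tuple[int, str]:
    """Judge one tool-call description; return (exit_code, stderr_message)."""
    file_path = event.get("tool_input", {}).get("file_path")
    if not file_path:
        return 0, ""  # no path to judge: pass silently
    target = (root / file_path).resolve()  # resolve() also collapses ".." segments
    raw = (root / RAW_DIR).resolve()
    if target == raw or raw in target.parents:
        return 2, f"BLOCKED: {file_path} is under {RAW_DIR.as_posix()}/, which is read-only"
    return 0, ""


def run(stdin) -> int:
    """Read the tool-call JSON from a stream and return the exit code to use."""
    code, message = verdict(json.load(stdin), Path.cwd())
    if message:
        print(message, file=sys.stderr)  # stderr is what the agent gets to read
    return code


if __name__ == "__main__":
    # Demonstration on a hand-written event; a real hook would instead end
    # with sys.exit(run(sys.stdin)).
    fake = io.StringIO(json.dumps(
        {"tool_name": "Write",
         "tool_input": {"file_path": "data/raw/yellow_2023-02.parquet"}}))
    print("exit code:", run(fake))
```

Keeping the decision in a pure function (`verdict`) and the stdin/exit plumbing in `run` means the rule can be unit-tested without launching an agent session.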
Where hooks live
In Claude Code, hooks are a hooks block in .claude/settings.json
(user, project, and local layers merge). In Codex, they live in
hooks.json at the repo root or the [hooks] table of config.toml,
stable since v0.124 (as of June 2026). Same events, same verdicts —
only the surface differs.
Four research-native recipes
The same four rules, one per event pattern, in either dialect. No linter examples here — these guard data:
- Raw is read-only: PreToolUse blocks any write into data/raw/ (B3’s permission rule, doubled at a different layer; belt and suspenders are a research instrument).
- Transforms are contracted: PostToolUse runs the validation suite whenever anything under src/transforms/ changes.
- Sessions open with a briefing: SessionStart prints the data version hash and per-table row counts, so every session starts knowing what the warehouse holds.
- Row deltas are bounded: PostToolUse checks that a warehouse write changed row counts by a plausible amount; a month of yellow cabs is millions of rows, not forty and not forty million.
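The session briefing is small enough to sketch in full. The function below is an illustration under stated assumptions: `briefing` is a hypothetical name, the version hash is taken to be the SHA-256 of a checksum file, and the row counts are passed in as a plain dict rather than queried from the warehouse. The two counts in the demonstration are the February and March 2023 figures quoted later in this lesson.

```python
import hashlib
import tempfile
from pathlib import Path


def briefing(sums_file: Path, row_counts: dict[str, int]) -> str:
    """Build the text a SessionStart hook would print to stdout."""
    digest = hashlib.sha256(sums_file.read_bytes()).hexdigest()
    lines = [f"data version: sha256 {digest[:12]} of {sums_file.name}"]
    lines += [f"{month}: {count:,} rows" for month, count in sorted(row_counts.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in checksum file so the sketch runs anywhere; a real hook would
    # point at the project's own checksum file and query the warehouse.
    with tempfile.TemporaryDirectory() as tmp:
        sums = Path(tmp) / "SHA256SUMS"
        sums.write_text("placeholder checksum lines\n")
        print(briefing(sums, {"2023-02": 2_913_955, "2023-03": 3_403_766}))
```

Because a briefing is context rather than a gate, the script only prints; it has no reason to exit non-zero.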
Claude Code
All four recipes are one hooks block in the project layer of
.claude/settings.json — committed, so the rules arrive with the clone:
    {
      "hooks": {
        "PreToolUse": [
          { "matcher": "Write|Edit",
            "hooks": [{ "type": "command", "command": "python scripts/hooks/guard_raw.py" }] }
        ],
        "PostToolUse": [
          { "matcher": "Write|Edit",
            "hooks": [{ "type": "command", "command": "python scripts/hooks/on_transform_change.py" }] },
          { "matcher": "Bash",
            "hooks": [{ "type": "command", "command": "python scripts/hooks/bound_row_deltas.py" }] }
        ],
        "SessionStart": [
          { "hooks": [{ "type": "command", "command": "python scripts/hooks/session_brief.py" }] }
        ]
      }
    }

The walkthrough, recipe by recipe:
- guard_raw.py (recipe ①) reads the tool-call JSON from stdin, extracts tool_input.file_path, and exits 2 with a one-line stderr if the path resolves under data/raw/. The matcher "Write|Edit" scopes it to file-editing tools; shell writes are already fenced by B3’s permission rules: two layers, deliberately.
- on_transform_change.py (recipe ②) checks whether the edited path is under src/transforms/ and, if so, runs scripts/validate_contracts.py (the suite below) against the warehouse. Anything else exits 0 immediately; the scoping lives in the script, so ls never pays for it.
- session_brief.py (recipe ③) prints the digest of data/raw/SHA256SUMS and the result of SELECT month, count(*) FROM trips_raw GROUP BY 1 to stdout. SessionStart stdout becomes context: the agent starts each session knowing exactly which data version it is standing on.
- bound_row_deltas.py (recipe ④) compares post-write row counts against a ledger kept in results/row_ledger.json and exits 2 when a delta falls outside per-table bounds.
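The ledger comparison in recipe ④ can be sketched as a pure function. Everything here is an assumption made for illustration: the ledger is modeled as a flat table-to-count mapping, `DELTA_BOUNDS` and `check_deltas` are hypothetical names, and the bounds reuse the plausible yellow-cab month from this lesson. The lesson does not specify the actual layout of results/row_ledger.json.

```python
# Hypothetical per-table bounds on how many rows a single write may add.
DELTA_BOUNDS = {"trips_raw": (2_000_000, 4_800_000)}


def check_deltas(ledger: dict[str, int], current: dict[str, int],
                 bounds: dict[str, tuple[int, int]]) -> list[str]:
    """Compare current row counts with the ledger; one message per violation."""
    failures = []
    for table, (low, high) in bounds.items():
        delta = current.get(table, 0) - ledger.get(table, 0)
        if delta == 0:
            continue  # this command did not change the table: nothing to judge
        if not low <= delta <= high:
            failures.append(
                f"{table}: row delta {delta:+,} outside [{low:,}, {high:,}]")
    return failures


if __name__ == "__main__":
    before = {"trips_raw": 3_000_000}  # made-up ledger entry
    # A full February load (2,913,955 rows) passes; a forty-row load does not.
    print(check_deltas(before, {"trips_raw": 3_000_000 + 2_913_955}, DELTA_BOUNDS))
    print(check_deltas(before, {"trips_raw": 3_000_040}, DELTA_BOUNDS))
```

A hook built on this would exit 2 and print the messages to stderr whenever the returned list is non-empty, and update the ledger only after a pass. Skipping zero deltas is a design choice in this sketch: the hook fires after every shell command, and most of them never touch the warehouse.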
One realistic extension, and it is Claude Code’s alone: hooks whose verdict comes from a model rather than a script. A prompt-based hook can gate on judgment calls — “does this edit change the sample definition?” — where no regex can. Treat them as a preview of D4’s referee: enforcement for rules, adversarial judgment for everything rules can’t name.
Codex
The same four recipes declare in hooks.json at the repo root, or
equivalently in the [hooks] table of config.toml. The TOML form,
in the project-layer team config so the rules arrive with the clone:
    [hooks]
    session_start = [
      { command = "python scripts/hooks/session_brief.py" },
    ]
    pre_tool_use = [
      { matcher = "write", command = "python scripts/hooks/guard_raw.py" },
    ]
    post_tool_use = [
      { matcher = "write", command = "python scripts/hooks/on_transform_change.py" },
      { matcher = "shell", command = "python scripts/hooks/bound_row_deltas.py" },
    ]

The event vocabulary is identical to Claude Code’s (PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop), written in TOML’s snake_case. The verdict protocol is identical too: the hook command reads the tool-call JSON on stdin, exit 0 passes, exit 2 blocks and returns stderr to the agent. The four scripts are the same files; the dialect difference is confined entirely to this table.
- guard_raw.py (recipe ①) vetoes file writes under data/raw/; the matcher = "write" entry scopes it to file-editing tools, while shell stays fenced by the B3 sandbox profile.
- on_transform_change.py (recipe ②) self-scopes to src/transforms/ paths and runs the contract suite on a match.
- session_brief.py (recipe ③) prints the data-version digest and row counts; session_start stdout is read into the agent’s context.
- bound_row_deltas.py (recipe ④) bounds warehouse row deltas after shell writes.
The extension that matters here is layering: know which config
layer owns hooks. Project-layer declarations and your personal
~/.codex/config.toml merge per event — entries from every layer
run. A repo-shipped contract gate therefore cannot be silently disabled
by a user-level table that never mentions post_tool_use; but the
reverse also holds, and the session_start briefing you forgot you
declared globally will follow you into every project until you go
looking for it.
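The per-event merge rule is easy to misread, so here is a toy model of it. This is not Codex's implementation: `merge_hook_layers` is a hypothetical helper, the global briefing command is invented, and the order of merged entries simply follows argument order, which may not match the tool's defined order.

```python
Layer = dict[str, list[dict[str, str]]]


def merge_hook_layers(*layers: Layer) -> Layer:
    """Merge hook tables per event: every entry from every layer is kept."""
    merged: Layer = {}
    for layer in layers:
        for event, entries in layer.items():
            merged.setdefault(event, []).extend(entries)
    return merged


if __name__ == "__main__":
    project = {
        "session_start": [{"command": "python scripts/hooks/session_brief.py"}],
        "post_tool_use": [{"matcher": "write",
                           "command": "python scripts/hooks/on_transform_change.py"}],
    }
    # Hypothetical user-level table that never mentions post_tool_use.
    user = {"session_start": [{"command": "python ~/bin/global_brief.py"}]}

    for event, entries in merge_hook_layers(user, project).items():
        print(event, [entry["command"] for entry in entries])
```

In the merged result the project's contract gate is still present, and the forgotten global briefing runs alongside the project's own: both halves of the paragraph above in one table.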
Codex hooks stabilized at v0.124; earlier releases shipped them behind a flag with a narrower event list. Pin your CLI version and recheck quarterly.
The validation suite is yours
The hook is nine lines of configuration; the judgment lives in your own validation suite, versioned with the project like any other methodology. The contract checks — expected schema, null-rate ceilings, row-delta bounds — are statistics, so they come in both of the lab’s languages:
Python
    import sys

    import duckdb

    EXPECTED = {"airport_fee": "DOUBLE", "tpep_pickup_datetime": "TIMESTAMP"}
    NULL_CEILING = {"airport_fee": 0.05}   # the JFK signal must stay visible
    ROW_BOUNDS = (2_000_000, 4_800_000)    # a plausible yellow-cab month

    con = duckdb.connect("warehouse.duckdb", read_only=True)
    # DESCRIBE returns (column_name, column_type, ...) per column.
    cols = {name: dtype for name, dtype, *_ in con.sql("DESCRIBE trips_raw").fetchall()}
    failures = [f"column {c}: expected {t}, found {cols.get(c)}"
                for c, t in EXPECTED.items() if cols.get(c) != t]

    LATEST = "month = (SELECT max(month) FROM trips_raw)"
    for col, ceiling in NULL_CEILING.items():
        if col in cols:
            rate = con.sql(
                f"SELECT avg(({col} IS NULL)::INT) FROM trips_raw WHERE {LATEST}"
            ).fetchone()[0]
            # rate is None when the table is empty; the row-bound check below reports that case.
            if rate is not None and rate > ceiling:
                failures.append(f"{col}: {rate:.1%} null in latest month "
                                f"(ceiling {ceiling:.0%})")

    n = con.sql(f"SELECT count(*) FROM trips_raw WHERE {LATEST}").fetchone()[0]
    if not ROW_BOUNDS[0] <= n <= ROW_BOUNDS[1]:
        failures.append(f"latest month has {n:,} rows, outside {ROW_BOUNDS}")

    if failures:
        print("FAILED trips_raw:\n " + "\n ".join(failures), file=sys.stderr)
        sys.exit(2)  # the blocking verdict: reject the write, report to the agent

This block is orchestration, not statistics; it’s the same in R. Ask the agent to translate (Lesson A1).
R
    library(duckdb)

    expected     <- c(airport_fee = "DOUBLE", tpep_pickup_datetime = "TIMESTAMP")
    null_ceiling <- c(airport_fee = 0.05)  # the JFK signal must stay visible
    row_bounds   <- c(2e6, 4.8e6)          # a plausible yellow-cab month

    con   <- dbConnect(duckdb(), "warehouse.duckdb", read_only = TRUE)
    cols  <- dbGetQuery(con, "DESCRIBE trips_raw")
    types <- setNames(cols$column_type, cols$column_name)

    failures <- character(0)
    got <- types[names(expected)]
    bad <- names(expected)[is.na(got) | got != expected]
    if (length(bad) > 0)
      failures <- c(failures, sprintf("column %s: expected %s, found %s",
                                      bad, expected[bad], got[bad]))

    latest <- "month = (SELECT max(month) FROM trips_raw)"
    for (col in names(null_ceiling)) {
      if (col %in% names(types)) {
        rate <- dbGetQuery(con, sprintf(
          "SELECT avg((%s IS NULL)::INT) FROM trips_raw WHERE %s", col, latest))[[1]]
        # rate is NA when the table is empty; the row-bound check below reports that case.
        if (!is.na(rate) && rate > null_ceiling[[col]])
          failures <- c(failures,
                        sprintf("%s: %.1f%% null in latest month", col, 100 * rate))
      }
    }

    n <- dbGetQuery(con, sprintf(
      "SELECT count(*) FROM trips_raw WHERE %s", latest))[[1]]
    if (n < row_bounds[1] || n > row_bounds[2])
      failures <- c(failures, sprintf("latest month has %s rows, out of bounds",
                                      format(n, big.mark = ",")))

    if (length(failures) > 0) {
      message("FAILED trips_raw:\n ", paste(failures, collapse = "\n "))
      quit(status = 2)  # the blocking verdict: reject the write, report to the agent
    }

Note what the ceilings encode: not “no nulls” (real taxi data has nulls) but “the airport fee cannot vanish.” Contracts are research judgments with teeth: the rationale lines from C1’s skills, promoted to law.
The review bench
Before the machinery does it for you, sit at the data manager’s desk once. Below are the two parquet footers from the Pain vignette — the January file that ran green and the February file that just arrived. Somewhere in that listing is the drift. You get three filings; the bench keeps score honestly, and missing is instructive — you will know exactly what you are asking a hook to never miss.
Two parquet footers: January 2023 (loaded last month, ran green) and the February file that just arrived. One of these lines is eleven days of silent drift. Mark the suspect line and file your suspicion.
$ python scripts/probe_schemas.py --months 2023-01 2023-02 (footer range reads, ~64 KB each — no downloads)
Now watch the law fire. The guided run below replays the February incident with the gate installed: the ingest reports success, the hook reads the warehouse, and the rename that cost you eleven days in the Pain vignette is caught in roughly twenty seconds.
the numbers behind this figure
remote_footer_probe: 26 months probed → out/schema_probe.json
(pyarrow ParquetFile over HTTP range reads, src/probe_schemas.py; ~64 KB/month, no full downloads)

Honesty note: REAL drift. The flip is 2023-01 → 2023-02, not the course's 2024-03 placeholder; our local 2024-02 and 2024-03 files both already carry Airport_fee. Null rates for 2022-12 (3.7%) and 2023-01 (2.3%) come from parquet footer statistics; the 100% bars for 2023-02/03 are definitional (the lowercase column is absent from those files, so a name-based merge yields NULL for all 2,913,955 / 3,403,766 rows). Nothing reconstructed.
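The drift in this incident is a rename that differs only by letter case, which is cheap to detect once two column lists are in hand. The sketch below is one way to do it, not the course's probe script: `case_only_renames` is a hypothetical helper, and the two column lists are abbreviated to names that appear in this lesson rather than the full TLC schema.

```python
def case_only_renames(old_cols: list[str], new_cols: list[str]) -> list[tuple[str, str]]:
    """Return (old, new) pairs whose names differ only by letter case."""
    new_by_folded = {name.casefold(): name for name in new_cols}
    renames = []
    for old in old_cols:
        new = new_by_folded.get(old.casefold())
        if new is not None and new != old:
            renames.append((old, new))
    return renames


if __name__ == "__main__":
    # Abbreviated schemas: only columns named in this lesson.
    january = ["tpep_pickup_datetime", "airport_fee"]
    february = ["tpep_pickup_datetime", "Airport_fee"]
    print(case_only_renames(january, february))
```

A name-based merge treats the two spellings as unrelated columns, which is why the lowercase column comes back entirely NULL; comparing case-folded names surfaces the pair directly and gives the rename map something concrete to record.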
Guided Run — Saved by the Hook
Field Assignment
Artifact: make check-c2 passes
Put the cleaning pipeline under contract, then prove the contract holds.
You will install the four recipes, run C1’s cleaning procedure across
all 24 months, and walk into the planted bug on purpose: the real TLC
schema drift in data/raw/yellow_2023-02.parquet — the casing flip,
one month into the working window.
Claude Code
- Commit the four-recipe hooks block to the project layer of .claude/settings.json, alongside scripts/hooks/ and the validation suite.
- Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
- Run /clean-trips across all 24 months and produce the filter cascade table: row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
- When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
- File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then run make check-c2.
Codex
- Commit the four-recipe [hooks] table (or hooks.json) to the project layer, alongside scripts/hooks/ and the validation suite.
- Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
- Run $clean-trips across all 24 months and produce the filter cascade table: row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
- When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
- File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then run make check-c2.
make check-c2 verifies three artifacts: the contracts pass over all
24 months, the cascade table exists with monotonically non-increasing
counts, and the incident log records the drift. This is what C3
inherits — the warehouse panel is only worth plumbing because every
write that built it passed this gate.
the numbers behind this figure
filter_cascade
SELECT * FROM filter_cascade ORDER BY step

| step | id | label | rationale | rows_remaining | rows_removed |
|---|---|---|---|---|---|
| 0 | raw | All rows in the three monthly files | ∅ | 10,129,347 | 0 |
| 1 | in_month | Pickup inside the file's own calendar month | TLC monthly files contain stray rows dated years away (meter clock faults); they would land in the wrong panel cell or outside the panel. | 10,129,258 | 89 |
| 2 | fare_nonneg | fare_amount >= 0 | Negative fares are refunds, disputes, or voided meters -- not demand. | 9,968,476 | 160,782 |
| 3 | duration_pos | dropoff strictly after pickup | Zero- and negative-duration trips are meter artifacts; they cannot be completed trips and corrupt every speed or duration measure. | 9,965,415 | 3,061 |
| 4 | speed_le_65 | implied speed <= 65 mph | distance/duration above highway speed is physically implausible inside NYC; odometer or timestamp errors. | 9,961,866 | 3,549 |
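The monotonicity requirement on the cascade can be checked in a few lines. This is a sketch of one possible check, not what make check-c2 actually runs: `cascade_problems` is a hypothetical name, and the second test (that rows_removed equals the drop in rows_remaining) is an extra consistency check added here for illustration. The numbers are copied from the table above.

```python
# (step, id, rows_remaining, rows_removed), copied from the filter_cascade table.
CASCADE = [
    (0, "raw", 10_129_347, 0),
    (1, "in_month", 10_129_258, 89),
    (2, "fare_nonneg", 9_968_476, 160_782),
    (3, "duration_pos", 9_965_415, 3_061),
    (4, "speed_le_65", 9_961_866, 3_549),
]


def cascade_problems(rows: list[tuple[int, str, int, int]]) -> list[str]:
    """Report steps where counts rise or the bookkeeping does not add up."""
    problems = []
    for (_, _, prev_left, _), (step, name, left, removed) in zip(rows, rows[1:]):
        if left > prev_left:
            problems.append(
                f"step {step} ({name}): count rose from {prev_left:,} to {left:,}")
        if prev_left - left != removed:
            problems.append(
                f"step {step} ({name}): rows_removed {removed:,} "
                f"but the count fell by {prev_left - left:,}")
    return problems


if __name__ == "__main__":
    print(cascade_problems(CASCADE) or "cascade is consistent")
```

A filter can only remove rows, so any rise in rows_remaining means a step was reordered, re-run against different input, or mislabeled.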
make check-c2 advances C2
- Raw-write guard, transform-change validation, session briefing, row-delta bounds.
- Counts surviving each documented filter: the report's data appendix starts here.
- The hook must block on failure; a warning is the bug this lesson exists to kill.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both] Hooks that warn instead of block get ignored, by you and by the agent. A schema-drift warning scrolled past at 2 a.m. is how airport_fee becomes 100% null and your estimates quietly lose the airport zones. Integrity gates must block: exit 2 is the only verdict that protects the panel.
- [both] Validation in the skill is not validation as a hook. C1’s cleaning skill carries the same checks, but a skill runs when invoked, and the months you ingest in a hurry are exactly the months nobody invokes it. The skill encodes the procedure; the hook removes the option of skipping it.
- [both] A full test suite on every edit makes the agent glacial: it will pay your contract suite’s runtime on a README typo. Scope tightly: matchers narrow the tool, and the first line of your hook script should be the cheap path-check that exits 0.
- [CX] Know which config layer owns hooks. Project and user layers merge per event, so the gate you think you’re testing may be accompanied by a global entry you declared three projects ago, and the merged order is defined, not intuitive. Audit the effective table, not the file you happen to have open.
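The scoping advice can be sketched as a hook whose first move is the cheap path check, with the expensive suite reached only on a match. Names here are hypothetical (`needs_validation`, `run_suite`, `hook_verdict`); the suite path scripts/validate_contracts.py and the src/transforms/ scope come from this lesson, and the suite is injectable so the demonstration runs without a warehouse.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional

TRANSFORMS_DIR = Path("src/transforms")  # the scope named in this lesson


def needs_validation(file_path: Optional[str], root: Path) -> bool:
    """Cheap gate: True only when the edited file sits under src/transforms/."""
    if not file_path:
        return False
    target = (root / file_path).resolve()
    return (root / TRANSFORMS_DIR).resolve() in target.parents


def run_suite() -> tuple[int, str]:
    """Run the contract suite as a child process; return (exit code, stderr)."""
    done = subprocess.run([sys.executable, "scripts/validate_contracts.py"],
                          capture_output=True, text=True)
    return done.returncode, done.stderr


def hook_verdict(event: dict, root: Path, suite=run_suite) -> tuple[int, str]:
    file_path = event.get("tool_input", {}).get("file_path")
    if not needs_validation(file_path, root):
        return 0, ""  # cheap path: a README typo never pays for the suite
    return suite()


if __name__ == "__main__":
    calls = []

    def fake_suite() -> tuple[int, str]:
        # Stand-in for the real suite so the sketch runs without a warehouse.
        calls.append(1)
        return 2, "FAILED trips_raw: column airport_fee: expected DOUBLE, found None"

    root = Path.cwd()
    print(hook_verdict({"tool_input": {"file_path": "README.md"}}, root, fake_suite))
    print(hook_verdict({"tool_input": {"file_path": "src/transforms/standardize.py"}},
                       root, fake_suite))
    print("suite runs:", len(calls))
```

The matcher in the configuration narrows which tool triggers the hook; the path check narrows which files are worth the suite's runtime. Both filters are needed, because a matcher on file-editing tools still fires for every file.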
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
Parity note
Hooks are a genuine parity feature: both tools speak the same event vocabulary (PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop) and the same exit-code verdicts, differing only in configuration surface — settings layers on one side, TOML tables on the other, stable in Codex since v0.124. The real asymmetry is at the top end: Claude Code additionally offers prompt- and agent-based hooks, where the verdict comes from a model judging the action rather than a script matching it. Codex has no equivalent yet.