
C2 · intermediate · ~60 min

Rules That Enforce Themselves

Absorbs: the data manager / QA who never sleeps

Advances C2

The Pain

It’s week six. The February file arrived on schedule, the download script verified its checksum, and the ingest ran green: 2.9 million rows, exit code zero. Somewhere inside that file, a column had changed its name — by a single capital letter. Your pipeline, asked for airport_fee and finding nothing under that spelling, did what pipelines do — filled the column with nulls and moved on. Nothing crashed. The dashboard refreshed. The elasticity estimates updated, a little smoother than before, because the panel had quietly lost JFK and LaGuardia — the two places where weather bites hardest.

For eleven days the numbers drifted. You presented on a Tuesday. Your adviser asked, mildly, why the airport zones had gone dark in the heterogeneity table, and you heard yourself say you’d look into it, in the voice of someone whose next three evenings are already spoken for. The post-mortem was short and humiliating: the rename was documented. It sat in a release note the whole time.

A real lab keeps a person for this. The data manager re-checks every load before it counts, trusts no file twice, and reads release notes the morning they appear. The role demands nothing brilliant — only that someone is always looking. Your lab does not have that person. Your lab has you, and that week you were busy being the methodologist, and the week before that, the RA. The vigilance isn’t difficult. It’s permanent. And permanence is exactly what you cannot staff.

Why / When

A hook is a rule that runs automatically at a fixed point in the agent’s working loop — before a tool call, after a write, at the start of a session — and can block what it doesn’t like. This completes the unit’s taxonomy:

Mechanism | Nature | Runs
instruction file | always-on context | every session, as advice
skill | on-demand procedure | when invoked
hook | enforced rule | always, mechanically

The distinction that matters is prompted versus enforced. A prompted guideline — “please validate schemas before writing” — is advice the model weighs against everything else it is juggling, and under pressure advice loses. C1’s cleaning skill carries your validation logic, but a skill runs when someone remembers to invoke it. A hook runs every time, on a trigger you chose, with the authority to reject the result. Advice versus law.

In the research pipeline, hooks guard the ingestion and transformation stages — the places where silent corruption enters. The lab role they absorb is the data manager: permanent vigilance replaced by a few lines of configuration that run on every write, at 2 p.m. or 2 a.m., tired or not.

Contrary winds

Not for: one-off scripts you'll delete tomorrow — a hook outlives the session, so don't legislate throwaway work.

Mechanics

Both tools implement the same hook model with different configuration surfaces. The shared mechanics first; the dialects below.

The hook model

A hook binds three things: an event in the agent’s lifecycle, an optional matcher narrowing which tool calls it applies to, and a command — your script, which receives a JSON description of the tool call on stdin and answers with an exit code. The event vocabulary is the same in both tools:

  • PreToolUse — fires before a tool call runs; can veto it.
  • PostToolUse — fires after; can reject the result and report why.
  • SessionStart — fires when a session opens; its stdout becomes context the agent reads (a briefing, not a gate).
  • SubagentStop / Stop — fire when a subagent or the session winds down; the natural place for end-of-run audits.
[Figure: Hook Lifecycle (animated, 7 steps). The agent's tool call runs and returns a result; the PostToolUse hook reads it and rules: exit 0 passes in silence, exit 2 blocks the work and feeds the hook's stderr back into the agent's context.]

The agent is mid-loop, working on the cleaning pipeline. The rules it works under are not in its head — they are a hooks block committed to .claude/settings.json, so they arrive with the clone.

json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
                  "command": "python scripts/hooks/on_transform_change.py" }]
    }]
  }
}

Blocking vs advisory

Exit codes are verdicts. Exit 0 is a pass: silence, the work proceeds. Exit 2 is the blocking verdict: before the tool runs, the call is stopped cold; after it runs, the agent is halted on the spot — the write happened, but it cannot be built on until the failure is addressed — and everything your script printed to stderr is fed back into the agent’s context, so the gate and the engineer talk to each other. Any other non-zero exit is advisory: surfaced to you, ignored by the machinery. The C2 rule of thumb: integrity gates block; everything else is a comment.
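Seen from the tool's side, the verdict rules above can be sketched as follows. This is an illustration of the protocol as described here, not either tool's actual implementation; run_hook and Verdict are hypothetical names.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Verdict:
    status: str    # "pass", "block", or "advisory"
    feedback: str  # the hook's stderr; fed back to the agent only on "block"


def run_hook(command: list[str], tool_call: dict) -> Verdict:
    """Run one hook command and interpret its exit code."""
    proc = subprocess.run(
        command,
        input=json.dumps(tool_call),  # the tool call arrives on stdin as JSON
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return Verdict("pass", "")
    if proc.returncode == 2:
        return Verdict("block", proc.stderr.strip())
    return Verdict("advisory", proc.stderr.strip())


if __name__ == "__main__":
    # A stand-in hook that always blocks, to show the round trip.
    always_block = [
        sys.executable, "-c",
        "import sys; print('airport_fee is missing', file=sys.stderr); sys.exit(2)",
    ]
    print(run_hook(always_block, {"tool_input": {}}))
```

The three-way branch is the whole protocol: only exit 2 turns stderr into something the agent has to answer.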

Where hooks live

In Claude Code, hooks are a hooks block in .claude/settings.json (user, project, and local layers merge). In Codex, they live in hooks.json at the repo root or the [hooks] table of config.toml, stable since v0.124 (as of June 2026). Same events, same verdicts — only the surface differs.

Four research-native recipes

The same four rules, one per event pattern, in either dialect. No linter examples here — these guard data:

  1. Raw is read-only — PreToolUse blocks any write into data/raw/ (B3’s permission rule, doubled at a different layer; belt and suspenders are a research instrument).
  2. Transforms are contracted — PostToolUse runs the validation suite whenever anything under src/transforms/ changes.
  3. Sessions open with a briefing — SessionStart prints the data version hash and per-table row counts, so every session starts knowing what the warehouse holds.
  4. Row deltas are bounded — PostToolUse checks that a warehouse write changed row counts by a plausible amount; a month of yellow cabs is millions of rows, not forty and not forty million.

Claude Code

All four recipes are one hooks block in the project layer of .claude/settings.json — committed, so the rules arrive with the clone:

.claude/settings.json
{
  "hooks": {
    "PreToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
                  "command": "python scripts/hooks/guard_raw.py" }]
    }],
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{ "type": "command",
                  "command": "python scripts/hooks/on_transform_change.py" }]
    }, {
      "matcher": "Bash",
      "hooks": [{ "type": "command",
                  "command": "python scripts/hooks/bound_row_deltas.py" }]
    }],
    "SessionStart": [{
      "hooks": [{ "type": "command",
                  "command": "python scripts/hooks/session_brief.py" }]
    }]
  }
}

The walkthrough, recipe by recipe:

  • guard_raw.py (recipe ①) reads the tool-call JSON from stdin, extracts tool_input.file_path, and exits 2 with a one-line stderr if the path resolves under data/raw/. The matcher "Write|Edit" scopes it to file-editing tools; shell writes are already fenced by B3’s permission rules — two layers, deliberately.
  • on_transform_change.py (recipe ②) checks whether the edited path is under src/transforms/ and, if so, runs scripts/validate_contracts.py (the suite below) against the warehouse. Anything else exits 0 immediately. The matcher already keeps ls and other shell calls away from this hook; the path check inside the script keeps a README edit from paying for the suite.
  • session_brief.py (recipe ③) prints data/raw/SHA256SUMS’s digest and SELECT month, count(*) FROM trips_raw GROUP BY 1 to stdout. SessionStart stdout becomes context: the agent starts each session knowing exactly which data version it is standing on.
  • bound_row_deltas.py (recipe ④) compares post-write row counts against a ledger kept in results/row_ledger.json and exits 2 when a delta falls outside per-table bounds.
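As one possible shape for recipe ①, here is a minimal sketch of guard_raw.py. It assumes the payload carries tool_input.file_path as described above and that relative paths are relative to the project root; the function names and the message wording are illustrative, not the course's own script.

```python
import json
import sys
from pathlib import Path

RAW_DIR = Path("data/raw")


def guard_raw(payload: dict, project_root: Path) -> tuple[int, str]:
    """Return (exit_code, message): 2 when the target path is under data/raw/."""
    file_path = payload.get("tool_input", {}).get("file_path")
    if not file_path:
        return 0, ""  # nothing to judge: pass in silence
    raw_root = (project_root / RAW_DIR).resolve()
    # Joining with an absolute file_path yields that absolute path unchanged;
    # resolve() also collapses any ".." segments before the comparison.
    target = (project_root / file_path).resolve()
    if target == raw_root or raw_root in target.parents:
        return 2, f"BLOCKED: {file_path} is under data/raw/, which is read-only"
    return 0, ""


def main() -> int:
    """Hook entry point: read the tool call from stdin, answer with an exit code."""
    code, message = guard_raw(json.load(sys.stdin), Path.cwd())
    if message:
        print(message, file=sys.stderr)
    return code

# As an installed hook script, the file would end with: sys.exit(main())
```

Resolving both paths before comparing is the design choice that matters: a path such as src/transforms/../../data/raw/x.parquet is caught, while data/raw_notes.md is not.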

One realistic extension, and it is Claude Code’s alone: hooks whose verdict comes from a model rather than a script. A prompt-based hook can gate on judgment calls — “does this edit change the sample definition?” — where no regex can. Treat them as a preview of D4’s referee: enforcement for rules, adversarial judgment for everything rules can’t name.

Codex

The same four recipes declare in hooks.json at the repo root, or equivalently in the [hooks] table of config.toml. The TOML form, in the project-layer team config so the rules arrive with the clone:

config.toml
[hooks]
session_start = [
  { command = "python scripts/hooks/session_brief.py" },
]
pre_tool_use = [
  { matcher = "write", command = "python scripts/hooks/guard_raw.py" },
]
post_tool_use = [
  { matcher = "write", command = "python scripts/hooks/on_transform_change.py" },
  { matcher = "shell", command = "python scripts/hooks/bound_row_deltas.py" },
]

The event vocabulary is identical to Claude Code’s — PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop — written in TOML’s snake_case. The verdict protocol is identical too: the hook command reads the tool-call JSON on stdin, exit 0 passes, exit 2 blocks and returns stderr to the agent. The four scripts are the same files; the dialect difference is confined entirely to this table.

  • guard_raw.py (recipe ①) vetoes file writes under data/raw/; the matcher = "write" entry scopes it to file-editing tools, while shell stays fenced by the B3 sandbox profile.
  • on_transform_change.py (recipe ②) self-scopes to src/transforms/ paths and runs the contract suite on a match.
  • session_brief.py (recipe ③) prints the data-version digest and row counts; session_start stdout is read into the agent’s context.
  • bound_row_deltas.py (recipe ④) bounds warehouse row deltas after shell writes.
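The core of recipe ④ could look like the sketch below. The ledger layout (a flat table-name to row-count mapping), the DELTA_BOUNDS values, and both function names are assumptions for illustration; the real script would read current counts from the warehouse and the ledger from results/row_ledger.json.

```python
import json
from pathlib import Path

# Hypothetical per-table bounds on how many rows a single write may add.
DELTA_BOUNDS = {"trips_raw": (2_000_000, 4_800_000)}


def check_deltas(ledger: dict[str, int], current: dict[str, int],
                 bounds: dict[str, tuple[int, int]]) -> list[str]:
    """Compare current row counts with the ledger; one message per violation."""
    failures = []
    for table, (lo, hi) in bounds.items():
        delta = current.get(table, 0) - ledger.get(table, 0)
        # A delta of zero means this write did not touch the table.
        if delta != 0 and not lo <= delta <= hi:
            failures.append(f"{table}: row delta {delta:+,} outside [{lo:,}, {hi:,}]")
    return failures


def load_ledger(path: Path) -> dict[str, int]:
    """Read the ledger file, treating a missing file as an empty ledger."""
    return json.loads(path.read_text()) if path.exists() else {}
```

A hook built on this would exit 2 whenever check_deltas returns a non-empty list, printing the messages to stderr, and would update the ledger only after a pass.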

The extension that matters here is layering: know which config layer owns hooks. Project-layer declarations and your personal ~/.codex/config.toml merge per event — entries from every layer run. A repo-shipped contract gate therefore cannot be silently disabled by a user-level table that never mentions post_tool_use; but the reverse also holds, and the session_start briefing you forgot you declared globally will follow you into every project until you go looking for it.
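To make the per-event merge concrete, here is a toy model of it. It only illustrates the behaviour described above and is not Codex's implementation; the layer order used here (user first, then project) is an arbitrary assumption, and global_brief.py is a hypothetical forgotten global entry.

```python
def merge_hook_layers(*layers: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Concatenate hook entries per event across layers, in the order given."""
    effective: dict[str, list[dict]] = {}
    for layer in layers:
        for event, entries in layer.items():
            effective.setdefault(event, []).extend(entries)
    return effective


user_layer = {
    "session_start": [{"command": "python ~/bin/global_brief.py"}],
}
project_layer = {
    "session_start": [{"command": "python scripts/hooks/session_brief.py"}],
    "post_tool_use": [{"matcher": "write",
                       "command": "python scripts/hooks/on_transform_change.py"}],
}

effective = merge_hook_layers(user_layer, project_layer)
for event, entries in effective.items():
    print(event, [entry["command"] for entry in entries])
```

Both session_start entries survive the merge, and the project's post_tool_use gate is untouched by a user layer that never mentions that event.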

Watch this space

Codex hooks stabilized at v0.124; earlier releases shipped them behind a flag with a narrower event list. Pin your CLI version and recheck quarterly.

The validation suite is yours

The hook is nine lines of configuration; the judgment lives in your own validation suite, versioned with the project like any other methodology. The contract checks — expected schema, null-rate ceilings, row-delta bounds — are statistics, so they come in both of the lab’s languages:

Python

scripts/validate_contracts.py
import sys

import duckdb

# Contract for trips_raw: required columns and their DuckDB types.
EXPECTED = {"airport_fee": "DOUBLE", "tpep_pickup_datetime": "TIMESTAMP"}
NULL_CEILING = {"airport_fee": 0.05}  # the JFK signal must stay visible
ROW_BOUNDS = (2_000_000, 4_800_000)  # a plausible yellow-cab month

con = duckdb.connect("warehouse.duckdb", read_only=True)

# Schema check: DESCRIBE returns (column_name, column_type, ...) per column.
cols = {name: dtype for name, dtype, *_ in con.sql("DESCRIBE trips_raw").fetchall()}
failures = [f"column {c}: expected {t}, found {cols.get(c)}"
            for c, t in EXPECTED.items() if cols.get(c) != t]

# Null-rate and row-count checks look only at the most recent month.
LATEST = "month = (SELECT max(month) FROM trips_raw)"
for col, ceiling in NULL_CEILING.items():
    if col in cols:
        rate = con.sql(
            f"SELECT avg(({col} IS NULL)::INT) FROM trips_raw WHERE {LATEST}"
        ).fetchone()[0]
        # avg() over zero rows is NULL (None); the row-count check reports that case.
        if rate is not None and rate > ceiling:
            failures.append(f"{col}: {rate:.1%} null in latest month "
                            f"(ceiling {ceiling:.0%})")

n = con.sql(f"SELECT count(*) FROM trips_raw WHERE {LATEST}").fetchone()[0]
if not ROW_BOUNDS[0] <= n <= ROW_BOUNDS[1]:
    failures.append(f"latest month has {n:,} rows — outside {ROW_BOUNDS}")

if failures:
    print("FAILED trips_raw:\n " + "\n ".join(failures), file=sys.stderr)
    sys.exit(2)  # the blocking verdict: reject the write, report to the agent


R

scripts/validate_contracts.R
library(duckdb)
expected <- c(airport_fee = "DOUBLE", tpep_pickup_datetime = "TIMESTAMP")
null_ceiling <- c(airport_fee = 0.05) # the JFK signal must stay visible
row_bounds <- c(2e6, 4.8e6) # a plausible yellow-cab month
con <- dbConnect(duckdb(), "warehouse.duckdb", read_only = TRUE)
cols <- dbGetQuery(con, "DESCRIBE trips_raw")
types <- setNames(cols$column_type, cols$column_name)
failures <- character(0)
got <- types[names(expected)]
bad <- names(expected)[is.na(got) | got != expected]
if (length(bad) > 0)
  failures <- c(failures, sprintf("column %s: expected %s, found %s",
                                  bad, expected[bad], got[bad]))
latest <- "month = (SELECT max(month) FROM trips_raw)"
for (col in names(null_ceiling)) {
  if (col %in% names(types)) {
    rate <- dbGetQuery(con, sprintf(
      "SELECT avg((%s IS NULL)::INT) FROM trips_raw WHERE %s", col, latest))[[1]]
    if (rate > null_ceiling[[col]])
      failures <- c(failures,
                    sprintf("%s: %.1f%% null in latest month", col, 100 * rate))
  }
}
n <- dbGetQuery(con, sprintf(
  "SELECT count(*) FROM trips_raw WHERE %s", latest))[[1]]
if (n < row_bounds[1] || n > row_bounds[2])
  failures <- c(failures, sprintf("latest month has %s rows — out of bounds",
                                  format(n, big.mark = ",")))
if (length(failures) > 0) {
  message("FAILED trips_raw:\n ", paste(failures, collapse = "\n "))
  quit(status = 2) # the blocking verdict: reject the write, report to the agent
}

Note what the ceilings encode: not “no nulls” (real taxi data has nulls) but “the airport fee cannot vanish.” Contracts are research judgments with teeth — the rationale lines from C1’s skills, promoted to law.

The review bench

Before the machinery does it for you, sit at the data manager’s desk once. Below are the two parquet footers from the Pain vignette — the January file that ran green and the February file that just arrived. Somewhere in that listing is the drift. You get three filings; the bench keeps score honestly, and missing is instructive — you will know exactly what you are asking a hook to never miss.

Review bench: The February footers, side by side

Two parquet footers: January 2023 (loaded last month, ran green) and the February file that just arrived. One of these lines is eleven days of silent drift. Mark the suspect line and file your suspicion.

$ python scripts/probe_schemas.py --months 2023-01 2023-02 (footer range reads, ~64 KB each — no downloads)

Now watch the law fire. The guided run below replays the February incident with the gate installed: the ingest reports success, the hook reads the warehouse, and the rename that cost you eleven days in the Pain vignette is caught in roughly twenty seconds.

Schema drift, caught in the wild
TLC renamed airport_fee → Airport_fee between the 2023-01 and 2023-02 yellow files (dtype unchanged: double). A naïve name-based merge silently NULLs the airport signal for every later row.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · c2-drift

remote_footer_probe 26 months probed · → out/schema_probe.json

(pyarrow ParquetFile over HTTP range reads — src/probe_schemas.py; ~64 KB/month, no full downloads)

honesty note REAL drift: the flip is 2023-01 → 2023-02, not the course's 2024-03 placeholder — our local 2024-02 and 2024-03 files both already carry Airport_fee. Null rates for 2022-12 (3.7%) and 2023-01 (2.3%) come from parquet footer statistics; the 100% bars for 2023-02/03 are definitional (the lowercase column is absent from those files, so a name-based merge yields NULL for all 2,913,955 / 3,403,766 rows). Nothing reconstructed.
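The drift described here, a column that differs from its predecessor only by case, can be detected and turned into a rename map with a few lines. The sketch below works on plain lists of column names; the lists are abbreviated stand-ins for the two footers, and case_only_renames is a hypothetical helper, not the contents of src/transforms/standardize.py.

```python
def case_only_renames(canonical: list[str], incoming: list[str]) -> dict[str, str]:
    """Map incoming column names to canonical ones when they differ only by case."""
    by_folded = {name.casefold(): name for name in canonical}
    renames = {}
    for name in incoming:
        target = by_folded.get(name.casefold())
        if target is not None and target != name:
            renames[name] = target
    return renames


# Abbreviated column lists standing in for the 2023-01 and 2023-02 footers.
january = ["tpep_pickup_datetime", "fare_amount", "airport_fee"]
february = ["tpep_pickup_datetime", "fare_amount", "Airport_fee"]

print(case_only_renames(january, february))  # {'Airport_fee': 'airport_fee'}
```

Applying the returned map before a name-based merge keeps the airport signal in one column instead of splitting it across two spellings.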

Guided Run — Saved by the Hook

[Interactive guided run: Field Terminal, session c2-hook-drift, Claude Code]

Field Assignment

Artifact: make check-c2 passes

Put the cleaning pipeline under contract, then prove the contract holds. You will install the four recipes, run C1’s cleaning procedure across all 24 months, and walk into the planted bug on purpose: the real TLC schema drift in data/raw/yellow_2023-02.parquet — the casing flip, one month into the working window.

Claude Code

  1. Commit the four-recipe hooks block to the project layer of .claude/settings.json, alongside scripts/hooks/ and the validation suite.
  2. Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
  3. Run /clean-trips across all 24 months and produce the filter cascade table — row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
  4. When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
  5. File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then make check-c2.

Codex

  1. Commit the four-recipe [hooks] table (or hooks.json) to the project layer, alongside scripts/hooks/ and the validation suite.
  2. Open a fresh session and confirm recipe ③ fires: the data-version digest and row counts appear before you type anything.
  3. Run $clean-trips across all 24 months and produce the filter cascade table — row counts surviving each documented filter, the audit trail that becomes the report’s data appendix.
  4. When the February file hits the contract gate, read the hook’s stderr in the transcript, approve the agent’s rename-map fix in src/transforms/standardize.py, and re-run to green.
  5. File the incident in journal/2023-02-drift.md: cause, what caught it, time-to-detection. Then make check-c2.

make check-c2 verifies three artifacts: the contracts pass over all 24 months, the cascade table exists with monotonically non-increasing counts, and the incident log records the drift. This is what C3 inherits — the warehouse panel is only worth plumbing because every write that built it passed this gate.

The filter cascade, with receipts
10,129,347 raw rows → 9,961,866 clean across the three months; each documented filter's removal counted and shown in rubric.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · c2-cascade

filter_cascade

SELECT * FROM filter_cascade ORDER BY step
step | id | label | rationale | rows_remaining | rows_removed
0 | raw | All rows in the three monthly files | | 10,129,347 | 0
1 | in_month | Pickup inside the file's own calendar month | TLC monthly files contain stray rows dated years away (meter clock faults); they would land in the wrong panel cell or outside the panel. | 10,129,258 | 89
2 | fare_nonneg | fare_amount >= 0 | Negative fares are refunds, disputes, or voided meters -- not demand. | 9,968,476 | 160,782
3 | duration_pos | dropoff strictly after pickup | Zero- and negative-duration trips are meter artifacts; they cannot be completed trips and corrupt every speed or duration measure. | 9,965,415 | 3,061
4 | speed_le_65 | implied speed <= 65 mph | distance/duration above highway speed is physically implausible inside NYC; odometer or timestamp errors. | 9,961,866 | 3,549
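The cascade's internal arithmetic can be checked mechanically. The sketch below copies the counts from the table and confirms that they never increase and that each rows_removed equals the drop from the previous step; it illustrates the kind of check make check-c2 is described as performing and is not the Makefile target itself.

```python
# (filter id, rows_remaining, rows_removed) copied from the filter_cascade table.
CASCADE = [
    ("raw",          10_129_347,       0),
    ("in_month",     10_129_258,      89),
    ("fare_nonneg",   9_968_476, 160_782),
    ("duration_pos",  9_965_415,   3_061),
    ("speed_le_65",   9_961_866,   3_549),
]


def cascade_problems(rows: list[tuple[str, int, int]]) -> list[str]:
    """Return a message for every step that gains rows or misreports its removals."""
    problems = []
    for (_, prev_remaining, _), (step, remaining, removed) in zip(rows, rows[1:]):
        if remaining > prev_remaining:
            problems.append(f"{step}: rows increased ({prev_remaining:,} -> {remaining:,})")
        if prev_remaining - remaining != removed:
            problems.append(
                f"{step}: rows_removed {removed:,} != {prev_remaining - remaining:,}")
    return problems


print(cascade_problems(CASCADE))  # []
```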
Milestone gate · make check-c2 · advances C2
  1. Raw-write guard, transform-change validation, session briefing, row-delta bounds.

  2. Counts surviving each documented filter — the report's data appendix starts here.

  3. The hook must block on failure; a warning is the bug this lesson exists to kill.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Hooks that warn instead of block get ignored — by you and by the agent. A schema-drift warning scrolled past at 2 a.m. is how airport_fee becomes 100% null and your estimates quietly lose the airport zones. Integrity gates must block: exit 2 is the only verdict that protects the panel.

  • [both]

    Validation in the skill is not validation as a hook. C1’s cleaning skill carries the same checks, but a skill runs when invoked and the months you ingest in a hurry are exactly the months nobody invokes it. The skill encodes the procedure; the hook removes the option of skipping it.

  • [both]

    A full test suite on every edit makes the agent glacial — it will pay your contract suite’s runtime on a README typo. Scope tightly: matchers narrow the tool, and the first line of your hook script should be the cheap path-check that exits 0.

  • [CX]

    Know which config layer owns hooks. Project and user layers merge per event, so the gate you think you’re testing may be accompanied by a global entry you declared three projects ago — and the merged order is defined, not intuitive. Audit the effective table, not the file you happen to have open.
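The cheap path-check from the third pitfall can be sketched as below. It assumes tool_input.file_path is either absolute or relative to the project root; needs_validation and on_transform_change are illustrative names, and running the suite through the current Python interpreter is an assumption about how the project is set up.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def needs_validation(file_path: Optional[str], project_root: Path) -> bool:
    """Cheap check: is the edited file under src/transforms/?"""
    if not file_path:
        return False
    scope = (project_root / "src/transforms").resolve()
    return scope in (project_root / file_path).resolve().parents


def on_transform_change(payload: dict, project_root: Path) -> int:
    """Exit code for the hook: 0 when off-scope, otherwise the suite's own verdict."""
    file_path = payload.get("tool_input", {}).get("file_path")
    if not needs_validation(file_path, project_root):
        return 0  # a README typo never reaches the suite
    suite = subprocess.run(
        [sys.executable, "scripts/validate_contracts.py"],
        cwd=project_root, capture_output=True, text=True,
    )
    sys.stderr.write(suite.stderr)  # pass the suite's report through to the agent
    return suite.returncode
```

The early return costs a couple of path operations; the subprocess, which is the expensive part, only starts when an edit actually lands under src/transforms/.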

Check Your Bearings

C2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

as of June 2026

Parity note

Hooks are a genuine parity feature: both tools speak the same event vocabulary (PreToolUse, PostToolUse, SessionStart, SubagentStop, Stop) and the same exit-code verdicts, differing only in configuration surface — settings layers on one side, TOML tables on the other, stable in Codex since v0.124. The real asymmetry is at the top end: Claude Code additionally offers prompt- and agent-based hooks, where the verdict comes from a model judging the action rather than a script matching it. Codex has no equivalent yet.

Ledger — C2

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.


Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall — the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster — but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes — the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated — the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall — the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster — but only if you direct the work | the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes — the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon — most of it download wall-clock, not attention
B3 | the data manager who guards the raw files — the person who says no near the master copies | permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist — the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance — est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool — and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch — and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated — the PI who designs the system instead of doing the work | each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.