
D4 · advanced · ~60 min

Fleets: Orchestrated Robustness

Absorbs: an RA bench and the PI who keeps their results comparable

Advances D4

The Pain

The referee report asked for a specification curve, and it was right to. So now you are running it by hand. The main model, then the same model with date fixed effects, then with the airport zones dropped, then Poisson instead of log, then weekday-only, then the outer boroughs, then all of it again for trip duration instead of pickups. Each one is a twenty-minute edit and a forty-minute fit, and each one you run alone, in series, because the last time you tried to keep four going at once they wrote their results into the same file and you spent an afternoon working out which number belonged to which model.

By Thursday you have eleven figures in a folder and you have started to relax, because they all point the same way and several of them point the same way very precisely — tighter standard errors than the main run, which you note with the quiet satisfaction of someone who has stopped asking why. The robustness is the part of the report that is supposed to make you trust the result. Instead it has made you tired, and tired is exactly the condition under which a too-good number stops looking suspicious and starts looking like a relief.

A real lab has two people you do not. One is the research assistant who would have run all eleven overnight without complaint. The other is the senior colleague who reads the robustness table, stops at the suspiciously tight one, and asks the question you are too close to ask: what is on the right-hand side of that regression?

Why / When

A specification curve is robustness made honest: instead of reporting the one model that worked, you report the whole cloud of reasonable models — every defensible outcome, fixed-effects structure, functional form, and sample filter — and let the reader see how the estimate moves as you turn the dials. It is embarrassingly parallel work. Twelve specifications do not depend on each other; they depend only on the same panel. One researcher running them in series is a queue with a person standing in it.

This lesson runs the whole curve at once, and that raises a problem the serial researcher never had: when twelve workers share a project, what keeps their results comparable? The answer is not coordination — it is a contract. You fix, in advance and in writing, what each worker reads and exactly what it writes. Parallel agents that agree on a contract need not talk to each other at all, the way a pre-registered analysis plan lets a human team work apart and still combine.

It absorbs two roles at once. The first is the research assistant bench that would grind through the curve. The second is rarer and more valuable: the principal investigator who keeps everyone’s results in the same units and reads the table with suspicion. The fleet handles the first. The lesson’s second half is about the second — because a fleet that runs eleven wrong specifications faster is not progress.

Contrary winds

Not for: a study with one defensible specification — if there is genuinely nothing to vary, a specification curve is theatre, and a fleet to run it is expensive theatre.

Mechanics

This is an orchestration lesson with a statistics sting in the tail. The fan-out machinery is language-neutral — the specifications it dispatches can be fit in Python or R without changing a line of the orchestration — so this page declares no R variants and spends its parity budget where the tools genuinely differ: how each one turns a manifest into a fleet.

The experiment contract

Before any agent runs, write the manifest. It is a flat table, one row per specification, and it is the only thing the whole fleet agrees on:

specifications.csv
spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm
s02,log_pickups,zone+hod+date,linear,all_zone_hours,precip_mm
s03,log_pickups,zone,linear,all_zone_hours,precip_mm
s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm
s09,log_pickups,zone+hod,linear,outer_boroughs,precip_mm
...

The manifest is half the contract. The other half is a hard rule about output, and it is not negotiable:

Every spec writes exactly results/{spec_id}/estimates.json and results/{spec_id}/provenance.yaml — its own directory, its own two files, nothing else, nowhere else.

The estimates.json carries the coefficient, standard error, and CI; the provenance.yaml records the manifest row, the data version hash, the code commit, and the wall-clock time, enough that a stranger can tell which model produced which number. This is the rule the Pain vignette broke by hand: four runs, one shared file, and an afternoon spent reconstructing which number was which. The contract makes the question un-askable. The F4 film makes the failure visible; watch what happens to the curve with the contract off, then on:

System Player film: Fleet Fan-Out, Contracts On/Off
Fleet fan-out with contracts on and off: twelve specifications fan out from a manifest (specifications.csv) to a fleet of agents, one agent per spec. With the contract off, every agent writes the same output file (results/estimates.json: twelve writers, last wins) and the assembled spec curve corrupts. With the contract on, each agent writes results/{spec_id}/estimates.json (twelve slots, no overlap) and the curve assembles clean.

Step 1 of 7.

The job is a specification curve: twelve variants of the same model, each a row in specifications.csv that fixes the outcome, fixed effects, functional form, sample filter, and weather variable. One manifest, twelve specs, and you want all twelve run at once.

A consequence worth stating plainly: agents that only write results can share one working tree safely, because the contract guarantees they never touch the same path. Agents that change code — a new estimator, a different filter — cannot, and get their own D2 worktree. Read-only fan-out is cheap; code-touching fan-out costs a worktree apiece. Sort your specifications into those two piles before you dispatch them.
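The sorting step above can be sketched from the manifest alone. This is a minimal illustration, not the course repo's code: it assumes the manifest columns shown earlier, and it assumes (hypothetically) that a row is code-touching when its functional_form is `new_estimator`, the value the workflow script in the next section tests for. The row `s99`, the `CODE_TOUCHING_FORMS` set, and the helper names are invented for the example.

```python
import csv
import io
from pathlib import PurePosixPath

# Illustrative manifest: s01 and s02 are rows from specifications.csv above;
# s99 is a made-up row standing in for a code-touching variant.
MANIFEST = """spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm
s02,log_pickups,zone+hod+date,linear,all_zone_hours,precip_mm
s99,log_pickups,zone+hod,new_estimator,all_zone_hours,precip_mm
"""

# Assumption: functional forms that require new code, and therefore a worktree.
CODE_TOUCHING_FORMS = {"new_estimator"}


def contract_paths(spec_id):
    """The two files a lane may write under the results contract."""
    base = PurePosixPath("results") / spec_id
    return (str(base / "estimates.json"), str(base / "provenance.yaml"))


def plan_lanes(manifest_text):
    """Split manifest rows into shared-tree and worktree lanes."""
    rows = list(csv.DictReader(io.StringIO(manifest_text)))
    ids = [row["spec_id"] for row in rows]
    if len(ids) != len(set(ids)):
        # Two rows with one spec_id would share an output directory.
        raise ValueError("duplicate spec_id in manifest")
    shared, worktree = [], []
    for row in rows:
        pile = worktree if row["functional_form"] in CODE_TOUCHING_FORMS else shared
        pile.append(row["spec_id"])
    return {
        "shared_tree": shared,
        "worktree": worktree,
        "paths": {spec_id: contract_paths(spec_id) for spec_id in ids},
    }


plan = plan_lanes(MANIFEST)
print("shared tree:", plan["shared_tree"])
print("worktrees:  ", plan["worktree"])
```

Because every path is derived from a unique spec_id, no two lanes can be handed the same output file, which is the property that lets results-only agents share one tree.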

Fanning out

The two tools reach the same fleet from opposite directions, and the contrast is the lesson. Claude Code composes a fleet from a script you write — explicit, inspectable, local orchestration. Codex reads the manifest as the fan-out — the table is the program. Each gets its full native treatment; the translation guide afterward maps intents, not syntax.

Claude Code

Dynamic Workflows — you script the fleet

A Dynamic Workflow is a JavaScript orchestration script the agent runs on your behalf: it reads the manifest, spawns one specification agent per row, gates each on the results contract, and assembles the curve. You invoke it by asking for a workflow (“use a workflow to run the specification curve”); the stages are ordinary code you can read:

workflows/spec_curve.js
import { parse } from 'csv-parse/sync';
import { readFileSync } from 'node:fs';

// mapConcurrent, agent, assertContract, and buildSpecCurve are not defined in
// this listing; they are assumed to come from the workflow runtime or repo helpers.
const specs = parse(readFileSync('specifications.csv'), { columns: true });

// Stage 1: fan out, ≤16 concurrent. Pass the SPEC ID, not a dataframe:
// the worker reads the panel itself from the path in its provenance.
const results = await mapConcurrent(specs, 16, (spec) =>
  agent.run({
    prompt: `Fit specification ${spec.spec_id} from specifications.csv.
Write results/${spec.spec_id}/estimates.json and provenance.yaml.
Read nothing outside the manifest row and the panel.`,
    // Code-touching lanes get their own worktree; results-only lanes share the tree.
    worktree: spec.functional_form === 'new_estimator',
  }),
);

// Stage 2: gate each lane on the results contract before it counts.
for (const spec of specs) assertContract(`results/${spec.spec_id}`);

// Stage 3: assemble the curve from the gated estimates.
buildSpecCurve(specs.map((s) => `results/${s.spec_id}/estimates.json`));

The script is the point: concurrency cap, contract gate, and curve assembly are all visible, versioned, and yours to audit. The orchestration lives in your repo, not in a prompt you cannot re-read. Pass paths, never dataframes — the second you pipe a panel through a prompt you have paid to serialize the thing you fanned out to parallelize.
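The concurrency cap is the part of the script most worth seeing in isolation. As an illustration only, here is a Python analogue of a capped fan-out built on `asyncio.Semaphore`. The names `map_concurrent` and `fake_fit` are hypothetical, the sleep stands in for a model fit, and the workflow's own `mapConcurrent` helper may behave differently.

```python
import asyncio


async def map_concurrent(items, cap, worker):
    """Run worker(item) for every item, with at most `cap` running at once."""
    gate = asyncio.Semaphore(cap)

    async def run_one(item):
        async with gate:  # blocks while `cap` workers are already running
            return await worker(item)

    # gather preserves input order, so results line up with manifest rows.
    return await asyncio.gather(*(run_one(item) for item in items))


async def demo(n_specs=12, cap=4):
    running = 0
    peak = 0

    async def fake_fit(spec_id):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for a model fit
        running -= 1
        # Return a path, never the data itself.
        return f"results/{spec_id}/estimates.json"

    spec_ids = [f"s{i:02d}" for i in range(1, n_specs + 1)]
    paths = await map_concurrent(spec_ids, cap, fake_fit)
    return paths, peak


paths, peak = asyncio.run(demo())
print(len(paths), "lanes finished; peak concurrency", peak)
```

The cap changes how many lanes run at once and nothing else: all twelve lanes still run, and each still returns only the path it wrote.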

Codex

spawn_agents_on_csv — the manifest is the fleet

spawn_agents_on_csv (experimental) makes the manifest itself the fan-out: one worker per row, with {column} placeholders filled from that row, every worker reporting back through the job-result call. There is no orchestration script — the table is the program:

the fan-out brief
Use spawn_agents_on_csv over specifications.csv, max_threads 16.
For each row, fit specification {spec_id}: outcome {outcome}, fixed
effects {fixed_effects}, {functional_form} form, {sample_filter} sample,
weather variable {weather_var}. Write results/{spec_id}/estimates.json
and provenance.yaml. Report the coefficient and CI via the job result.
Read nothing outside your row and the panel.

max_threads governs concurrency the way the workflow’s cap does; the managed alternative is a best-of-N cloud run, where N variants of a harder spec are tried and the run returns the strongest with its reasoning. The design philosophy is the mirror image of the scripted fleet: you delegate the orchestration to the runtime and describe the work declaratively, trading the workflow’s auditable script for less code to maintain. The contract carries the same weight either way — the {spec_id} placeholder is what guarantees each worker writes its own directory and no other.
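To make the placeholder mechanism concrete, the sketch below fills a brief from manifest rows using Python's own `str.format_map`. It illustrates row-driven templating only; it is not how Codex implements spawn_agents_on_csv, whose placeholder rules are experimental and may differ. `render_briefs` is a hypothetical helper name.

```python
import csv
import io

BRIEF = (
    "Fit specification {spec_id}: outcome {outcome}, fixed effects "
    "{fixed_effects}, {functional_form} form, {sample_filter} sample, "
    "weather variable {weather_var}. "
    "Write results/{spec_id}/estimates.json and provenance.yaml."
)

# Two rows copied from the manifest shown earlier.
MANIFEST = """spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm
s09,log_pickups,zone+hod,linear,outer_boroughs,precip_mm
"""


def render_briefs(manifest_text, template):
    """Return {spec_id: brief}. A placeholder with no matching column raises KeyError."""
    rows = csv.DictReader(io.StringIO(manifest_text))
    return {row["spec_id"]: template.format_map(row) for row in rows}


briefs = render_briefs(MANIFEST, BRIEF)
print(briefs["s07"])
```

Rendering every brief before dispatching anything is a cheap dry run: a renamed or missing column fails here, on your machine, instead of inside twelve workers.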

Translation guide
| Intent | Claude Code | Codex |
| --- | --- | --- |
| run a 12-spec curve in parallel | Dynamic Workflow (JS): fan-out → contract gate → assemble | spawn_agents_on_csv over the manifest, one worker per row |
| cap concurrency | the workflow’s concurrency limit (≤16 here) | max_threads on the spawn call |
| isolate code-touching variants | spawn the lane in a worktree from the script | route the row to a cloud task / isolated container |
| try N variants of a hard spec | loop the spec in the workflow, pick by a written rule | best-of-N cloud run (managed) |

Budget first

Field note

A fleet multiplies tokens by the fleet size. Twelve specification agents, each reading the panel and writing a fit, can out-spend a week of ordinary sessions in an afternoon — and the concurrency cap buys you wall-clock, not a discount: sixteen agents at once cost the same tokens as sixteen agents in series, only sooner. Price the run before you launch it, and re-price it before you re-run “everything, to be fair.”

The calculator below seeds from the 12-spec manifest. Move the fleet size and the cap and watch the two numbers diverge: the cap pulls wall-clock down and leaves total cost flat. Budgeting a fleet is a deliberate act, not a surprise on the invoice.

Orchestration economics: token-cost calculator

Model tier ($/Mtok, editable)

Total cost: $21.60 ($1.80 / run · 1,440,000 tokens)
Serial wall-clock: 18m (one researcher, one run at a time)
Concurrent (cap 16): 1m 30s (1 wave · 12.0× faster)

Same dollars either way — fan-out buys wall-clock, not a discount. Budget the tokens first; the cap is how the fleet stays affordable and fast at once.
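The calculator's arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions: the per-run inputs (120,000 tokens, 1.5 minutes, $15 per million tokens) are back-derived from the totals displayed above rather than given by the course, and the wave model assumes every run takes the same time. `fleet_budget` is a hypothetical helper.

```python
import math


def fleet_budget(n_specs, tokens_per_run, usd_per_mtok, minutes_per_run, cap):
    """Total spend and wall-clock for a fleet, assuming equal-length runs."""
    total_tokens = n_specs * tokens_per_run
    cost_usd = total_tokens / 1_000_000 * usd_per_mtok
    waves = math.ceil(n_specs / cap)  # batches needed under the cap
    return {
        "total_tokens": total_tokens,
        "cost_usd": round(cost_usd, 2),
        "serial_min": n_specs * minutes_per_run,
        "concurrent_min": waves * minutes_per_run,
        "waves": waves,
    }


# Inputs inferred from the display: 1,440,000 / 12 tokens, 18 min / 12, $1.80 per run.
inputs = dict(n_specs=12, tokens_per_run=120_000, usd_per_mtok=15.0, minutes_per_run=1.5)
wide = fleet_budget(cap=16, **inputs)
narrow = fleet_budget(cap=4, **inputs)
print(wide)
print(narrow)
```

Changing the cap from 16 to 4 triples the concurrent wall-clock (one wave becomes three) and leaves `cost_usd` untouched.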

The referee catches it

The fleet returns thirteen estimates and they look reassuring — clustered, mostly significant, the precipitation effect modestly positive across the board. One of them is also a plant. The starter repo baited an endogenous control: a specification that adds same-hour citywide demand to the right-hand side. It reads clean. No schema is wrong, no file is corrupt, no contract is violated — the C2 hooks have nothing to bite on, because nothing here is a rule violation. It is a judgment error, and judgment is the one thing a regex cannot supply.

So before the reveal, sit in the senior colleague’s chair for ninety seconds. Below is the same-hour-demand specification, as it would arrive in a robustness table. One line on the right-hand side is doing something no honest control should. Mark it.

Review bench: The right-hand side of spec s13

This is the same-hour-demand specification from the robustness fleet, fit and reported like any other lane. Its precipitation coefficient is the tightest in the whole curve — which should make you read its regressors, not trust them. One term on the right-hand side is doing something no honest control should. Mark it and file your suspicion.

results/s13/provenance.yaml → rhs_terms (the regressors, in fit order)

Now watch the fleet assemble the whole curve, with the baited spec in it. Each lane lands its estimate; the coefficients sort themselves into the curve; one whisker sits visibly off the trend the other twelve trace:

Grand Central · fleet dispatch: Weather & mobility, the specification curve
  1. precip-pickups-zonehour: pickups · precip · log · zone+hour (at platform)
  2. precip-pickups-zonedow: pickups · precip · log · zone×dow (at platform)
  3. precip-pickups-weekday: pickups · precip · log · zone+hour (at platform)
  4. precip-pickups-airportexcl: pickups · precip · poisson · zone+hour (at platform)
  5. snow-pickups-zonehour: pickups · snow · log · zone+hour (at platform)
  6. snow-pickups-weekday: pickups · snow · poisson · zone×dow (at platform)
  7. temp-pickups-zonehour: pickups · temp · level · zone+hour (at platform)
  8. precip-duration-zonehour: trip_duration · precip · log · zone+hour (at platform)
  9. precip-duration-airportexcl: trip_duration · precip · level · zone+hour (at platform)
  10. snow-duration-zonedow: trip_duration · snow · log · zone×dow (at platform)
  11. temp-duration-weekday: trip_duration · temp · level · zone+hour (at platform)
  12. sh-demand-control: pickups · precip · log · zone+hour (at platform)

That off-trend point is s13, and the reason a rule could never have caught it is the reason a referee can. The referee is C1’s demanding adviser, evolved: a skill that runs over the manifest, the code, and the results, and refuses to accept any claim without evidence attached (which file, which line, which number). It reads s13, follows the regressor back to its definition, and files the finding. Because it runs as an isolated subagent, it shares none of your context and inherits none of your fatigue, and so it asks the question you were too close to ask:

Guided Run — Caught by the Referee

Field Terminal — session: d4-referee Claude Code
claude

The finding is real, and the numbers are specific. The honest precipitation elasticity, with date fixed effects and no endogenous control, is +0.009 log-points (s02). The baited spec — identical fixed effects, plus same-hour citywide demand on the right — reads +0.004, on a CI of [+0.0005, +0.0071] that looks reassuringly tight. It is tight because the demand control (its own coefficient a thumping +0.685) absorbs precisely the variance precipitation works through: weather moves taxi demand via aggregate volume, so conditioning on same-hour volume is conditioning on a post-treatment outcome. The estimate did not get more precise. It got disabled, and the disabling left a tidy residual standard error in its wake.
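The mechanism is easy to reproduce on invented data. The simulation below is a toy, not the course's panel: it assumes a made-up data-generating process in which precipitation moves a citywide volume and pickups respond only to that volume, so the true total effect of precipitation is 0.5. It then fits the regression with and without the volume control, using the Frisch-Waugh-Lovell partialling-out identity in place of a matrix solve so that it needs only the standard library.

```python
import math
import random


def demean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope(x, y):
    """OLS slope of y on x; both lists must already be demeaned."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def residualize(values, on):
    """Remove the part of `values` that is linear in `on` (both demeaned)."""
    b = slope(on, values)
    return [v - b * o for v, o in zip(values, on)]


def fit(y, x, control=None):
    """Coefficient and classical SE on x, with an intercept and an optional control."""
    y, x = demean(y), demean(x)
    n_params = 2  # intercept + x
    if control is not None:
        c = demean(control)
        y, x = residualize(y, c), residualize(x, c)
        n_params = 3  # intercept + x + control
    beta = slope(x, y)
    ssr = sum((yi - beta * xi) ** 2 for yi, xi in zip(y, x))
    se = math.sqrt(ssr / (len(y) - n_params) / sum(xi * xi for xi in x))
    return beta, se


rng = random.Random(0)
n = 5000
precip = [rng.gauss(0, 1) for _ in range(n)]
# Invented process: weather moves citywide volume...
city = [0.5 * p + rng.gauss(0, 1) for p in precip]
# ...and pickups respond only to that volume (true total effect of precip = 0.5).
pickups = [1.0 * c + rng.gauss(0, 0.3) for c in city]

honest = fit(pickups, precip)
baited = fit(pickups, precip, control=city)
print("no control:   beta=%.3f se=%.4f" % honest)
print("with control: beta=%.3f se=%.4f" % baited)
```

With the mediator on the right-hand side, the precipitation coefficient falls toward zero and its standard error shrinks, because the control has soaked up the variation the effect travels through. The tight interval is a symptom of the misspecification, not evidence of precision.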

The full curve, with s13 flagged off-trend and the honest specs spanning +0.004 to +0.036 log-points, is the figure the report carries — the endogenous spec shown, struck through, and explained rather than quietly dropped:

The specification curve, and the spec that lies
Thirteen specifications of the precipitation elasticity (outcome × fixed-effects × functional form × sample filter × weather variable), sorted by coefficient: legitimate specs span β = +0.004 to +0.036 log-points. The flagged spec s13 conditions on same-hour citywide demand — an endogenous, post-treatment control — and collapses the effect to +0.004 on a deceptively tight CI. Illustrative run on the course's data slice.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · d4-spec-curve

analysis_panel 248,099 rows

SELECT location_id, borough, ts_local, pickups, precipitation, temperature_2m, snowfall, wind_speed_10m FROM panel_zone_hour WHERE borough IN ('Manhattan','Brooklyn','Queens','Bronx','Staten Island')

spec_curve

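| spec_id | outcome | fixed_effects | functional_form | sample_filter | weather_var | beta | se | ci_lo | ci_hi | n | endogenous |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| s01 | log_pickups | zone + hod | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.02 | 248,099 | |
| s02 | log_pickups | zone + hod + date | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s03 | log_pickups | zone | linear | all_zone_hours | precip_mm | 0.03 | 0 | 0.03 | 0.04 | 248,099 | |
| s04 | log_pickups | zone + hod + dow | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s05 | log_pickups | zone + hod | linear | all_zone_hours | rain_indicator | 0.04 | 0 | 0.03 | 0.04 | 248,099 | |
| s06 | log_pickups | zone + hod + date | linear | all_zone_hours | rain_indicator | 0.04 | 0.01 | 0.02 | 0.05 | 248,099 | |
| s07 | log_pickups | zone + hod | linear | manhattan_only | precip_mm | 0.03 | 0 | 0.02 | 0.03 | 125,165 | |
| s08 | log_pickups | zone + hod | linear | daytime_07_22 | precip_mm | 0.02 | 0 | 0.01 | 0.02 | 180,996 | |
| s09 | log_pickups | zone + hod | linear | outer_boroughs | precip_mm | 0 | 0 | 0 | 0.01 | 122,934 | |
| s10 | log_pickups | zone + hod | linear | feb_mar_only | precip_mm | 0.03 | 0 | 0.02 | 0.03 | 160,106 | |
| s11 | log_pickups | zone + hod + date | plus_temp_control | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s12 | log_pickups | zone + hod | quadratic | all_zone_hours | precip_mm | 0.01 | 0 | 0 | 0.02 | 248,099 | |
| s13 | log_pickups | zone + hod + date | linear + same-hour demand control | all_zone_hours | precip_mm | 0 | 0 | 0 | 0.01 | 248,099 | yes |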

endogenous_spec_s13

spec_id s13
beta_precip 0
se_precip 0
control_beta_log_city 0.68
comparable_clean_spec s02
comparable_clean_beta 0.01

honesty note Illustrative run on the course's 2024-02/03/06 slice: outcome is log(pickups) over nonzero zone-hours so all thirteen coefficients are comparable (log-points); whiskers are 95% classical CIs after FE absorption (iterative within-demeaning, dof-corrected for the absorbed dummies). ENDOGENOUS-CONTROL NOTE: spec s13 adds same-hour citywide demand (log_city) as a regressor. Weather moves demand THROUGH aggregate volume, so conditioning on it is a post-treatment / endogenous control: the precipitation coefficient falls from +0.009 (s02, identical FE) to +0.004, and the control absorbs the variance so the CI looks reassuringly tight. This is D4's planted bug — the finding the isolated referee files in Incident Report #2.

The general lesson closes the unit’s argument about oversight. Enforcement (C2’s hooks) is for rules a machine can check: a column name, a null rate, a row delta. Adversarial review (the referee) is for judgment a machine cannot: whether a control belongs on the right-hand side, whether a sample was shrunk to make a pre-trend pass. They are not redundant — they catch different classes of error — and a one-person lab that wants to be trusted runs both.

Guided Run — The Fleet, Under Contract

Field Terminal — session: d4-fleet Claude Code
claude

Field Assignment

Artifact: make check-d4 passes (curve assembled, baited regressor gone)

Run the specification curve as a fleet, then send a referee through it.

  1. Write the manifest: at least twelve rows in specifications.csv, covering both outcomes (pickups and trip duration) across fixed-effects structures, functional forms, and sample filters. Commit the results contract alongside it.
  2. Fan out in your primary tool — the workflow or spawn_agents_on_csv — under the concurrency cap you budgeted for, results-only agents in the shared tree, code-touching variants in worktrees. Run at least two rows in the other tool, so you have felt both fan-out philosophies.
  3. Gate every lane on the results contract and assemble the specification curve from the gated estimates.
  4. Run the referee skill as an isolated subagent over the manifest, code, and results. Read what it files: it should name the endogenous control by file, line, and number, not gesture at it.
  5. Drop the baited regressor, re-run only the affected lane, and re-assemble. The estimate will weaken honestly — that is the curve telling the truth. File the incident in journal/, then make check-d4.

make check-d4 verifies three things: every spec wrote its two contract files, the curve assembled from all of them, and the same-hour demand control is gone from the right-hand side. The clean curve is what F1’s results section is built on — robustness you ran in parallel and a referee you could not talk your way past.
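Two of those three checks can be sketched per lane in plain Python. This is not the course's make check-d4; it is an illustration under assumptions: the estimates.json key names (`beta`, `se`, `ci_lo`, `ci_hi`) are borrowed from the spec_curve table's column names, the regressor name `log_city` from the honesty note, and the right-hand-side check is a crude text scan of provenance.yaml rather than a YAML parse. The curve-assembly check is left out. `check_lane` and `write_lane` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path

CONTRACT_FILES = {"estimates.json", "provenance.yaml"}
REQUIRED_KEYS = {"beta", "se", "ci_lo", "ci_hi"}  # assumed key names
BANNED_TERM = "log_city"  # the same-hour citywide demand control


def check_lane(lane_dir):
    """Return a list of problems for one results/{spec_id}/ directory."""
    lane = Path(lane_dir)
    found = {p.name for p in lane.iterdir()} if lane.is_dir() else set()
    if found != CONTRACT_FILES:
        return [f"{lane.name}: expected {sorted(CONTRACT_FILES)}, found {sorted(found)}"]
    problems = []
    estimates = json.loads((lane / "estimates.json").read_text())
    missing = REQUIRED_KEYS - set(estimates)
    if missing:
        problems.append(f"{lane.name}: estimates.json lacks {sorted(missing)}")
    # Crude textual scan rather than a YAML parse, to stay dependency-free.
    if BANNED_TERM in (lane / "provenance.yaml").read_text():
        problems.append(f"{lane.name}: {BANNED_TERM} is on the right-hand side")
    return problems


def write_lane(root, spec_id, rhs_terms):
    """Demo helper: fabricate one lane's two contract files with placeholder values."""
    lane = Path(root) / spec_id
    lane.mkdir(parents=True)
    placeholder = {"beta": 0.0, "se": 0.0, "ci_lo": 0.0, "ci_hi": 0.0}
    (lane / "estimates.json").write_text(json.dumps(placeholder))
    (lane / "provenance.yaml").write_text(
        "rhs_terms:\n" + "".join(f"  - {term}\n" for term in rhs_terms)
    )
    return lane


with tempfile.TemporaryDirectory() as root:
    clean = check_lane(write_lane(root, "s02", ["precip_mm"]))
    baited = check_lane(write_lane(root, "s13", ["precip_mm", "log_city"]))
    absent = check_lane(Path(root) / "s05")  # lane that never wrote anything

print(clean, baited, absent, sep="\n")
```

A gate like this can confirm the regressor is gone once you know its name. It could not have told you the regressor was wrong in the first place; that still took the referee.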

Milestone gate · make check-d4 · advances D4
  1. One row per lane; the manifest is the only thing the whole fleet shares.

  2. Read-only fan-out can share one tree; code-touching variants get their own worktree.

  3. Scripted Dynamic Workflow on one side, declarative spawn_agents_on_csv on the other.

  4. Honest specs span +0.004 to +0.036 log-points.

  5. src/specs/build_rhs.py:47 — control_terms.append("log_city"); filed as Incident Report #2.

  6. The curve telling the truth: the tight CI was the bug, not the precision.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Parallel agents without a contract and worktrees overwrite each other. Twelve workers writing to one results path is not a fleet, it is a race, and the survivor is whichever spec finished last. The results/{spec_id}/ contract is not optional hygiene — it is the only thing that makes the fan-out reconstructable. Code-touching lanes get worktrees on top.

  • [both]

    A referee that does not demand evidence produces plausible nitpicks. “Consider whether s07 might suffer from selection” is the sound of a reviewer who read nothing: it could be said about any specification, which is exactly why it catches none. The referee earns its keep only when every claim carries a file, a line, and a number; bind it to that in the skill, or it becomes a generator of polite, ignorable doubt.

  • [CC]

    Passing dataframes through workflow prompts defeats the purpose. If the orchestrator loads the panel and pipes it into each agent’s prompt, you have serialized the data movement you fanned out to avoid — and paid tokens for the panel twelve times. Pass the path; let each worker read the panel itself.

  • [CX]

    spawn_agents_on_csv is experimental — the placeholder syntax and the job-result protocol have moved between releases. Pin your CLI version, keep the manifest small enough to dry-run first, and recheck the surface quarterly; an experimental fan-out that silently changed its {column} rules is a corrupted curve you will not notice until the referee does.

  • [both]

    Re-running “everything, to be fair” without re-budgeting is how a fleet eats a grant. After the referee’s fix you need to re-run one lane, not thirteen. Re-price before you re-launch; the cap controls wall-clock, never spend.
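The evidence rule in the referee pitfall above can be enforced mechanically on the referee's output, even though the judgment behind a finding cannot. A minimal sketch, assuming findings arrive as plain strings: accept one only if it cites a `path:line` location and at least one decimal number. The regular expressions and `is_evidenced` are hypothetical, and the specific finding text is assembled from figures quoted in this lesson.

```python
import re

# A file path with an extension, a colon, and a line number, e.g. pkg/mod.py:12
FILE_LINE = re.compile(r"\b[\w./-]+\.\w+:\d+\b")
# A signed or unsigned decimal, e.g. +0.009
NUMBER = re.compile(r"[+-]?\d+\.\d+")


def is_evidenced(finding):
    """True only if the finding cites a file:line and, separately, a number."""
    location = FILE_LINE.search(finding)
    if location is None:
        return False
    # Look for the number outside the matched location itself.
    rest = finding[: location.start()] + finding[location.end():]
    return NUMBER.search(rest) is not None


vague = "Consider whether s07 might suffer from selection"
specific = (
    'src/specs/build_rhs.py:47 appends "log_city"; the precipitation '
    "coefficient falls from +0.009 (s02) to +0.004 (s13)"
)
print(is_evidenced(vague), is_evidenced(specific))
```

A filter like this only checks the form of a finding. It rejects doubt that cites nothing; whether a cited finding is right still has to be read.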

Check Your Bearings

D4 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

as of June 2026

Parity note

The fan-out is a real philosophical split, not a feature gap, and this page teaches it as one: Claude Code composes a fleet from a Dynamic Workflow — a local JavaScript script you read, version, and audit — while Codex reads the manifest as the fan-out through the experimental spawn_agents_on_csv, with best-of-N cloud runs as the managed alternative. Scripted local orchestration versus declarative managed delegation; each reaches the same specification curve, and neither reproduces the other’s primitive natively. The results contract and the isolated referee are tool-neutral — they are project discipline, and they would catch the same endogenous control no matter which fleet ran it.

Ledger — D4

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.


Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall: the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster, but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes: the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated: the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

| Lesson | Role | Est. human-RA | Agent (yours when measured) |
| --- | --- | --- | --- |
| A1 | the wall: the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file; defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free |
| A2 | you, working an order of magnitude faster, but only if you direct the work | the slow tax of an undriven session: drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts |
| B1 | the lab manual nobody writes: the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time, plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter |
| B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon, most of it download wall-clock, not attention |
| B3 | the data manager who guards the raw files: the person who says no near the master copies | permanent vigilance you cannot staff; one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not |
| C1 | the methodologist: the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked |
| C2 | data manager / QA who never sleeps | permanent vigilance: est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter |
| C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source (credentials, retries, schema spelunking, timezone forensics), re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication |
| D1 | the RA pool, and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes |
| D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits, and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end |
| D3 | overnight RA | one night shift per estimation batch, and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine |
| D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass |
| E1 | reproducibility checker | a clean-room rebuild every few weeks: dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter |
| E2 | lab manager's standing chores | a recurring monthly chore nobody owns (check the CDN, pull, contract, append, re-estimate), reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate |
| E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt |
| F1 | the whole lab, orchestrated: the PI who designs the system instead of doing the work | each revision is a serialized chain (re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract), correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended |
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.