D4 advanced ~60 min

Fleets: Orchestrated Robustness

Absorbs: an RA bench and the PI who keeps their results comparable

Advances D4

The Pain

The referee report asked for a specification curve, and it was right to. So now you are running it by hand. The main model, then the same model with date fixed effects, then with the airport zones dropped, then Poisson instead of log, then weekday-only, then the outer boroughs, then all of it again for trip duration instead of pickups. Each one is a twenty-minute edit and a forty-minute fit, and each one you run alone, in series, because the last time you tried to keep four going at once they wrote their results into the same file and you spent an afternoon working out which number belonged to which model.

By Thursday you have eleven figures in a folder and you have started to relax, because they all point the same way and several of them point the same way very precisely — tighter standard errors than the main run, which you note with the quiet satisfaction of someone who has stopped asking why. The robustness is the part of the report that is supposed to make you trust the result. Instead it has made you tired, and tired is exactly the condition under which a too-good number stops looking suspicious and starts looking like a relief.

A real lab has two people you do not. One is the research assistant who would have run all eleven overnight without complaint. The other is the senior colleague who reads the robustness table, stops at the suspiciously tight one, and asks the question you are too close to ask: what is on the right-hand side of that regression?

Why / When

A specification curve is robustness made honest: instead of reporting the one model that worked, you report the whole cloud of reasonable models — every defensible outcome, fixed-effects structure, functional form, and sample filter — and let the reader see how the estimate moves as you turn the dials. It is embarrassingly parallel work. Twelve specifications do not depend on each other; they depend only on the same panel. One researcher running them in series is a queue with a person standing in it.

This lesson runs the whole curve at once, and that raises a problem the serial researcher never had: when twelve workers share a project, what keeps their results comparable? The answer is not coordination — it is a contract. You fix, in advance and in writing, what each worker reads and exactly what it writes. Parallel agents that agree on a contract need not talk to each other at all, the way a pre-registered analysis plan lets a human team work apart and still combine.

It absorbs two roles at once. The first is the research assistant bench that would grind through the curve. The second is rarer and more valuable: the principal investigator who keeps everyone’s results in the same units and reads the table with suspicion. The fleet handles the first. The lesson’s second half is about the second — because a fleet that runs eleven wrong specifications faster is not progress.

Contrary winds

Not for: a study with one defensible specification — if there is genuinely nothing to vary, a specification curve is theatre, and a fleet to run it is expensive theatre.

Mechanics

This is an orchestration lesson with a statistics sting in the tail. The fan-out machinery is language-neutral — the specifications it dispatches can be fit in Python or R without changing a line of the orchestration — so this page declares no R variants and spends its parity budget where the tools genuinely differ: how each one turns a manifest into a fleet.

The experiment contract

Before any agent runs, write the manifest. It is a flat table, one row per specification, and it is the only thing the whole fleet agrees on:

spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm
s02,log_pickups,zone+hod+date,linear,all_zone_hours,precip_mm
s03,log_pickups,zone,linear,all_zone_hours,precip_mm
s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm
s09,log_pickups,zone+hod,linear,outer_boroughs,precip_mm
...
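Because every lane keys its output on spec_id, it can help to validate the manifest before anything is dispatched. The following is a minimal sketch, assuming the column names shown in the header above; `load_manifest` and `SAMPLE` are hypothetical names, not part of the starter repo.

```python
import csv
import io

# Column names copied from the manifest header above.
REQUIRED_COLUMNS = [
    "spec_id", "outcome", "fixed_effects",
    "functional_form", "sample_filter", "weather_var",
]


def load_manifest(text):
    """Parse manifest CSV text and reject rows that would break the fan-out."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")
    rows, seen = [], set()
    for row in reader:
        spec_id = row["spec_id"]
        if not spec_id:
            raise ValueError("row with empty spec_id")
        if spec_id in seen:
            # Two rows sharing an id would share one results directory.
            raise ValueError(f"duplicate spec_id: {spec_id}")
        seen.add(spec_id)
        rows.append(row)
    return rows


SAMPLE = """spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm
s02,log_pickups,zone+hod+date,linear,all_zone_hours,precip_mm
"""
specs = load_manifest(SAMPLE)
```

A duplicate spec_id is the one manifest error that silently recreates the shared-file collision, which is why the sketch treats it as fatal rather than as a warning.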

The manifest is half the contract. The other half is a hard rule about output, and it is not negotiable:

Every spec writes exactly results/{spec_id}/estimates.json and results/{spec_id}/provenance.yaml — its own directory, its own two files, nothing else, nowhere else.

The estimates.json carries the coefficient, standard error, and CI; the provenance.yaml records the manifest row, the data version hash, the code commit, and the wall-clock time, enough that a stranger can tell which model produced which number. This is the rule the Pain vignette broke by hand: four runs at once, one shared results file, and an afternoon spent reconstructing which number was which. The contract makes the question un-askable. The F4 film makes the failure visible. Watch what happens to the curve with the contract off, then on:

System Player film: Fleet Fan-Out, Contracts On/Off

Step 1 of 7. The job is a specification curve: twelve variants of the same model, each a row in specifications.csv (outcome, fixed effects, functional form, sample filter, weather variable). One manifest, twelve specs, and you want all twelve run at once.

A consequence worth stating plainly: agents that only write results can share one working tree safely, because the contract guarantees they never touch the same path. Agents that change code — a new estimator, a different filter — cannot, and get their own D2 worktree. Read-only fan-out is cheap; code-touching fan-out costs a worktree apiece. Sort your specifications into those two piles before you dispatch them.
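The output rule can be checked mechanically on both sides: the worker writes exactly two files, and a gate refuses any lane whose directory holds anything else. This is a sketch under assumptions: the key names `beta`, `se`, `ci_lo`, `ci_hi` are borrowed from the results table's column headers, the provenance keys are illustrative, and `write_contract` and `check_contract` are hypothetical helpers rather than the repo's own.

```python
import json
import tempfile
from pathlib import Path

CONTRACT_FILES = {"estimates.json", "provenance.yaml"}
ESTIMATE_KEYS = {"beta", "se", "ci_lo", "ci_hi"}  # assumed key names


def write_contract(root, spec_id, estimates, provenance):
    """Write the two contract files for one spec, and nothing else."""
    out = Path(root) / "results" / spec_id
    out.mkdir(parents=True, exist_ok=True)
    (out / "estimates.json").write_text(json.dumps(estimates, indent=2))
    # Flat "key: value" lines are valid YAML for simple scalars; a real
    # project would more likely use a YAML library here.
    lines = [f"{key}: {value}" for key, value in provenance.items()]
    (out / "provenance.yaml").write_text("\n".join(lines) + "\n")
    return out


def check_contract(root, spec_id):
    """Return a list of violations; an empty list means the lane passes."""
    out = Path(root) / "results" / spec_id
    if not out.is_dir():
        return [f"{spec_id}: no results directory"]
    found = {p.name for p in out.iterdir()}
    problems = [f"{spec_id}: missing {n}" for n in sorted(CONTRACT_FILES - found)]
    problems += [f"{spec_id}: unexpected {n}" for n in sorted(found - CONTRACT_FILES)]
    if "estimates.json" in found:
        keys = set(json.loads((out / "estimates.json").read_text()))
        problems += [f"{spec_id}: estimates.json lacks {k}"
                     for k in sorted(ESTIMATE_KEYS - keys)]
    return problems


with tempfile.TemporaryDirectory() as root:
    # Made-up numbers for a hypothetical lane "s00", not results from the lesson.
    write_contract(
        root, "s00",
        {"beta": 0.5, "se": 0.1, "ci_lo": 0.3, "ci_hi": 0.7},
        {"spec_id": "s00", "data_hash": "PLACEHOLDER",
         "code_commit": "PLACEHOLDER", "wall_clock_s": 0},
    )
    ok = check_contract(root, "s00")
    missing = check_contract(root, "s99")
    (Path(root) / "results" / "s00" / "notes.txt").write_text("stray")
    stray = check_contract(root, "s00")
```

Returning a list of violations instead of raising on the first one lets the gate report every broken lane in a single pass, which matters when thirteen lanes land at once.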

Fanning out

The two tools reach the same fleet from opposite directions, and the contrast is the lesson. Claude Code composes a fleet from a script you write — explicit, inspectable, local orchestration. Codex reads the manifest as the fan-out — the table is the program. Each gets its full native treatment; the translation guide afterward maps intents, not syntax.

Claude Code

Dynamic Workflows — you script the fleet

A Dynamic Workflow is a JavaScript orchestration script the agent runs on your behalf: it reads the manifest, spawns one specification agent per row, gates each on the results contract, and assembles the curve. You invoke it by asking for a workflow (“use a workflow to run the specification curve”); the stages are ordinary code you can read:

import { parse } from 'csv-parse/sync';
import { readFileSync } from 'node:fs';

const specs = parse(readFileSync('specifications.csv'), { columns: true });

// Stage 1 — fan out, ≤16 concurrent. Pass the SPEC ID, not a dataframe:
// the worker reads the panel itself from the path in its provenance.
const results = await mapConcurrent(specs, 16, (spec) =>
  agent.run({
    prompt: `Fit specification ${spec.spec_id} from specifications.csv.
             Write results/${spec.spec_id}/estimates.json and provenance.yaml.
             Read nothing outside the manifest row and the panel.`,
    worktree: spec.functional_form === 'new_estimator',
  }),
);

// Stage 2 — gate each lane on the results contract before it counts.
for (const spec of specs) assertContract(`results/${spec.spec_id}`);

// Stage 3 — assemble the curve from the gated estimates.
buildSpecCurve(specs.map((s) => `results/${s.spec_id}/estimates.json`));

The script is the point: concurrency cap, contract gate, and curve assembly are all visible, versioned, and yours to audit. The orchestration lives in your repo, not in a prompt you cannot re-read. Pass paths, never dataframes — the second you pipe a panel through a prompt you have paid to serialize the thing you fanned out to parallelize.
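The concurrency cap itself is an ordinary bounded worker pool. As a language-neutral illustration of the idea (not the workflow runtime's API), here is a Python sketch in which `run_fleet` and `fake_lane` are hypothetical stand-ins: each lane receives only a spec id and returns the path it wrote.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fleet(spec_ids, run_one, cap=16):
    """Run run_one(spec_id) for every id, at most `cap` at a time.

    Results come back in manifest order, whatever order the lanes finish in.
    """
    with ThreadPoolExecutor(max_workers=cap) as pool:
        return list(pool.map(run_one, spec_ids))


def fake_lane(spec_id):
    # Stand-in for a real agent call: it is handed only the id and reports
    # the path it would write, never the panel itself.
    return f"results/{spec_id}/estimates.json"


paths = run_fleet(["s01", "s02", "s03"], fake_lane, cap=2)
```

With three ids and a cap of two, the third lane simply waits for a free slot; the total work is unchanged, only the wall-clock shrinks.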

Codex

spawn_agents_on_csv — the manifest is the fleet

spawn_agents_on_csv (experimental) makes the manifest itself the fan-out: one worker per row, with {column} placeholders filled from that row, every worker reporting back through the job-result call. There is no orchestration script — the table is the program:

Use spawn_agents_on_csv over specifications.csv, max_threads 16.
For each row, fit specification {spec_id}: outcome {outcome}, fixed
effects {fixed_effects}, {functional_form} form, {sample_filter} sample,
weather variable {weather_var}. Write results/{spec_id}/estimates.json
and provenance.yaml. Report the coefficient and CI via the job result.
Read nothing outside your row and the panel.

max_threads governs concurrency the way the workflow’s cap does; the managed alternative is a best-of-N cloud run, where N variants of a harder spec are tried and the run returns the strongest with its reasoning. The design philosophy is the mirror image of the scripted fleet: you delegate the orchestration to the runtime and describe the work declaratively, trading the workflow’s auditable script for less code to maintain. The contract carries the same weight either way — the {spec_id} placeholder is what guarantees each worker writes its own directory and no other.
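The role of the {spec_id} placeholder is easy to see outside any particular tool. This sketch fills a prompt template from manifest rows with plain Python string formatting; it illustrates the substitution idea only and makes no claim about how Codex implements it. `TEMPLATE`, `MANIFEST`, and `render_prompts` are hypothetical names.

```python
import csv
import io

# Hypothetical template using the manifest's column names as placeholders.
TEMPLATE = (
    "Fit specification {spec_id}: outcome {outcome}, fixed effects "
    "{fixed_effects}, {functional_form} form, {sample_filter} sample, "
    "weather variable {weather_var}. "
    "Write results/{spec_id}/estimates.json and provenance.yaml."
)

# Two rows copied from the manifest shown earlier.
MANIFEST = """spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm
s09,log_pickups,zone+hod,linear,outer_boroughs,precip_mm
"""


def render_prompts(manifest_text, template):
    """Return one filled prompt per manifest row, keyed by spec_id."""
    rows = csv.DictReader(io.StringIO(manifest_text))
    # format_map raises KeyError if the template names a missing column.
    return {row["spec_id"]: template.format_map(row) for row in rows}


prompts = render_prompts(MANIFEST, TEMPLATE)
```

Because the output path is built from the same row as the instructions, two rows can only collide on disk if they share a spec_id.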

Translation guide
Intent	Claude Code	Codex
run a 12-spec curve in parallel	Dynamic Workflow (JS): fan-out → contract gate → assemble	spawn_agents_on_csv over the manifest, one worker per row
cap concurrency	the workflow’s concurrency limit (≤16 here)	max_threads on the spawn call
isolate code-touching variants	spawn the lane in a worktree from the script	route the row to a cloud task / isolated container
try N variants of a hard spec	loop the spec in the workflow, pick by a written rule	best-of-N cloud run (managed)

Budget first

Field note

A fleet multiplies tokens by the fleet size. Twelve specification agents, each reading the panel and writing a fit, can out-spend a week of ordinary sessions in an afternoon — and the concurrency cap buys you wall-clock, not a discount: sixteen agents at once cost the same tokens as sixteen agents in series, only sooner. Price the run before you launch it, and re-price it before you re-run “everything, to be fair.”

The calculator below seeds from the 12-spec manifest. Move the fleet size and the cap and watch the two numbers diverge: the cap pulls wall-clock down and leaves total cost flat. Budgeting a fleet is a deliberate act, not a surprise on the invoice.

Orchestration economics: token-cost calculator

Total cost: $21.60 ($1.80 / run · 1,440,000 tokens)
Serial wall-clock: 18m (one researcher, one run at a time)
Concurrent (cap 16): 1m 30s (1 wave · 12.0× faster)

Same dollars either way: fan-out buys wall-clock, not a discount. Budget the tokens first; the cap is how the fleet stays affordable and fast at once.
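The calculator's arithmetic is short enough to write down. The sketch below reproduces the displayed defaults; the per-run inputs (120,000 tokens, 1.5 minutes, $15 per million tokens) are back-solved from the totals shown and are assumptions, as is the simplification that every run takes the same time. `fleet_budget` is a hypothetical helper.

```python
import math


def fleet_budget(n_specs, tokens_per_run, usd_per_million_tokens,
                 minutes_per_run, cap):
    """Return total tokens, total cost, and serial vs capped wall-clock."""
    total_tokens = n_specs * tokens_per_run
    total_cost = total_tokens / 1_000_000 * usd_per_million_tokens
    waves = math.ceil(n_specs / cap)          # batches needed under the cap
    serial_min = n_specs * minutes_per_run
    concurrent_min = waves * minutes_per_run  # assumes equal-length runs
    return {
        "total_tokens": total_tokens,
        "total_cost_usd": round(total_cost, 2),
        "waves": waves,
        "serial_min": serial_min,
        "concurrent_min": concurrent_min,
        "speedup": serial_min / concurrent_min,
    }


# Inputs back-solved from the calculator's displayed defaults.
budget = fleet_budget(12, 120_000, 15.0, 1.5, cap=16)
tighter_cap = fleet_budget(12, 120_000, 15.0, 1.5, cap=4)
```

Dropping the cap from 16 to 4 triples the wall-clock (three waves instead of one) and leaves the cost untouched, which is the whole point of pricing the run before choosing the cap.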


The referee catches it

The fleet returns thirteen estimates and they look reassuring — clustered, mostly significant, the precipitation effect modestly positive across the board. One of them is also a plant. The starter repo baited an endogenous control: a specification that adds same-hour citywide demand to the right-hand side. It reads clean. No schema is wrong, no file is corrupt, no contract is violated — the C2 hooks have nothing to bite on, because nothing here is a rule violation. It is a judgment error, and judgment is the one thing a regex cannot supply.

So before the reveal, sit in the senior colleague’s chair for ninety seconds. Below is the same-hour-demand specification, as it would arrive in a robustness table. One line on the right-hand side is doing something no honest control should. Mark it.

Review bench: The right-hand side of spec s13

This is the same-hour-demand specification from the robustness fleet, fit and reported like any other lane. Its precipitation coefficient is the tightest in the whole curve — which should make you read its regressors, not trust them. One term on the right-hand side is doing something no honest control should. Mark it and file your suspicion.

results/s13/provenance.yaml → rhs_terms (the regressors, in fit order)


Now watch the fleet assemble the whole curve, with the baited spec in it. Each lane lands its estimate; the coefficients sort themselves into the curve; one whisker sits visibly off the trend the other twelve trace:

Grand Central · fleet dispatch: Weather & mobility, the specification curve

lane	outcome	weather	form	fixed effects
precip-pickups-zonehour	pickups	precip	log	zone+hour
precip-pickups-zonedow	pickups	precip	log	zone×dow
precip-pickups-weekday	pickups	precip	log	zone+hour
precip-pickups-airportexcl	pickups	precip	poisson	zone+hour
snow-pickups-zonehour	pickups	snow	log	zone+hour
snow-pickups-weekday	pickups	snow	poisson	zone×dow
temp-pickups-zonehour	pickups	temp	level	zone+hour
precip-duration-zonehour	trip_duration	precip	log	zone+hour
precip-duration-airportexcl	trip_duration	precip	level	zone+hour
snow-duration-zonedow	trip_duration	snow	log	zone×dow
temp-duration-weekday	trip_duration	temp	level	zone+hour
sh-demand-control	pickups	precip	log	zone+hour

All twelve lanes start at platform; no specs have returned until the fleet is dispatched.

That off-trend point is s13, and the reason a rule could never have caught it is the reason a referee can. The referee is C1’s demanding adviser, evolved: a skill that runs as an isolated subagent over the manifest, the code, and the results, and refuses to accept any claim without evidence attached (which file, which line, which number). It reads s13, follows the regressor back to its definition, and files the finding. Because it runs isolated, it shares none of your context and inherits none of your fatigue, so it can ask the question you were too close to ask: what is on the right-hand side of that regression?

Guided Run: Caught by the Referee (Field Terminal session d4-referee, Claude Code)

The finding is real, and the numbers are specific. The honest precipitation elasticity, with date fixed effects and no endogenous control, is +0.009 log-points (s02). The baited spec — identical fixed effects, plus same-hour citywide demand on the right — reads +0.004, on a CI of [+0.0005, +0.0071] that looks reassuringly tight. It is tight because the demand control (its own coefficient a thumping +0.685) absorbs precisely the variance precipitation works through: weather moves taxi demand via aggregate volume, so conditioning on same-hour volume is conditioning on a post-treatment outcome. The estimate did not get more precise. It got disabled, and the disabling left a tidy residual standard error in its wake.
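The mechanism can be reproduced on synthetic data. In this sketch, rain moves citywide demand a little, citywide demand drives zone pickups a lot, and the regression is fit with and without the demand control. Every coefficient below is invented for illustration and none comes from the lesson's panel; `ols_one` and `ols_two` are hand-rolled helpers, not a library API.

```python
import math
import random


def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ols_one(y, x):
    """Slope and standard error from regressing y on x (with intercept)."""
    y, x = center(y), center(x)
    sxx = dot(x, x)
    b = dot(x, y) / sxx
    rss = sum((yi - b * xi) ** 2 for yi, xi in zip(y, x))
    return b, math.sqrt(rss / (len(y) - 2) / sxx)


def ols_two(y, x, z):
    """Slope on x and its standard error when z is added as a control."""
    y, x, z = center(y), center(x), center(z)
    sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
    sxy, szy = dot(x, y), dot(z, y)
    det = sxx * szz - sxz ** 2
    bx = (szz * sxy - sxz * szy) / det
    bz = (sxx * szy - sxz * sxy) / det
    rss = sum((yi - bx * xi - bz * zi) ** 2 for yi, xi, zi in zip(y, x, z))
    return bx, math.sqrt(rss / (len(y) - 3) * szz / det)


random.seed(0)
n = 20_000
rain = [random.gauss(0, 1) for _ in range(n)]
# Weather moves citywide demand (the post-treatment variable).
city = [0.1 * r + random.gauss(0, 1) for r in rain]
# Zone pickups: a small direct effect plus a large pass-through from demand.
pickups = [0.05 * r + 0.7 * c + random.gauss(0, 0.3)
           for r, c in zip(rain, city)]

total_b, total_se = ols_one(pickups, rain)       # total effect of rain
ctrl_b, ctrl_se = ols_two(pickups, rain, city)   # rain, holding demand fixed
```

By construction the total effect of rain is 0.05 + 0.7 × 0.1 = 0.12, while the controlled regression recovers only the direct 0.05. Its standard error is also smaller, because the control soaks up most of the outcome's variance: a smaller coefficient with a tighter interval, which is the same signature s13 shows.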

Incident report № 2

Caught by the referee

Date: 2024-06
Subject: endogenous control — same-hour citywide demand on the RHS of spec s13
Cause: specifications.csv row s13 adds log_city (same-hour citywide demand), a post-treatment regressor weather acts through
Location: src/specs/build_rhs.py:47 — control_terms.append("log_city")
Effect: precip elasticity collapses +0.009 (s02) → +0.004 on a deceptively tight CI [+0.0005, +0.0071]; control beta +0.685
Detection: referee skill, isolated subagent — flagged, demanded the regressor definition, traced it to the build script
Human action: dropped log_city from s13; re-ran the one affected lane

Filed to journal/spec-curve-referee.md by the run above. No hook could have caught this: nothing was malformed, nothing was renamed, every contract passed. C2’s hook caught a broken rule; the referee caught a bad judgment. A lab needs both, and now both have caught something real.

The full curve, with s13 flagged off-trend and the honest specs spanning +0.004 to +0.036 log-points, is the figure the report carries — the endogenous spec shown, struck through, and explained rather than quietly dropped:

The specification curve, and the spec that lies — Thirteen specifications of the precipitation elasticity (outcome × fixed-effects × functional form × sample filter × weather variable), sorted by coefficient: legitimate specs span β = +0.004 to +0.036 log-points. The flagged spec s13 conditions on same-hour citywide demand — an endogenous, post-treatment control — and collapses the effect to +0.004 on a deceptively tight CI. Illustrative run on the course's data slice.

Both tables that follow print values rounded to two decimal places, so a printed 0 is a small rounded value, not an exact zero: s13's beta is the +0.004 quoted above, s02's is +0.009, and the log_city control's is +0.685.

spec_id	outcome	fixed_effects	functional_form	sample_filter	weather_var	beta	se	ci_lo	ci_hi	n	endogenous
s01	log_pickups	zone + hod	linear	all_zone_hours	precip_mm	0.01	0	0.01	0.02	248,099
s02	log_pickups	zone + hod + date	linear	all_zone_hours	precip_mm	0.01	0	0.01	0.01	248,099
s03	log_pickups	zone	linear	all_zone_hours	precip_mm	0.03	0	0.03	0.04	248,099
s04	log_pickups	zone + hod + dow	linear	all_zone_hours	precip_mm	0.01	0	0.01	0.01	248,099
s05	log_pickups	zone + hod	linear	all_zone_hours	rain_indicator	0.04	0	0.03	0.04	248,099
s06	log_pickups	zone + hod + date	linear	all_zone_hours	rain_indicator	0.04	0.01	0.02	0.05	248,099
s07	log_pickups	zone + hod	linear	manhattan_only	precip_mm	0.03	0	0.02	0.03	125,165
s08	log_pickups	zone + hod	linear	daytime_07_22	precip_mm	0.02	0	0.01	0.02	180,996
s09	log_pickups	zone + hod	linear	outer_boroughs	precip_mm	0	0	0	0.01	122,934
s10	log_pickups	zone + hod	linear	feb_mar_only	precip_mm	0.03	0	0.02	0.03	160,106
s11	log_pickups	zone + hod + date	plus_temp_control	all_zone_hours	precip_mm	0.01	0	0.01	0.01	248,099
s12	log_pickups	zone + hod	quadratic	all_zone_hours	precip_mm	0.01	0	0	0.02	248,099
s13	log_pickups	zone + hod + date	linear + same-hour demand control	all_zone_hours	precip_mm	0	0	0	0.01	248,099	yes

spec_id	s13
beta_precip	0
se_precip	0
control_beta_log_city	0.68
comparable_clean_spec	s02
comparable_clean_beta	0.01
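
Assembling the curve is a sort plus a flag: lanes are ordered by coefficient, and a lane marked endogenous stays in the figure but is marked for strike-through. A minimal sketch, assuming each gated estimate has been loaded into a dict; `build_spec_curve` here is a hypothetical Python counterpart to the idea, and the four betas are copied from the lesson's text and table (table values are rounded).

```python
def build_spec_curve(estimates):
    """Sort lanes by coefficient; keep flagged lanes, but mark them."""
    ordered = sorted(estimates, key=lambda e: e["beta"])
    return [
        {"rank": rank, "spec_id": e["spec_id"], "beta": e["beta"],
         "flagged": bool(e.get("endogenous", False))}
        for rank, e in enumerate(ordered, start=1)
    ]


# A four-lane subset; s02 and s13 use the values quoted in the text.
lanes = [
    {"spec_id": "s02", "beta": 0.009},
    {"spec_id": "s03", "beta": 0.03},
    {"spec_id": "s05", "beta": 0.04},
    {"spec_id": "s13", "beta": 0.004, "endogenous": True},
]
curve = build_spec_curve(lanes)
order = [row["spec_id"] for row in curve]
flagged = [row["spec_id"] for row in curve if row["flagged"]]
```

Keeping the flagged lane in the returned list, rather than filtering it out, mirrors the reporting choice above: the endogenous spec is shown and explained, not quietly dropped.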

The general lesson closes the unit’s argument about oversight. Enforcement (C2’s hooks) is for rules a machine can check: a column name, a null rate, a row delta. Adversarial review (the referee) is for judgment a machine cannot: whether a control belongs on the right-hand side, whether a sample was shrunk to make a pre-trend pass. They are not redundant — they catch different classes of error — and a one-person lab that wants to be trusted runs both.

Guided Run: The Fleet, Under Contract (Field Terminal session d4-fleet, Claude Code)

Field Assignment

Artifact: make check-d4 passes (curve assembled, baited regressor gone)

Run the specification curve as a fleet, then send a referee through it.

Write the manifest: at least twelve rows in specifications.csv, covering both outcomes (pickups and trip duration) across fixed-effects structures, functional forms, and sample filters. Commit the results contract alongside it.
Fan out in your primary tool — the workflow or spawn_agents_on_csv — under the concurrency cap you budgeted for, results-only agents in the shared tree, code-touching variants in worktrees. Run at least two rows in the other tool, so you have felt both fan-out philosophies.
Gate every lane on the results contract and assemble the specification curve from the gated estimates.
Run the referee skill as an isolated subagent over the manifest, code, and results. Read what it files: it should name the endogenous control by file, line, and number, not gesture at it.
Drop the baited regressor, re-run only the affected lane, and re-assemble. The coefficient will move from +0.004 back toward the +0.009 of its clean twin s02; that is the curve telling the truth. File the incident in journal/, then make check-d4.

make check-d4 verifies three things: every spec wrote its two contract files, the curve assembled from all of them, and the same-hour demand control is gone from the right-hand side. The clean curve is what F1’s results section is built on — robustness you ran in parallel and a referee you could not talk your way past.

Milestone gate · make check-d4 · advances D4

specifications.csv has at least twelve rows covering both outcomes (pickups and trip duration) across fixed effects, functional forms, and sample filters
One row per lane; the manifest is the only thing the whole fleet shares.
Every spec wrote exactly its two contract files — results/{spec_id}/estimates.json and provenance.yaml, its own directory, nothing else
Read-only fan-out can share one tree; code-touching variants get their own worktree.
At least two rows were run in the other tool, so you have felt both fan-out philosophies
Scripted Dynamic Workflow on one side, declarative spawn_agents_on_csv on the other.
The specification curve assembled from all the gated estimates
Honest specs span +0.004 to +0.036 log-points.
The referee ran as an isolated subagent and named the endogenous control by file, line, and number — not a vague gesture
src/specs/build_rhs.py:47 — control_terms.append("log_city"); filed as Incident Report #2.
The same-hour demand control is gone from the right-hand side, the affected lane re-ran, and the estimate returned to its honest value (+0.004 → ~+0.009)
The curve telling the truth: the tight CI was the bug, not the precision.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both]

Parallel agents without a contract and worktrees overwrite each other. Twelve workers writing to one results path is not a fleet, it is a race, and the survivor is whichever spec finished last. The results/{spec_id}/ contract is not optional hygiene — it is the only thing that makes the fan-out reconstructable. Code-touching lanes get worktrees on top.
[both]

A referee that does not demand evidence produces plausible nitpicks. “Consider whether s7 might suffer from selection” is the sound of a reviewer who read nothing — it could be said about any specification, which is exactly why it catches none. The referee earns its keep only when every claim carries a file, a line, and a number; bind it to that in the skill, or it becomes a generator of polite, ignorable doubt.
[CC]

Passing dataframes through workflow prompts defeats the purpose. If the orchestrator loads the panel and pipes it into each agent’s prompt, you have serialized the data movement you fanned out to avoid — and paid tokens for the panel twelve times. Pass the path; let each worker read the panel itself.
[CX]

spawn_agents_on_csv is experimental — the placeholder syntax and the job-result protocol have moved between releases. Pin your CLI version, keep the manifest small enough to dry-run first, and recheck the surface quarterly; an experimental fan-out that silently changed its {column} rules is a corrupted curve you will not notice until the referee does.
[both]

Re-running “everything, to be fair” without re-budgeting is how a fleet eats a grant. After the referee’s fix you need to re-run one lane, not thirteen. Re-price before you re-launch; the cap controls wall-clock, never spend.

Check Your Bearings

D4 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

Caught by the referee: record the endogenous control — what it was, how it disguised itself as precision, which file and line carried it, and what the honest estimate became once it was gone.

as of June 2026

The fan-out is a real philosophical split, not a feature gap, and this page teaches it as one: Claude Code composes a fleet from a Dynamic Workflow — a local JavaScript script you read, version, and audit — while Codex reads the manifest as the fan-out through the experimental spawn_agents_on_csv, with best-of-N cloud runs as the managed alternative. Scripted local orchestration versus declarative managed delegation; each reaches the same specification curve, and neither reproduces the other’s primitive natively. The results contract and the isolated referee are tool-neutral — they are project discipline, and they would catch the same endogenous control no matter which fleet ran it.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall: the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster, but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes: the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated: the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override the estimates here.

Fleets: Orchestrated Robustness

The Pain

Why / When

Mechanics

The experiment contract

Fanning out

Dynamic Workflows — you script the fleet

spawn_agents_on_csv — the manifest is the fleet

Budget first

The referee catches it

Guided Run — Caught by the Referee

analysis_panel 248,099 rows

spec_curve

endogenous_spec_s13

Guided Run — The Fleet, Under Contract

Field Assignment

Pitfalls & Gotchas

Check Your Bearings

Ledger — D4

The Lab Roster

Your position

Positions

Running Totals

Parity note