The Pain
The referee report asked for a specification curve, and it was right to. So now you are running it by hand. The main model, then the same model with date fixed effects, then with the airport zones dropped, then Poisson instead of log, then weekday-only, then the outer boroughs, then all of it again for trip duration instead of pickups. Each one is a twenty-minute edit and a forty-minute fit, and each one you run alone, in series, because the last time you tried to keep four going at once they wrote their results into the same file and you spent an afternoon working out which number belonged to which model.
By Thursday you have eleven figures in a folder and you have started to relax, because they all point the same way and several of them point the same way very precisely — tighter standard errors than the main run, which you note with the quiet satisfaction of someone who has stopped asking why. The robustness is the part of the report that is supposed to make you trust the result. Instead it has made you tired, and tired is exactly the condition under which a too-good number stops looking suspicious and starts looking like a relief.
A real lab has two people you do not. One is the research assistant who would have run all eleven overnight without complaint. The other is the senior colleague who reads the robustness table, stops at the suspiciously tight one, and asks the question you are too close to ask: what is on the right-hand side of that regression?
Why / When
A specification curve is robustness made honest: instead of reporting the one model that worked, you report the whole cloud of reasonable models — every defensible outcome, fixed-effects structure, functional form, and sample filter — and let the reader see how the estimate moves as you turn the dials. It is embarrassingly parallel work. Twelve specifications do not depend on each other; they depend only on the same panel. One researcher running them in series is a queue with a person standing in it.
This lesson runs the whole curve at once, and that raises a problem the serial researcher never had: when twelve workers share a project, what keeps their results comparable? The answer is not coordination — it is a contract. You fix, in advance and in writing, what each worker reads and exactly what it writes. Parallel agents that agree on a contract need not talk to each other at all, the way a pre-registered analysis plan lets a human team work apart and still combine.
The setup in this lesson absorbs two roles at once. The first is the research assistant bench that would grind through the curve. The second is rarer and more valuable: the principal investigator who keeps everyone’s results in the same units and reads the table with suspicion. The fleet handles the first. The lesson’s second half is about the second, because a fleet that runs eleven wrong specifications faster is not progress.
Contrary winds
Not for: a study with one defensible specification — if there is genuinely nothing to vary, a specification curve is theatre, and a fleet to run it is expensive theatre.
Mechanics
This is an orchestration lesson with a statistics sting in the tail. The fan-out machinery is language-neutral — the specifications it dispatches can be fit in Python or R without changing a line of the orchestration — so this page declares no R variants and spends its parity budget where the tools genuinely differ: how each one turns a manifest into a fleet.
The experiment contract
Before any agent runs, write the manifest. It is a flat table, one row per specification, and it is the only thing the whole fleet agrees on:
    spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var
    s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm
    s02,log_pickups,zone+hod+date,linear,all_zone_hours,precip_mm
    s03,log_pickups,zone,linear,all_zone_hours,precip_mm
    s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm
    s09,log_pickups,zone+hod,linear,outer_boroughs,precip_mm
    ...

The manifest is half the contract. The other half is a hard rule about output, and it is not negotiable:
Every spec writes exactly results/{spec_id}/estimates.json and results/{spec_id}/provenance.yaml: its own directory, its own two files, nothing else, nowhere else.
The estimates.json carries the coefficient, standard error, and CI;
the provenance.yaml records the manifest row, the data version hash,
the code commit, and the wall-clock — enough that a stranger can tell
which model produced which number. This is the rule the Pain vignette
broke by hand: four runs, one results file, and an afternoon spent
reconstructing which number was which. The contract makes the question
un-askable. The F4 film makes the failure visible — watch what happens
to the curve with the contract off, then on:
The job is a specification curve: twelve variants of the same model, each a row in specifications.csv with its outcome, fixed effects, functional form, sample filter, and weather variable. One manifest, twelve specs, and you want all twelve run at once.
A consequence worth stating plainly: agents that only write results can share one working tree safely, because the contract guarantees they never touch the same path. Agents that change code — a new estimator, a different filter — cannot, and get their own D2 worktree. Read-only fan-out is cheap; code-touching fan-out costs a worktree apiece. Sort your specifications into those two piles before you dispatch them.
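The results contract described above is mechanical enough to check in a few lines. The sketch below is one possible gate, not the course's own `assertContract`: the function names are hypothetical, and the required JSON keys (`beta`, `se`, `ci_lo`, `ci_hi`) are an assumption borrowed from the column names of the spec_curve table.

```python
import csv
import json
from pathlib import Path

CONTRACT_FILES = {"estimates.json", "provenance.yaml"}
REQUIRED_KEYS = {"beta", "se", "ci_lo", "ci_hi"}  # assumed key names, not confirmed by the lesson


def read_spec_ids(manifest_path):
    """Return the spec_ids listed in the manifest, rejecting duplicates."""
    with open(manifest_path, newline="") as fh:
        ids = [row["spec_id"] for row in csv.DictReader(fh)]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate spec_id in manifest")
    return ids


def contract_violations(results_root, spec_id):
    """List the ways results/{spec_id}/ breaks the contract (empty list means pass)."""
    lane = Path(results_root) / spec_id
    if not lane.is_dir():
        return [f"{spec_id}: missing directory"]
    found = {p.name for p in lane.iterdir()}
    problems = []
    # Exactly two files: anything extra or missing is a violation.
    if found != CONTRACT_FILES:
        problems.append(
            f"{spec_id}: expected {sorted(CONTRACT_FILES)}, found {sorted(found)}"
        )
    if "estimates.json" in found:
        estimates = json.loads((lane / "estimates.json").read_text())
        missing = REQUIRED_KEYS - set(estimates)
        if missing:
            problems.append(f"{spec_id}: estimates.json lacks {sorted(missing)}")
    return problems
```

Running `contract_violations` over every id from `read_spec_ids` before assembling the curve means a lane that wrote a stray file, or skipped one, is reported by name rather than discovered later.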
Fanning out
The two tools reach the same fleet from opposite directions, and the contrast is the lesson. Claude Code composes a fleet from a script you write — explicit, inspectable, local orchestration. Codex reads the manifest as the fan-out — the table is the program. Each gets its full native treatment; the translation guide afterward maps intents, not syntax.
Claude Code
Dynamic Workflows — you script the fleet
A Dynamic Workflow is a JavaScript orchestration script the agent runs on your behalf: it reads the manifest, spawns one specification agent per row, gates each on the results contract, and assembles the curve. You invoke it by asking for a workflow (“use a workflow to run the specification curve”); the stages are ordinary code you can read:
    import { parse } from 'csv-parse/sync';
    import { readFileSync } from 'node:fs';

    const specs = parse(readFileSync('specifications.csv'), { columns: true });

    // Stage 1: fan out, ≤16 concurrent. Pass the SPEC ID, not a dataframe:
    // the worker reads the panel itself from the path in its provenance.
    const results = await mapConcurrent(specs, 16, (spec) =>
      agent.run({
        prompt: `Fit specification ${spec.spec_id} from specifications.csv.
          Write results/${spec.spec_id}/estimates.json and provenance.yaml.
          Read nothing outside the manifest row and the panel.`,
        worktree: spec.functional_form === 'new_estimator',
      }),
    );

    // Stage 2: gate each lane on the results contract before it counts.
    for (const spec of specs) assertContract(`results/${spec.spec_id}`);

    // Stage 3: assemble the curve from the gated estimates.
    buildSpecCurve(specs.map((s) => `results/${s.spec_id}/estimates.json`));

The script is the point: concurrency cap, contract gate, and curve assembly are all visible, versioned, and yours to audit. The orchestration lives in your repo, not in a prompt you cannot re-read. Pass paths, never dataframes: the second you pipe a panel through a prompt you have paid to serialize the thing you fanned out to parallelize.
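For readers working on the fitting side, the concurrency cap can be sketched in Python. This is a minimal illustration of what a capped fan-out does, assuming nothing about how the workflow runtime implements `mapConcurrent`; `map_concurrent` and `fit_spec` are hypothetical names, and the sleep stands in for a real fit.

```python
import asyncio


async def map_concurrent(items, limit, worker):
    """Run worker(item) for every item, at most `limit` at a time; results keep input order."""
    gate = asyncio.Semaphore(limit)

    async def guarded(item):
        # Each lane waits here until one of the `limit` slots is free.
        async with gate:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fit_spec(spec_id):
    """Stand-in for one specification agent: it receives an ID, never the panel."""
    await asyncio.sleep(0.01)  # placeholder for the real fit
    return f"results/{spec_id}/estimates.json"


if __name__ == "__main__":
    spec_ids = [f"s{n:02d}" for n in range(1, 13)]
    paths = asyncio.run(map_concurrent(spec_ids, 4, fit_spec))
    print(paths)
```

The worker's only input is the spec id and its only output is a path, which is the "pass paths, never dataframes" rule expressed as a function signature.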
Codex
spawn_agents_on_csv — the manifest is the fleet
spawn_agents_on_csv (experimental) makes the manifest itself the
fan-out: one worker per row, with {column} placeholders filled from
that row, every worker reporting back through the job-result call.
There is no orchestration script — the table is the program:
    Use spawn_agents_on_csv over specifications.csv, max_threads 16.
    For each row, fit specification {spec_id}: outcome {outcome}, fixed
    effects {fixed_effects}, {functional_form} form, {sample_filter} sample,
    weather variable {weather_var}. Write results/{spec_id}/estimates.json
    and provenance.yaml. Report the coefficient and CI via the job result.
    Read nothing outside your row and the panel.

max_threads governs concurrency the way the workflow’s cap does; the
managed alternative is a best-of-N cloud run, where N variants of a
harder spec are tried and the run returns the strongest with its
reasoning. The design philosophy is the mirror image of the scripted
fleet: you delegate the orchestration to the runtime and describe the
work declaratively, trading the workflow’s auditable script for less
code to maintain. The contract carries the same weight either way — the
{spec_id} placeholder is what guarantees each worker writes its own
directory and no other.
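A cheap way to dry-run the manifest before spending tokens is to fill the placeholders yourself and read the prompts. The sketch below uses Python's own `str.format_map`; it is an assumption-laden stand-in, not a description of how spawn_agents_on_csv substitutes `{column}` values, and `prompts_from_manifest` is a hypothetical helper.

```python
import csv
import io

PROMPT_TEMPLATE = (
    "Fit specification {spec_id}: outcome {outcome}, fixed effects {fixed_effects}, "
    "{functional_form} form, {sample_filter} sample, weather variable {weather_var}. "
    "Write results/{spec_id}/estimates.json and provenance.yaml."
)


def prompts_from_manifest(manifest_text, template=PROMPT_TEMPLATE):
    """Yield one filled prompt per manifest row; a missing column raises KeyError."""
    for row in csv.DictReader(io.StringIO(manifest_text)):
        yield template.format_map(row)


if __name__ == "__main__":
    manifest = (
        "spec_id,outcome,fixed_effects,functional_form,sample_filter,weather_var\n"
        "s01,log_pickups,zone+hod,linear,all_zone_hours,precip_mm\n"
        "s07,log_pickups,zone+hod,linear,manhattan_only,precip_mm\n"
    )
    for prompt in prompts_from_manifest(manifest):
        print(prompt)
```

If a column is renamed in the manifest but not in the template, the dry run fails loudly on the first row instead of producing twelve subtly wrong prompts.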
| Intent | Claude Code | Codex |
|---|---|---|
| run a 12-spec curve in parallel | Dynamic Workflow (JS): fan-out → contract gate → assemble | spawn_agents_on_csv over the manifest, one worker per row |
| cap concurrency | the workflow’s concurrency limit (≤16 here) | max_threads on the spawn call |
| isolate code-touching variants | spawn the lane in a worktree from the script | route the row to a cloud task / isolated container |
| try N variants of a hard spec | loop the spec in the workflow, pick by a written rule | best-of-N cloud run (managed) |
Budget first
Field note
A fleet multiplies tokens by the fleet size. Twelve specification agents, each reading the panel and writing a fit, can out-spend a week of ordinary sessions in an afternoon — and the concurrency cap buys you wall-clock, not a discount: sixteen agents at once cost the same tokens as sixteen agents in series, only sooner. Price the run before you launch it, and re-price it before you re-run “everything, to be fair.”
The calculator below seeds from the 12-spec manifest. Move the fleet size and the cap and watch the two numbers diverge: the cap pulls wall-clock down and leaves total cost flat. Budgeting a fleet is a deliberate act, not a surprise on the invoice.
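The arithmetic behind that divergence is short enough to write down. This is a rough sketch under simplifying assumptions: every lane costs the same tokens and takes the same time, and lanes run in full waves of `cap`. The token figure in the usage example is an invented placeholder; the forty minutes is the fit time from the opening vignette.

```python
import math


def fleet_budget(n_specs, cap, tokens_per_spec, minutes_per_spec):
    """Return (total_tokens, wall_clock_minutes) for a fleet run in waves of `cap`."""
    total_tokens = n_specs * tokens_per_spec  # does not depend on the cap
    waves = math.ceil(n_specs / cap)          # lanes run `cap` at a time
    return total_tokens, waves * minutes_per_spec


if __name__ == "__main__":
    # 400_000 tokens per lane is a hypothetical figure, not a measured one.
    for cap in (1, 4, 16):
        tokens, minutes = fleet_budget(12, cap, 400_000, 40)
        print(f"cap={cap:>2}  tokens={tokens:,}  wall-clock={minutes} min")
```

Raising the cap from 1 to 16 takes the twelve-spec run from 480 minutes to 40 while the token total stays where it was, which is the point of the field note.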
The referee catches it
The fleet returns thirteen estimates and they look reassuring — clustered, mostly significant, the precipitation effect modestly positive across the board. One of them is also a plant. The starter repo baited an endogenous control: a specification that adds same-hour citywide demand to the right-hand side. It reads clean. No schema is wrong, no file is corrupt, no contract is violated — the C2 hooks have nothing to bite on, because nothing here is a rule violation. It is a judgment error, and judgment is the one thing a regex cannot supply.
So before the reveal, sit in the senior colleague’s chair for ninety seconds. Below is the same-hour-demand specification from the robustness fleet, fit and reported like any other lane, as it would arrive in a robustness table. Its precipitation coefficient is the tightest in the whole curve, which should make you read its regressors, not trust them. One term on the right-hand side is doing something no honest control should. Mark it and file your suspicion.
results/s13/provenance.yaml → rhs_terms (the regressors, in fit order)
Now watch the fleet assemble the whole curve, with the baited spec in it. Each lane lands its estimate; the coefficients sort themselves into the curve; one whisker sits visibly off the trend the other twelve trace:
[Animation of the fleet assembling the curve. Lanes: precip-pickups-zonehour, precip-pickups-zonedow, precip-pickups-weekday, precip-pickups-airportexcl, snow-pickups-zonehour, snow-pickups-weekday, temp-pickups-zonehour, precip-duration-zonehour, precip-duration-airportexcl, snow-duration-zonedow, temp-duration-weekday, sh-demand-control.]
That off-trend point is s13, and the reason a rule could never have
caught it is the reason a referee can. The referee is C1’s demanding
adviser, evolved: a skill that runs as an isolated subagent over the
manifest, the code, and the results, and refuses to accept any claim
without evidence attached — which file, which line, which number. It
reads s13, follows the regressor back to its definition, and files the
finding. Run as an isolated subagent so it shares none of your context
and inherits none of your fatigue, it asks the question you were too
close to ask:
Guided Run — Caught by the Referee
The finding is real, and the numbers are specific. The honest
precipitation elasticity, with date fixed effects and no endogenous
control, is +0.009 log-points (s02). The baited spec — identical
fixed effects, plus same-hour citywide demand on the right — reads
+0.004, on a CI of [+0.0005, +0.0071] that looks reassuringly tight.
It is tight because the demand control (its own coefficient a thumping
+0.685) absorbs precisely the variance precipitation works through:
weather moves taxi demand via aggregate volume, so conditioning on
same-hour volume is conditioning on a post-treatment outcome. The
estimate did not get more precise. It got disabled, and the disabling
left a tidy residual standard error in its wake.
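The mechanism is easy to reproduce on synthetic data. The toy simulation below is not the course's panel: the effect sizes (0.5, 0.7, 0.1), the sample size, and the seed are invented for illustration. Rain moves citywide demand, citywide demand moves pickups, and the same outcome is regressed on rain with and without the demand control.

```python
import random


def demean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit(y, x, control=None):
    """OLS slope on x with an intercept, optionally adding one control.

    Returns (slope_on_x, residual_variance).
    """
    y, x = demean(y), demean(x)
    if control is None:
        slope = dot(x, y) / dot(x, x)
        resid = [yi - slope * xi for yi, xi in zip(y, x)]
        dof = len(y) - 2
    else:
        c = demean(control)
        sxx, scc, sxc = dot(x, x), dot(c, c), dot(x, c)
        sxy, scy = dot(x, y), dot(c, y)
        det = sxx * scc - sxc ** 2
        slope = (sxy * scc - scy * sxc) / det   # coefficient on x
        gamma = (scy * sxx - sxy * sxc) / det   # coefficient on the control
        resid = [yi - slope * xi - gamma * ci for yi, xi, ci in zip(y, x, c)]
        dof = len(y) - 3
    return slope, dot(resid, resid) / dof


def simulate(n=5000, seed=0):
    """Return ((slope, resid_var) without control, (slope, resid_var) with control)."""
    rng = random.Random(seed)
    rain = [rng.gauss(0, 1) for _ in range(n)]
    # Rain moves citywide demand; citywide demand moves zone pickups.
    city = [0.5 * r + rng.gauss(0, 1) for r in rain]
    pickups = [0.1 * r + 0.7 * c + rng.gauss(0, 1) for r, c in zip(rain, city)]
    return fit(pickups, rain), fit(pickups, rain, control=city)


if __name__ == "__main__":
    (total, var_total), (direct, var_direct) = simulate()
    print(f"no control:   slope={total:.3f}  residual variance={var_total:.3f}")
    print(f"with control: slope={direct:.3f}  residual variance={var_direct:.3f}")
```

By construction the total effect of rain is 0.1 + 0.5 × 0.7 = 0.45, while the controlled regression recovers only the 0.1 that bypasses citywide demand, and it does so with a smaller residual variance. That is the same pattern as s02 against s13: a smaller coefficient and a tidier fit, both produced by conditioning on the channel the effect travels through.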
The full curve, with s13 flagged off-trend and the honest specs
spanning +0.004 to +0.036 log-points, is the figure the report
carries — the endogenous spec shown, struck through, and explained
rather than quietly dropped:
The numbers behind this figure

analysis_panel (248,099 rows)

    SELECT location_id, borough, ts_local, pickups,
           precipitation, temperature_2m, snowfall, wind_speed_10m
    FROM panel_zone_hour
    WHERE borough IN ('Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island')

spec_curve
| spec_id | outcome | fixed_effects | functional_form | sample_filter | weather_var | beta | se | ci_lo | ci_hi | n | endogenous |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s01 | log_pickups | zone + hod | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.02 | 248,099 | |
| s02 | log_pickups | zone + hod + date | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s03 | log_pickups | zone | linear | all_zone_hours | precip_mm | 0.03 | 0 | 0.03 | 0.04 | 248,099 | |
| s04 | log_pickups | zone + hod + dow | linear | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s05 | log_pickups | zone + hod | linear | all_zone_hours | rain_indicator | 0.04 | 0 | 0.03 | 0.04 | 248,099 | |
| s06 | log_pickups | zone + hod + date | linear | all_zone_hours | rain_indicator | 0.04 | 0.01 | 0.02 | 0.05 | 248,099 | |
| s07 | log_pickups | zone + hod | linear | manhattan_only | precip_mm | 0.03 | 0 | 0.02 | 0.03 | 125,165 | |
| s08 | log_pickups | zone + hod | linear | daytime_07_22 | precip_mm | 0.02 | 0 | 0.01 | 0.02 | 180,996 | |
| s09 | log_pickups | zone + hod | linear | outer_boroughs | precip_mm | 0 | 0 | 0 | 0.01 | 122,934 | |
| s10 | log_pickups | zone + hod | linear | feb_mar_only | precip_mm | 0.03 | 0 | 0.02 | 0.03 | 160,106 | |
| s11 | log_pickups | zone + hod + date | plus_temp_control | all_zone_hours | precip_mm | 0.01 | 0 | 0.01 | 0.01 | 248,099 | |
| s12 | log_pickups | zone + hod | quadratic | all_zone_hours | precip_mm | 0.01 | 0 | 0 | 0.02 | 248,099 | |
| s13 | log_pickups | zone + hod + date | linear + same-hour demand control | all_zone_hours | precip_mm | 0 | 0 | 0 | 0.01 | 248,099 | yes |
endogenous_spec_s13
| spec_id | s13 |
|---|---|
| beta_precip | 0 |
| se_precip | 0 |
| control_beta_log_city | 0.68 |
| comparable_clean_spec | s02 |
| comparable_clean_beta | 0.01 |
Honesty note. Illustrative run on the course's 2024-02/03/06 slice: outcome is log(pickups) over nonzero zone-hours so all thirteen coefficients are comparable (log-points); whiskers are 95% classical CIs after FE absorption (iterative within-demeaning, dof-corrected for the absorbed dummies). ENDOGENOUS-CONTROL NOTE: spec s13 adds same-hour citywide demand (log_city) as a regressor. Weather moves demand THROUGH aggregate volume, so conditioning on it is a post-treatment / endogenous control: the precipitation coefficient falls from +0.009 (s02, identical FE) to +0.004, and the control absorbs the variance so the CI looks reassuringly tight. This is D4's planted bug, the finding the isolated referee files in Incident Report #2.
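Keeping a flagged spec in the figure while excluding it from the reported span is a small piece of logic worth making explicit. The sketch below is a hypothetical `assemble_curve`, not the course's `buildSpecCurve`. In the usage data, the s02 and s13 coefficients come from the text above; `lane_lo` and `lane_hi` are placeholder labels standing in for the endpoints of the stated +0.004 to +0.036 span, not real spec ids.

```python
def assemble_curve(estimates):
    """Sort lanes by coefficient and report the span of the non-flagged ones.

    Flagged lanes stay in the ordered list (shown, not dropped) but do not
    contribute to the span.
    """
    ordered = sorted(estimates, key=lambda e: e["beta"])
    honest = [e["beta"] for e in ordered if not e.get("endogenous")]
    return ordered, (min(honest), max(honest))


if __name__ == "__main__":
    lanes = [
        {"spec_id": "s02", "beta": 0.009},
        {"spec_id": "s13", "beta": 0.004, "endogenous": True},
        {"spec_id": "lane_lo", "beta": 0.004},  # placeholder for the low end of the span
        {"spec_id": "lane_hi", "beta": 0.036},  # placeholder for the high end of the span
    ]
    ordered, span = assemble_curve(lanes)
    for lane in ordered:
        mark = " (flagged)" if lane.get("endogenous") else ""
        print(f"{lane['spec_id']}: {lane['beta']:+.3f}{mark}")
    print("honest span:", span)
```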
The general lesson closes the unit’s argument about oversight. Enforcement (C2’s hooks) is for rules a machine can check: a column name, a null rate, a row delta. Adversarial review (the referee) is for judgment a machine cannot: whether a control belongs on the right-hand side, whether a sample was shrunk to make a pre-trend pass. They are not redundant — they catch different classes of error — and a one-person lab that wants to be trusted runs both.
Guided Run — The Fleet, Under Contract
Field Assignment
Artifact: make check-d4 passes (curve assembled, baited regressor gone)
Run the specification curve as a fleet, then send a referee through it.
- Write the manifest: at least twelve rows in specifications.csv, covering both outcomes (pickups and trip duration) across fixed-effects structures, functional forms, and sample filters. Commit the results contract alongside it.
- Fan out in your primary tool, the workflow or spawn_agents_on_csv, under the concurrency cap you budgeted for, results-only agents in the shared tree, code-touching variants in worktrees. Run at least two rows in the other tool, so you have felt both fan-out philosophies.
- Gate every lane on the results contract and assemble the specification curve from the gated estimates.
- Run the referee skill as an isolated subagent over the manifest, code, and results. Read what it files: it should name the endogenous control by file, line, and number, not gesture at it.
- Drop the baited regressor, re-run only the affected lane, and re-assemble. The estimate will weaken honestly; that is the curve telling the truth. File the incident in journal/, then make check-d4.
make check-d4 verifies three things: every spec wrote its two contract
files, the curve assembled from all of them, and the same-hour demand
control is gone from the right-hand side. The clean curve is what F1’s
results section is built on — robustness you ran in parallel and a referee
you could not talk your way past.
make check-d4 advances D4

- One row per lane; the manifest is the only thing the whole fleet shares.
- Read-only fan-out can share one tree; code-touching variants get their own worktree.
- Scripted Dynamic Workflow on one side, declarative spawn_agents_on_csv on the other.
- Honest specs span +0.004 to +0.036 log-points.
- src/specs/build_rhs.py:47: control_terms.append("log_city"); filed as Incident Report #2.
- The curve telling the truth: the tight CI was the bug, not the precision.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both] Parallel agents without a contract and worktrees overwrite each other. Twelve workers writing to one results path is not a fleet, it is a race, and the survivor is whichever spec finished last. The results/{spec_id}/ contract is not optional hygiene; it is the only thing that makes the fan-out reconstructable. Code-touching lanes get worktrees on top.
- [both] A referee that does not demand evidence produces plausible nitpicks. “Consider whether s7 might suffer from selection” is the sound of a reviewer who read nothing: it could be said about any specification, which is exactly why it catches none. The referee earns its keep only when every claim carries a file, a line, and a number; bind it to that in the skill, or it becomes a generator of polite, ignorable doubt.
- [CC] Passing dataframes through workflow prompts defeats the purpose. If the orchestrator loads the panel and pipes it into each agent’s prompt, you have serialized the data movement you fanned out to avoid, and paid tokens for the panel twelve times. Pass the path; let each worker read the panel itself.
- [CX] spawn_agents_on_csv is experimental: the placeholder syntax and the job-result protocol have moved between releases. Pin your CLI version, keep the manifest small enough to dry-run first, and recheck the surface quarterly; an experimental fan-out that silently changed its {column} rules is a corrupted curve you will not notice until the referee does.
- [both] Re-running “everything, to be fair” without re-budgeting is how a fleet eats a grant. After the referee’s fix you need to re-run one lane, not thirteen. Re-price before you re-launch; the cap controls wall-clock, never spend.
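The evidence rule for the referee, from the second pitfall above, can be enforced with a crude filter before a finding is accepted. The sketch below is only a first-pass screen under stated assumptions: the two regular expressions and the `is_evidenced` name are hypothetical, and passing the screen says nothing about whether the finding is correct.

```python
import re

# A citation like src/specs/build_rhs.py:47 (a path with an extension, then a line number).
FILE_LINE = re.compile(r"[\w./-]+\.\w+:\d+")
# A signed or unsigned decimal such as +0.004 or 0.685.
NUMBER = re.compile(r"[+-]?\d+\.\d+")


def is_evidenced(finding):
    """True only if the finding cites a file:line and, separately, a decimal number."""
    citation = FILE_LINE.search(finding)
    if citation is None:
        return False
    # Look for the number outside the citation itself.
    rest = finding[:citation.start()] + finding[citation.end():]
    return NUMBER.search(rest) is not None


if __name__ == "__main__":
    vague = "Consider whether s7 might suffer from selection"
    concrete = (
        "src/specs/build_rhs.py:47 appends log_city; "
        "precip beta falls from +0.009 (s02) to +0.004 (s13)"
    )
    print(is_evidenced(vague), is_evidenced(concrete))
```

A filter like this only rejects findings that cite nothing; deciding whether the cited regressor belongs on the right-hand side remains the referee's judgment.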
Parity note
The fan-out is a real philosophical split, not a feature gap, and this
page teaches it as one: Claude Code composes a fleet from a Dynamic
Workflow — a local JavaScript script you read, version, and audit — while
Codex reads the manifest as the fan-out through the experimental
spawn_agents_on_csv, with best-of-N cloud runs as the managed
alternative. Scripted local orchestration versus declarative managed
delegation; each reaches the same specification curve, and neither
reproduces the other’s primitive natively. The results contract and the
isolated referee are tool-neutral — they are project discipline, and they
would catch the same endogenous control no matter which fleet ran it.