The Pain
The result is finished, which is to say it is finished until you touch it again. It is Sunday. The estimate is on disk, the figures are current, the draft reads cleanly, and you know — the way you know the milk is about to turn — that the whole arrangement is true only as of the last time you ran it by hand. A reviewer’s question on Tuesday will mean a new specification, which means a re-estimate, which means a new table, which means a paragraph rewritten to match the table, which means re-reading the abstract to check it still says what the numbers now say. Each of those steps is a thing you do, in order, alone, and the report is correct for exactly as long as the chain holds.
You have done this chain forty times. You are good at it. That is the problem: being good at a chain of manual passes means the quality of the work is capped at the quality of your last manual pass, on a Sunday, tired, with the milk turning. The senior people you trained under did not work this way at the end. They stopped running the regressions. They designed the thing that ran the regressions, sent the results to someone whose job was to doubt them, and read what came back. Their lasting contribution was not the last clean table. It was a process that kept producing clean tables after they had gone home — a machine for being doubted on schedule, which is most of what a result needs and the one thing a tired person on a Sunday cannot reliably supply.
Why / When
Everything in this course was a part. A1–A2 made the agent obey a brief. Unit B gave the lab a written manual and a reproducible home. Unit C made the rules enforce themselves and taught the agent the schema. Unit D dispatched a fleet under a results contract and stood up a referee that argues from evidence. Unit E made the agent a callable function, put the chores on a schedule, and shipped the methodology as a kit. F1 composes all of it into one system that runs the research on a loop and improves its own result.
The thing being absorbed here is not another role. It is the posture of the principal investigator who has stopped doing the research and started designing the system that does it: who writes the loop, sets the budget, decides which questions a machine may answer and which must pause for a person, and then reads what the system brings back. The research loop is the payoff of the whole curriculum because it is the first artifact that keeps improving the result after you leave the keyboard — and because the discipline that makes it safe (report, don’t act; a hard budget cap; a human gate on substantive decisions only) is the same discipline every prior unit was rehearsing.
It runs at the very end of the pipeline — after the panel is built, the specifications are written, and the referee exists — because a loop is only as good as the parts it orchestrates. Closing the loop is the last thing you build, and the first thing that outlives you.
Contrary winds
Not for: a one-shot answer you will never revisit — a back-of-envelope number for a meeting tomorrow does not need a loop, a referee, or a replication package, and standing one up for it is the expensive theatre of building a printing press to write a postcard.
Mechanics
Field note
This is a capstone in orchestration, not a new statistical method — the estimation it loops over can be fit in Python or R without changing a line of the loop, the contracts, or the gate. So this page declares no R variants and spends its budget on the architecture: how the parts compose, and where a person stays in the circuit. The statistics it improves are the ones you already built in Units C and D.
The system assembled
Read the prior five units not as topics but as contracts between parts. Each unit shipped one component and, more importantly, one interface the next part could rely on without asking permission:
| Unit | The part | The contract it exposes |
|---|---|---|
| A | a directed agent | a brief in, an artifact out |
| B | the lab manual + reproducible home | one command rebuilds the environment from text |
| C | self-enforcing rules + a schema-aware agent | every write is checked; the data shape is known |
| D | a fleet + a results contract + a referee | each lane writes results/{id}/, nothing else; an isolated critic argues from evidence |
| E | a callable agent + schedules + a kit | claude -p / codex exec runs the analysis headless from a clean clone |
The system is what you get when these interfaces line up end to end: the fleet (D4) produces estimates under the results contract; the referee (D4) reads those results plus the draft and files evidenced findings; the headless agent (E1) re-runs only the affected work; the report is rebuilt from the artifacts so it cannot drift; the replication package (E1) proves the whole thing rebuilds from nothing. No part reaches inside another. That is the only reason the next thing — a control loop wrapped around all of them — is even possible.
Step 1 of 8.
This is the whole lab as one machine. /loop opens an iteration by running the estimation + robustness fleet — every specification fanned out under the D4 results contract, each writing to its own results/{spec_id}/estimates.json. Twelve clean answers, no write race. That is the input to everything that follows.
The film above is the architecture in one closed cycle. The rest of this lesson is the two ways to drive that cycle, the discipline that keeps it safe, and the accounting that proves it ran.
The research loop
The loop is the new primitive, and it is the same check-versus-destination split you met supervising overnight runs in D3 — now wrapped around the whole lab instead of a single estimation. One tool composes the iteration from a recurring, self-paced check you author; the other hands the runtime an objective and stopping rules and lets it drive. Both reach the same closed cycle. Neither hides from the other — the contrast is the lesson, so read both spotlights even on your own tool.
Claude Code Your tool
/loop — you orchestrate the iteration
/loop is a recurring, self-paced prompt. Here the prompt is not a
one-line log check (that was D3) — it is the entire research iteration,
written as a checklist the agent re-runs until a stopping condition holds:
> /loop Run one research iteration: 1. Fan out the specification fleet (workflow over specifications.csv); gate each lane on the results contract. 2. Run the referee subagent over results/ + the report draft. It must cite file, line, and number for every finding. 3. Triage findings by severity. Auto-fix LOW findings (renames, stale maps). For any HIGH finding, STOP and ask me to approve the change before acting — do not drop a specification on your own. 4. Re-run only the affected lanes headlessly; regenerate the report with `make report`. 5. If the worst remaining severity is 0, stop — converged. Otherwise loop. Hard stop at $40 of spend regardless.The design burden is the whole iteration written down: what runs, who reviews it, what severity routes where, what counts as converged, and what caps the spend. Three properties make a research loop survivable, and they are the same three that made an overnight loop survivable in D3 — scaled up to the lab:
- Checkable convergence — “worst severity is 0” is a condition the referee scores, not a vibe. “Make the result better” is not a stopping rule; it is an invitation to optimize forever.
- Bounded by a gate — the loop may auto-fix mechanical findings; it may not drop a specification, reword a claim, or change the abstract without you. Substantive decisions pause for a person. (The next section names the line.)
- Capped — the
$40hard stop is load-bearing. A loop with no budget cap and a referee that always finds something is a machine for spending a grant on diminishing returns.
The composability is the capstone’s whole argument: the same /loop
primitive that watched a download in D3 now drives a fleet, a referee, a
headless re-run, and a report build. You did not buy a research-loop
feature. You wrote the lab’s iteration down and handed it to a loop.
Codex Your tool
Goal Mode + @codex — you set the destination
Goal Mode (GA) drives toward an objective for as long as it takes, choosing
its own intermediate steps; the GitHub @codex integration runs the same
cycle as an issue loop, where each iteration is a cloud task that opens
a reviewable pull request. The research iteration becomes a destination plus
abort criteria:
Goal: drive the weather-mobility result to convergence.One iteration = fan out the specification fleet under the resultscontract, run the referee over results/ and the report draft, triage itsfindings by severity, fix LOW findings, regenerate the report, and scoreconvergence (worst remaining severity).
Stopping rules:- Converged when the worst referee severity is 0 — then stop and report.- For any HIGH finding, STOP and open an issue for my approval before acting. Never drop a specification or edit the abstract autonomously.- Hard stop at $40 of spend.
@codex: run each iteration as a cloud task; open a PR per iteration so thediff is reviewable, and tag me on any HIGH finding.The design burden is destination-shaped: you write the success criterion (severity 0), the abort criteria (the HIGH-finding gate, the budget cap), and the review surface (a PR per iteration). Three properties make the goal survivable, mirroring the loop’s:
- A measurable objective — “worst severity is 0” is checkable at review time; “improve the robustness” is creative accounting waiting to happen.
- Explicit gates and aborts — the HIGH-finding pause and the spend cap are not decoration. An objective-driven run with no abort optimizes through the night, including through the decisions it should have woken you for.
- A reviewable surface — one PR per iteration makes the morning a diff,
not an investigation. The
@codexissue loop turns each cycle into a unit of review you can approve, request changes on, or close.
The managed delegation is the point: hours of multi-step iteration from one written brief, with the review surface built in — closer to chairing a lab than running a checklist.
| Intent | Claude Code | Codex |
|---|---|---|
| drive the research iteration | /loop running the whole iteration as a recurring prompt | Goal Mode toward "severity 0", or an @codex issue loop |
| review each iteration | the loop reports each pass; you read the journal | one reviewable PR per iteration (the @codex loop) |
| gate a substantive decision | STOP-and-ask step inside the loop prompt | abort clause → open an issue, tag the human, wait |
| cap the spend | hard-stop budget in the loop prompt | hard-stop budget in the goal + cloud-task limits |
| score convergence | the loop checks worst severity each pass | the goal’s stopping rule on worst severity |
The loop’s improvement record is the figure the whole capstone builds toward: referee severity falling iteration by iteration as findings are triaged and fixed, and the headline precipitation coefficient recovering from a biased +0.004 to an honest +0.009 once the loop drops the endogenous control:
the numbers behind this figure
loop_iterations
| iter | high | med | low | max_severity | report_beta_precip | n_specs | n_surviving | note |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | high | 0 | 13 | 7 | draft conditions on same-hour citywide demand (s13): post-treatment control; referee files endogeneity (HIGH) |
| 2 | 1 | 3 | 5 | high | 0.01 | 13 | 11 | loop drops the endogenous control; β recovers. One HIGH remains: missing storm-onset pre-trend check |
| 3 | 0 | 2 | 4 | medium | 0.01 | 14 | 12 | event-study pre-trend added (flat); FE tightened to zone+hod+date. No HIGH; mediums on sample filters |
| 4 | 0 | 0 | 2 | low | 0.01 | 14 | 13 | robustness filters reconciled; referee files no HIGH/MED. Loop converges below the severity bar; report regenerated |
real_anchor_endogenous_control
| endogenous_spec | s13 (same-hour citywide demand control; post-treatment) |
|---|---|
| report_beta_before | 0 |
| clean_spec | s02 (identical FE, no demand control) |
| report_beta_after | 0.01 |
| source | figures_d D4 spec curve (real warehouse estimation) |
honesty note STYLIZED LOOP EXHIBIT — a teaching illustration, not a claim about a specific recorded run. The per-iteration severity counts (high/medium/low) are an illustrative convergence trajectory for the F1 loop on the course slice; they are NOT a log of an actual referee session. What IS real: the loop's headline fix and its two β endpoints. Iteration 1's report conditions on same-hour citywide demand (D4 spec s13), an endogenous / post-treatment control that collapses the precipitation coefficient to +0.0038; the loop drops that control and the coefficient recovers to +0.0087 (clean spec s02, identical fixed effects). Both numbers come from the same warehouse estimation that produces the D4 specification curve. The exhibit illustrates how an adversarial referee drives the loop below a severity bar; it does not assert these exact counts occurred.
The shared guardrails
The two philosophies run under one discipline, and it is not optional — it is what separates an automatic-research system from an unsupervised one. The guardrails are tool-neutral; they are project policy, and every prior unit was rehearsing them.
- Report, don’t act (unattended default). A loop running while you sleep reads, reasons, fixes the mechanical, and reports. It does not act on anything substantive on its own. This is E2’s lab-manager discipline — the scheduled chore that surfaces a problem rather than silently “correcting” it — turned on the loop itself.
- A hard budget cap. Tokens are the loop’s fuel and a referee will always find something. Without a ceiling, the loop optimizes a result that stopped improving three iterations ago. The cap is a number you set before you start, visible in the loop, enforced regardless of severity.
- A human gate on substantive decisions only. This is the line, and it is worth stating precisely below — because a gate on everything is just you doing the work again, and a gate on nothing is an agent quietly rewriting your claims.
Human-gated versus mechanical
The triage rule is the heart of the system. Findings sort into two piles, and the pile decides who acts:
| Finding | Example | Who handles it | Why |
|---|---|---|---|
| Mechanical | a stale rename map, a column the schema renamed, a broken cross-reference in an exhibit | the loop, on its own | there is one correct answer and no claim changes |
| Substantive | dropping a specification, reweighting the sample, the wording of the abstract or a conclusion | a human gate — the loop pauses and asks | the answer is a judgment, and acting on it changes what the paper claims |
A mechanical fix has a right answer the machine can reach. A substantive fix changes what the result says, and “what the result says” is the one thing a one-person lab cannot outsource without ceasing to be the author. The loop in the scenario below files exactly two findings — a LOW rename and a HIGH endogenous control — and routes each to its pile in front of you.
The named failure mode
The signature failure of an automatic-research system has a name: convergence theatre. A loop pointed at “make the result look clean” will make it look clean — by dropping the inconvenient specification, shrinking the sample until a pre-trend passes, or softening the abstract until no number contradicts it. Every one of those reads as diligence in a transcript. It is the D3 goals game and D4 metric gaming failure, now operating across the whole lab at once and with the report as its alibi.
The system is built to make convergence theatre hard, not impossible: the referee argues from evidence the loop cannot fabricate (file, line, number); the human gate keeps every claim-changing decision in a person’s hands; and the budget cap stops the loop from grinding the result smooth. A loop that converges because it talked its way past the referee has not improved the result — it has staged a play about improving it, and the guardrails are the audience that does not applaud.
The report, regenerated
The written paper is not the deliverable. It is one of the system’s
outputs, rebuilt from the artifacts every iteration so it can never drift
from the numbers behind it. make report reads the current
results/, the figures, and the tables, and regenerates the document — the
same make report the loop calls after every re-run. The lab writes the
report in one of two lanes, and the agents draft into whichever you chose:
Claude Code
A LaTeX lane: the paper is reports/paper.tex, the figures and tables
are \input{}-ed from the build, and the agent drafts the methods, the
results narration, and the figure captions from the artifacts — never the
abstract, never a headline claim. You write those. The build is one command
and the document is plain text you version, so a regenerated report is a
reviewable diff, not a mystery.
Codex
A Quarto lane: the paper is reports/paper.qmd, the figures and tables
are produced by code chunks that read the current results/, and the agent
drafts the prose around them from the artifacts — again, never the abstract
or a headline claim. Rendering is one command; because the exhibits are
generated from the live results, a stale number in the prose is impossible by
construction.
Either lane, the division of labor is the same and it is the human-gated line again, applied to writing: the agents draft from the artifacts; you own the claims and the abstract. A figure is a fact the build pulls in; a claim is a judgment you sign. The report’s two headline exhibits — the demand map and the elasticity table — are regenerated from the panel every iteration, so the document the loop ships always shows the result the loop currently holds:
the numbers behind this figure
zone_mean_demand 265 rows
SELECT location_id, any_value(borough) AS borough, avg(pickups) AS mean_pickups, count(*) AS n_hours FROM panel_zone_hour GROUP BY location_id geometry 263 rows
out/geo/taxi_zones_simplified.geojson (simplified TLC taxi_zones; A3 map export; presentation join on LocationID, never a spatial join) demand_range
| vmin_pickups_per_hr | 0 |
|---|---|
| vmax_pickups_per_hr | 215.96 |
| n_zones_with_demand | 263 |
| busiest_zone | Midtown Center |
| busiest_borough | Manhattan |
| busiest_mean_pickups | 215.96 |
honesty note Illustrative run on the course's 2024-02/03/06 slice: per-zone mean pickups/hour from panel_zone_hour (zero cells kept, so a quiet zone reads low, not missing). The map traces the simplified zone geometry exported for the A3 overworld map (out/geo/taxi_zones_simplified.geojson) and joins it to the warehouse on LocationID — a presentation join only; the warehouse never joins on geometry. Airport / outside-NYC lookup rows have no panel demand and draw as empty paper, not zero. The ramp is gamma-compressed (0.5) for legibility, so color encodes rank-ish magnitude, not a linear scale; the legend ticks are the real pickup values.
the numbers behind this figure
analysis_panel 248,099 rows
SELECT location_id, borough, ts_local, pickups, precipitation, temperature_2m, snowfall FROM panel_zone_hour WHERE borough IN ('Manhattan','Brooklyn','Queens','Bronx','Staten Island') elasticities_by_borough
| borough | n_zone_hours | precip_beta | precip_ci_lo | precip_ci_hi | temp10_beta | temp10_ci_lo | temp10_ci_hi | snow_beta | snow_ci_lo | snow_ci_hi |
|---|---|---|---|---|---|---|---|---|---|---|
| Manhattan | 125,165 | 0.01 | 0.01 | 0.02 | 0.02 | -0.01 | 0.04 | 0.06 | 0.02 | 0.11 |
| Brooklyn | 51,930 | 0.01 | 0.01 | 0.02 | -0.01 | -0.04 | 0.02 | -0.04 | -0.11 | 0.03 |
| Queens | 50,378 | -0 | -0.01 | 0.01 | 0.03 | -0 | 0.06 | -0.02 | -0.09 | 0.05 |
| Bronx | 20,346 | 0 | -0 | 0.01 | -0.02 | -0.05 | 0.02 | -0.02 | -0.1 | 0.06 |
| Staten Island | 280 | -0 | -0.02 | 0.02 | -0.04 | -0.19 | 0.1 | -0.01 | -0.26 | 0.23 |
| All boroughs | 248,099 | 0.01 | 0.01 | 0.01 | 0 | -0.02 | 0.02 | -0 | -0.04 | 0.03 |
honesty note Illustrative run on the course's 2024-02/03/06 slice. Outcome is log(pickups) over nonzero zone-hours so coefficients are log-points; precip is per mm/h, temperature per 10 °C, snow per cm/h. Zone + hour-of-day + date fixed effects absorbed by iterative within-demeaning (FWL), dof-corrected for the absorbed dummies; 95% classical CIs. β is rubric-flagged where the CI excludes zero. Snow has few nonzero hours in this three-month slice, so its CIs are wide — that is honest imprecision, not a measured null. Staten Island carries only 280 nonzero zone-hours (yellow cabs barely serve it); its row is near-uninformative by construction.
The replication package
The last output is the one that makes the rest trustworthy: a replication
package that passes its own fresh-clone self-test. make replicate clones
the repo into a clean checkout, rebuilds the environment from uv.lock,
runs the analysis headlessly (E1’s callable agent — claude -p /
codex exec, no human at the keyboard), and checks the regenerated results
against the committed manifest hashes. It is the E1 reproducibility test
promoted to gate the whole paper: if the result does not rebuild from
nothing, the package fails, and a result that cannot be rebuilt is not a
result you can defend. The loop produces the numbers; the replication
package proves a stranger can reproduce them.
Guided Run — Closing the Loop: the research that improves itself
claudeGuided Run — Closing the Loop: the research that improves itself
claudeField Assignment
Artifact make replicate passes from a fresh clone — loop converged, report regenerated, package self-tests green
Close the loop on the weather-mobility result, then prove the whole system rebuilds from nothing.
- Wire the iteration. Write the research loop in your primary tool —
the
/loopprompt that runs fleet → referee → triage → re-run → report, or the Goal Mode /@codexissue loop with the same stages. State the convergence condition (worst severity 0), the human gate on HIGH findings, and the hard budget cap explicitly. - Run it to convergence. Let it iterate. When the referee files the HIGH finding — the endogenous control from D4 — the loop must STOP and route it to you; approve dropping the post-treatment control, then let the affected lane re-run. Auto-fix the LOW finding. Watch the severity fall and the headline coefficient recover from +0.004 to +0.009.
- Regenerate the report.
make reportrebuilds it from the current artifacts — one component, regenerated every iteration. You own the abstract and the claims; the agents draft the rest from the figures and tables. The headline map and the elasticity table are exhibits the build pulls in, not prose you retype. - Build the replication package.
make replicateruns the analysis headlessly from a clean checkout and proves it reproduces — the E1 self-test, now gating the whole paper. - Account for it. Total the roster, read the failure log, and print the Lab Charter. The loop’s own record — how many iterations it ran, how far severity fell, how many times it paused for you — is part of the honest accounting, not a victory lap.
make replicate is the capstone milestone: it passes only when the loop
converged (worst severity 0), the report regenerated from the artifacts,
and the package rebuilds the result from a fresh clone. Passing it means the
lasting artifact is no longer the last clean table — it is the system that
keeps producing clean tables after you have gone home.
make replicateadvances F1/loop running the whole iteration as a recurring prompt, or Goal Mode / an @codex issue loop with the same stages. The cap is a number you commit to before the loop starts.
The mechanical LOW finding (a rename) auto-fixed; only the claim-changing HIGH paused for a person. Severity fell 2 → 0 across the iterations.
The tight CI was the bug, not precision: the same-hour demand control absorbed the variance precip works through.
The headline map and the elasticity table are exhibits the build pulls in, not prose you retype.
The E1 fresh-clone self-test, now gating the whole paper.
The honest column is the one that earns trust — a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
〜〜
A loop with no budget cap is a grant-shredder. A referee will always find something, and an objective framed as “keep improving” has no natural fixed point — so the loop iterates on a result that stopped moving three passes ago, billing tokens the whole way. The cap is a number you commit to before the loop starts; it is the only stopping rule that holds when convergence does not.
- [both]
〜〜
Gating everything is just doing the work yourself; gating nothing lets the agent rewrite your claims. The system earns its keep only at the line: mechanical findings (renames, broken references) proceed unattended, substantive ones (dropping a spec, the abstract’s wording) pause for a person. Mislabel a claim-changing decision as mechanical and the loop will quietly edit what the paper asserts; mislabel a rename as substantive and you have rebuilt the Sunday chain you were trying to escape.
- [both]
〜〜
Convergence is not correctness. A loop converges when the referee runs out of findings — which it can reach honestly (the result improved) or by theatre (the inconvenient specification was dropped, the sample was shrunk, the claim was softened). Read why it converged, not just that it did: the loop’s record should show severity falling because findings were fixed, with a human in the loop for every fix that changed a claim.
- [CC]
Burying the whole iteration in one
/loopprompt makes it un-auditable. If the fleet fan-out, the referee call, and the triage rule are an opaque paragraph, you cannot tell a real convergence from a talked-past one. Keep the loop prompt a readable checklist with the stages named, the gate explicit, and the cap visible — the loop is yours to re-read, the way D4’s workflow script was. - [CX]
An
@codexissue loop that opens a PR per iteration but auto-merges them has removed the review surface it exists to provide. The PR is the gate; merge it yourself, especially the iteration that touches a claim. A goal run with abort clauses you never see fire is a goal run you are not actually supervising.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)The frontier — and where you go next
The system you built is composed of primitives that are still moving.
/loop and Goal Mode are converging on the same shape; @codex issue
loops and Routines are both crossing into scheduled, reviewable autonomy;
referee skills are getting better at demanding evidence. Treat the loop you
wrote as a design you will revise, not a finished tool — the contracts
between the parts are the durable thing; the primitives driving them will
change under you.
The vendors are converging on the research loop from both ends — Claude’s
local /loop plus Routines toward scheduled autonomy, Codex’s Goal Mode
plus @codex toward reviewable cloud iteration. The architecture in this
lesson — fleet under contract, isolated referee, headless re-run,
regenerated report, human gate on claims — outlives whichever primitive
wins, because it is project discipline, not a feature.
The system is also domain-agnostic. Nothing in the loop, the contracts, or the gate is specific to taxis and weather. Swap the panel for hospital admissions and air quality, or store traffic and local events, and the same architecture holds: a fleet under a results contract, a referee that argues from evidence, a report regenerated from artifacts, a human at the claims. The domain-swap is the real test of whether you learned a system or a recipe — port it, and the parts that break are the ones you were leaning on a person to hold together.
The cumulative final exam
This is the whole course, both tools, every question type — sampled across Units A through F. It is not gated; take it when you are ready to see what stuck.
Check Your Bearings
- Question 1Choose one
An agent's context window has filled with the back-and-forth of a long exploratory session and its answers are drifting. What is the right move before continuing?
- Question 2Match the dialects
Match each tool to the file where its always-on lab manual lives.
Claude Code's project manualCodex's project manual - Question 3Read the config
A pipeline-profile agent is launched to run an overnight estimation. Given the profile below, what happens when it tries to write outside results/?
profile: pipeline writes: [results/, data/processed/] network: off - Question 4Choose one
What is the durable value of writing a procedure as a skill rather than re-prompting it each session?
- Question 5Put in orderdialect check — Claude Code
A PostToolUse contract hook guards a data write. Order the events from the agent's action to the outcome.
- The agent runs a tool that writes data/processed/panel.parquet
- The failure surfaces to the agent, which must fix the write before proceeding
- The hook exits non-zero on a violation, blocking the step
- The hook script checks the schema contract (columns, null rates, row delta)
- The PostToolUse hook fires on the completed write
- Question 6Choose onestretch
You want the agent to explore a DuckDB warehouse's schema and build the analysis panel. When is registering an MCP server the right call rather than just handing it a SQL file?
- Question 7Choose all that apply
Which statements about giving each parallel workstream its own worktree are correct? (Select all.)
- Question 8Predict the outputdialect check — Claude Code
An estimation log updates every ~20 minutes. You supervise it with a /loop set to a 2-minute cadence. What do you wake up to?
/loop 2m Read the tail of results/logs/run.log. If SEs diverge, the likelihood plateaus, or the process died, stop and report. Otherwise reply OK and nothing else. - Question 9Spot the flaw — mark the suspect linesstretch
A robustness specification reads suspiciously precise. One line on the right-hand side is doing something no honest control should. Which line is the flaw?
# spec s13: precipitation elasticity of log demand y = log_pickups rhs = [precip_mm] rhs += fixed_effects(['zone', 'hour_of_day', 'date']) rhs += [log_city] # same-hour citywide demand fit(y, rhs, cluster='zone') - Question 10What happens next?
A reviewer asks whether your result reproduces. You run the analysis headlessly from a clean clone and it builds the panel, fits the models, and writes the figures with no human at the keyboard. What does this prove, and what is the next step?
$ git clone … weather-mobility && cd weather-mobility $ make replicate → environment rebuilt from uv.lock → agent (headless) built panel, fit models, wrote figures → results match the committed manifest hashes. PASS. - Question 11Choose one
You put the lab's monthly data-refresh chore on a schedule. Under the 'report, don't act' discipline, what does the scheduled run do when it finds the new month's file has an unexpected extra column?
- Question 12Choose one
Why is a hard budget cap non-negotiable in the research loop, even when a clean convergence condition is also set?
- Question 13Choose all that applystretch
The F1 system composes the prior units as contracts between parts. Which of these are tool-neutral project discipline that hold regardless of which primitive drives the loop? (Select all.)
- Question 14Match the dialects
Match each job in the research loop to how each tool reaches it.
Claude Code drives the iteration by…Codex drives the iteration by…Either tool gates a substantive decision by…
The payoff, accounted
A system you cannot account for is a system you cannot trust. The three widgets below are the honest ledger of the whole lab — and “honest” is load-bearing: a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.
First, the roster totaled — every position the lab absorbed, with the human hours it would have cost against the agent hours it took, and your own measured hours where you logged them:
The lab, totaled
Every lesson’s roster row, summed into the lab you didn’t hire. The drawer shows these one at a time; here they ink in together.
| Lesson | Role absorbed | Est. human-RA | Agent (yours when measured) |
|---|---|---|---|
| A1 | the wall — the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free |
| A2 | you, working an order of magnitude faster — but only if you direct the work | the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts |
| B1 | the lab manual nobody writes — the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter |
| B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon — most of it download wall-clock, not attention |
| B3 | the data manager who guards the raw files — the person who says no near the master copies | permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not |
| C1 | the methodologist — the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked |
| C2 | data manager / QA who never sleeps | permanent vigilance — est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter |
| C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication |
| D1 | the RA pool — and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes |
| D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end |
| D3 | overnight RA | one night shift per estimation batch — and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine |
| D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass |
| E1 | reproducibility checker | a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter |
| E2 | lab manager's standing chores | a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate |
| E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt |
| F1 | the whole lab, orchestrated — the PI who designs the system instead of doing the work | each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended |
| Positions absorbed | 0 of 16 | ||
The honest column: no interventions logged yet. Your measured hours override these estimates in your own roster.
Then the failure log — every place a human had to step in. This is the column the marketing leaves out and the one that earns trust: the hook that caught schema drift, the referee that caught the endogenous control, the loop iteration that paused for your approval. A lab that hides its interventions is selling something.
The honest column
Every place a human had to step in — what predicted it, which mechanism caught it. The accounting is browsable so the one-person-lab claim stays truthful.
No interventions logged yet. The simulations file these as they happen — and a clean column is itself an honest result.
And finally the Lab Charter — the printable record of what the system absorbed, what it caught, and how far the loop improved the result unattended. Run the research-loop scenario above first; its two iterations fill the loop’s record here, so the charter can state plainly how many iterations ran, how far severity fell, and how many times a person was gated. That line — the loop ran, the result improved, a human held the claims — is the capstone’s whole argument in one sentence:
Lab Charter
Your One-Person Research Lab
chartered June 12, 2026
A lab of one, doing the work of a hall of specialists — each role absorbed by a feature you installed and can describe. This certificate reports what you have actually done, measured from your own field journal. It is accurate, not aspirational.
The roster — 0 of 9 positions absorbed
the data manageropen position
Unfilled — complete C2 to absorb this role.
the methodologistopen position
Unfilled — complete C1 to absorb this role.
the data engineeropen position
Unfilled — complete C3 to absorb this role.
the RA poolopen position
Unfilled — complete D1 to absorb this role.
the overnight RAopen position
Unfilled — complete D3 to absorb this role.
the adviseropen position
Unfilled — complete D1 to absorb this role.
the refereeopen position
Unfilled — complete D4 to absorb this role.
the lab manageropen position
Unfilled — complete E2 to absorb this role.
the reproducibility checkeropen position
Unfilled — complete E1 to absorb this role.
Milestones reached
No milestones punched yet — the route begins at A1.
Incidents your system caught
- schema drift— caught by the hook (not yet reached)
- an endogenous control— caught by the referee (not yet reached)
The loop’s improvement record
The loop has not yet run — F1’s research-loop scenario fills this record.
The human moved up a level. You stopped running the regressions and built the system that runs them, sends the results to something whose job is to doubt them, and waits for you at the one place a person must stand — the claims. The last clean table was never going to be the lasting artifact. The system that keeps producing clean tables, on a loop, refereed, after you have gone home — that is.
Field journal
Parity note
The research loop is a real philosophical split, and this page teaches it as
one rather than papering over it: Claude Code composes the iteration from a
local /loop you author and can re-read, with Routines as the scheduled
analogue; Codex hands the runtime an objective in Goal Mode or runs the
cycle as an @codex issue loop that opens a reviewable PR per iteration.
Local scripted recurrence versus managed objective-driven delegation —
neither tool offers the other’s primitive natively, and the loop-with-an-
iteration-shaped-prompt approximates Goal Mode about as well as a checklist
approximates a brief. What is tool-neutral is everything that matters most:
the results contract, the isolated referee, the headless re-run, the
regenerated report, the budget cap, and the human gate on substantive
decisions. Those are project discipline, and they would converge the same
result no matter which primitive drove the loop.