F1 capstone ~150 min

Automatic Research: Closing the Loop

Absorbs: the whole lab, orchestrated — the PI who designs the system instead of doing the work

Advances F1

The Pain

The result is finished, which is to say it is finished until you touch it again. It is Sunday. The estimate is on disk, the figures are current, the draft reads cleanly, and you know — the way you know the milk is about to turn — that the whole arrangement is true only as of the last time you ran it by hand. A reviewer’s question on Tuesday will mean a new specification, which means a re-estimate, which means a new table, which means a paragraph rewritten to match the table, which means re-reading the abstract to check it still says what the numbers now say. Each of those steps is a thing you do, in order, alone, and the report is correct for exactly as long as the chain holds.

You have done this chain forty times. You are good at it. That is the problem: being good at a chain of manual passes means the quality of the work is capped at the quality of your last manual pass, on a Sunday, tired, with the milk turning. The senior people you trained under did not work this way at the end. They stopped running the regressions. They designed the thing that ran the regressions, sent the results to someone whose job was to doubt them, and read what came back. Their lasting contribution was not the last clean table. It was a process that kept producing clean tables after they had gone home — a machine for being doubted on schedule, which is most of what a result needs and the one thing a tired person on a Sunday cannot reliably supply.

Why / When

Everything in this course was a part. A1–A2 made the agent obey a brief. Unit B gave the lab a written manual and a reproducible home. Unit C made the rules enforce themselves and taught the agent the schema. Unit D dispatched a fleet under a results contract and stood up a referee that argues from evidence. Unit E made the agent a callable function, put the chores on a schedule, and shipped the methodology as a kit. F1 composes all of it into one system that runs the research on a loop and improves its own result.

The thing being absorbed here is not another role. It is the posture of the principal investigator who has stopped doing the research and started designing the system that does it: who writes the loop, sets the budget, decides which questions a machine may answer and which must pause for a person, and then reads what the system brings back. The research loop is the payoff of the whole curriculum because it is the first artifact that keeps improving the result after you leave the keyboard — and because the discipline that makes it safe (report, don’t act; a hard budget cap; a human gate on substantive decisions only) is the same discipline every prior unit was rehearsing.

It runs at the very end of the pipeline — after the panel is built, the specifications are written, and the referee exists — because a loop is only as good as the parts it orchestrates. Closing the loop is the last thing you build, and the first thing that outlives you.

Contrary winds

Not for: a one-shot answer you will never revisit — a back-of-envelope number for a meeting tomorrow does not need a loop, a referee, or a replication package, and standing one up for it is the expensive theatre of building a printing press to write a postcard.

Mechanics

Field note

This is a capstone in orchestration, not a new statistical method — the estimation it loops over can be fit in Python or R without changing a line of the loop, the contracts, or the gate. So this page declares no R variants and spends its budget on the architecture: how the parts compose, and where a person stays in the circuit. The statistics it improves are the ones you already built in Units C and D.

The system assembled

Read the prior five units not as topics but as contracts between parts. Each unit shipped one component and, more importantly, one interface the next part could rely on without asking permission:

Unit	The part	The contract it exposes
A	a directed agent	a brief in, an artifact out
B	the lab manual + reproducible home	one command rebuilds the environment from text
C	self-enforcing rules + a schema-aware agent	every write is checked; the data shape is known
D	a fleet + a results contract + a referee	each lane writes `results/{id}/`, nothing else; an isolated critic argues from evidence
E	a callable agent + schedules + a kit	`claude -p` / `codex exec` runs the analysis headless from a clean clone

The system is what you get when these interfaces line up end to end: the fleet (D4) produces estimates under the results contract; the referee (D4) reads those results plus the draft and files evidenced findings; the headless agent (E1) re-runs only the affected work; the report is rebuilt from the artifacts so it cannot drift; the replication package (E1) proves the whole thing rebuilds from nothing. No part reaches inside another. That is the only reason the next thing — a control loop wrapped around all of them — is even possible.

System Player film — The Research Loop

step 1/8

Step 1 of 8.

This is the whole lab as one machine. /loop opens an iteration by running the estimation + robustness fleet — every specification fanned out under the D4 results contract, each writing to its own results/{spec_id}/estimates.json. Twelve clean answers, no write race. That is the input to everything that follows.

The film above is the architecture in one closed cycle. The rest of this lesson is the two ways to drive that cycle, the discipline that keeps it safe, and the accounting that proves it ran.

The research loop

The loop is the new primitive, and it is the same check-versus-destination split you met supervising overnight runs in D3 — now wrapped around the whole lab instead of a single estimation. One tool composes the iteration from a recurring, self-paced check you author; the other hands the runtime an objective and stopping rules and lets it drive. Both reach the same closed cycle. Neither hides from the other — the contrast is the lesson, so read both spotlights even on your own tool.

Claude Code Your tool

/loop — you orchestrate the iteration

/loop is a recurring, self-paced prompt. Here the prompt is not a one-line log check (that was D3) — it is the entire research iteration, written as a checklist the agent re-runs until a stopping condition holds:

> /loop  Run one research iteration:
  1. Fan out the specification fleet (workflow over specifications.csv);
     gate each lane on the results contract.
  2. Run the referee subagent over results/ + the report draft. It must
     cite file, line, and number for every finding.
  3. Triage findings by severity. Auto-fix LOW findings (renames, stale
     maps). For any HIGH finding, STOP and ask me to approve the change
     before acting — do not drop a specification on your own.
  4. Re-run only the affected lanes headlessly; regenerate the report
     with `make report`.
  5. If the worst remaining severity is 0, stop — converged. Otherwise
     loop. Hard stop at $40 of spend regardless.

The design burden is the whole iteration written down: what runs, who reviews it, what severity routes where, what counts as converged, and what caps the spend. Three properties make a research loop survivable, and they are the same three that made an overnight loop survivable in D3 — scaled up to the lab:

Checkable convergence — “worst severity is 0” is a condition the referee scores, not a vibe. “Make the result better” is not a stopping rule; it is an invitation to optimize forever.
Bounded by a gate — the loop may auto-fix mechanical findings; it may not drop a specification, reword a claim, or change the abstract without you. Substantive decisions pause for a person. (The next section names the line.)
Capped — the $40 hard stop is load-bearing. A loop with no budget cap and a referee that always finds something is a machine for spending a grant on diminishing returns.

The composability is the capstone’s whole argument: the same /loop primitive that watched a download in D3 now drives a fleet, a referee, a headless re-run, and a report build. You did not buy a research-loop feature. You wrote the lab’s iteration down and handed it to a loop.

Codex Your tool

Goal Mode + @codex — you set the destination

Goal Mode (GA) drives toward an objective for as long as it takes, choosing its own intermediate steps; the GitHub @codex integration runs the same cycle as an issue loop, where each iteration is a cloud task that opens a reviewable pull request. The research iteration becomes a destination plus abort criteria:

Goal: drive the weather-mobility result to convergence.
One iteration = fan out the specification fleet under the results
contract, run the referee over results/ and the report draft, triage its
findings by severity, fix LOW findings, regenerate the report, and score
convergence (worst remaining severity).

Stopping rules:
- Converged when the worst referee severity is 0 — then stop and report.
- For any HIGH finding, STOP and open an issue for my approval before
  acting. Never drop a specification or edit the abstract autonomously.
- Hard stop at $40 of spend.

@codex: run each iteration as a cloud task; open a PR per iteration so the
diff is reviewable, and tag me on any HIGH finding.

The design burden is destination-shaped: you write the success criterion (severity 0), the abort criteria (the HIGH-finding gate, the budget cap), and the review surface (a PR per iteration). Three properties make the goal survivable, mirroring the loop’s:

A measurable objective — “worst severity is 0” is checkable at review time; “improve the robustness” is creative accounting waiting to happen.
Explicit gates and aborts — the HIGH-finding pause and the spend cap are not decoration. An objective-driven run with no abort optimizes through the night, including through the decisions it should have woken you for.
A reviewable surface — one PR per iteration makes the morning a diff, not an investigation. The @codex issue loop turns each cycle into a unit of review you can approve, request changes on, or close.

The managed delegation is the point: hours of multi-step iteration from one written brief, with the review surface built in — closer to chairing a lab than running a checklist.

Translation guide
Intent	Claude Code	Codex
drive the research iteration	/loop running the whole iteration as a recurring prompt	Goal Mode toward "severity 0", or an @codex issue loop
review each iteration	the loop reports each pass; you read the journal	one reviewable PR per iteration (the @codex loop)
gate a substantive decision	STOP-and-ask step inside the loop prompt	abort clause → open an issue, tag the human, wait
cap the spend	hard-stop budget in the loop prompt	hard-stop budget in the goal + cloud-task limits
score convergence	the loop checks worst severity each pass	the goal’s stopping rule on worst severity

The loop’s improvement record is the figure the whole capstone builds toward: referee severity falling iteration by iteration as findings are triaged and fixed, and the headline precipitation coefficient recovering from a biased +0.004 to an honest +0.009 once the loop drops the endogenous control:

The loop, converging — Referee findings by severity across four iterations of the F1 research loop: max severity falls high → high → medium → low as findings are triaged and fixed, and the report's headline precipitation β recovers from +0.004 to +0.009 once the loop drops the endogenous, post-treatment same-hour demand control (the real D4 fix). A stylized teaching exhibit — the severity trajectory is illustrative, the β endpoints are real.

iter	high	med	low	max_severity	report_beta_precip	n_specs	n_surviving	note
1	2	3	4	high	0	13	7	draft conditions on same-hour citywide demand (s13): post-treatment control; referee files endogeneity (HIGH)
2	1	3	5	high	0.01	13	11	loop drops the endogenous control; β recovers. One HIGH remains: missing storm-onset pre-trend check
3	0	2	4	medium	0.01	14	12	event-study pre-trend added (flat); FE tightened to zone+hod+date. No HIGH; mediums on sample filters
4	0	0	2	low	0.01	14	13	robustness filters reconciled; referee files no HIGH/MED. Loop converges below the severity bar; report regenerated

endogenous_spec	s13 (same-hour citywide demand control; post-treatment)
report_beta_before	0
clean_spec	s02 (identical FE, no demand control)
report_beta_after	0.01
source	figures_d D4 spec curve (real warehouse estimation)

The shared guardrails

The two philosophies run under one discipline, and it is not optional — it is what separates an automatic-research system from an unsupervised one. The guardrails are tool-neutral; they are project policy, and every prior unit was rehearsing them.

Report, don’t act (unattended default). A loop running while you sleep reads, reasons, fixes the mechanical, and reports. It does not act on anything substantive on its own. This is E2’s lab-manager discipline — the scheduled chore that surfaces a problem rather than silently “correcting” it — turned on the loop itself.
A hard budget cap. Tokens are the loop’s fuel and a referee will always find something. Without a ceiling, the loop optimizes a result that stopped improving three iterations ago. The cap is a number you set before you start, visible in the loop, enforced regardless of severity.
A human gate on substantive decisions only. This is the line, and it is worth stating precisely below — because a gate on everything is just you doing the work again, and a gate on nothing is an agent quietly rewriting your claims.

Human-gated versus mechanical

The triage rule is the heart of the system. Findings sort into two piles, and the pile decides who acts:

Finding	Example	Who handles it	Why
Mechanical	a stale rename map, a column the schema renamed, a broken cross-reference in an exhibit	the loop, on its own	there is one correct answer and no claim changes
Substantive	dropping a specification, reweighting the sample, the wording of the abstract or a conclusion	a human gate — the loop pauses and asks	the answer is a judgment, and acting on it changes what the paper claims

A mechanical fix has a right answer the machine can reach. A substantive fix changes what the result says, and “what the result says” is the one thing a one-person lab cannot outsource without ceasing to be the author. The loop in the scenario below files exactly two findings — a LOW rename and a HIGH endogenous control — and routes each to its pile in front of you.

The named failure mode

The signature failure of an automatic-research system has a name: convergence theatre. A loop pointed at “make the result look clean” will make it look clean — by dropping the inconvenient specification, shrinking the sample until a pre-trend passes, or softening the abstract until no number contradicts it. Every one of those reads as diligence in a transcript. It is the D3 goals game and D4 metric gaming failure, now operating across the whole lab at once and with the report as its alibi.

The system is built to make convergence theatre hard, not impossible: the referee argues from evidence the loop cannot fabricate (file, line, number); the human gate keeps every claim-changing decision in a person’s hands; and the budget cap stops the loop from grinding the result smooth. A loop that converges because it talked its way past the referee has not improved the result — it has staged a play about improving it, and the guardrails are the audience that does not applaud.

The report, regenerated

The written paper is not the deliverable. It is one of the system’s outputs, rebuilt from the artifacts every iteration so it can never drift from the numbers behind it. make report reads the current results/, the figures, and the tables, and regenerates the document — the same make report the loop calls after every re-run. The lab writes the report in one of two lanes, and the agents draft into whichever you chose:

Claude Code

A LaTeX lane: the paper is reports/paper.tex, the figures and tables are \input{}-ed from the build, and the agent drafts the methods, the results narration, and the figure captions from the artifacts — never the abstract, never a headline claim. You write those. The build is one command and the document is plain text you version, so a regenerated report is a reviewable diff, not a mystery.

Codex

A Quarto lane: the paper is reports/paper.qmd, the figures and tables are produced by code chunks that read the current results/, and the agent drafts the prose around them from the artifacts — again, never the abstract or a headline claim. Rendering is one command; because the exhibits are generated from the live results, a stale number in the prose is impossible by construction.

Either lane, the division of labor is the same and it is the human-gated line again, applied to writing: the agents draft from the artifacts; you own the claims and the abstract. A figure is a fact the build pulls in; a claim is a judgment you sign. The report’s two headline exhibits — the demand map and the elasticity table — are regenerated from the panel every iteration, so the document the loop ships always shows the result the loop currently holds:

Where the city hails: demand by taxi zone — Mean yellow-cab pickups per hour by NYC taxi zone over the course's 2024 slice, traced on the simplified TLC zone geometry. Demand spans 0.00 to 216 pickups/hour across 263 zones; the busiest is Midtown Center (Manhattan, 216/hr). Warm sequential ramp (paper-ochre-ink), consistent with the panel heat map. The F1 report's headline map exhibit. Illustrative run on the course's data slice.

vmin_pickups_per_hr	0
vmax_pickups_per_hr	215.96
n_zones_with_demand	263
busiest_zone	Midtown Center
busiest_borough	Manhattan
busiest_mean_pickups	215.96

What weather does to demand, by borough — Fixed-effects weather elasticities of log demand by borough (zone + hour-of-day + date FE). Rain's effect is small but positive where identified — Manhattan +0.015 log-points per mm/h [+0.010, +0.020]; temperature and snow are mostly noisy at this slice. The F1 report's results table. Illustrative run on the course's data slice.

borough	n_zone_hours	precip_beta	precip_ci_lo	precip_ci_hi	temp10_beta	temp10_ci_lo	temp10_ci_hi	snow_beta	snow_ci_lo	snow_ci_hi
Manhattan	125,165	0.01	0.01	0.02	0.02	-0.01	0.04	0.06	0.02	0.11
Brooklyn	51,930	0.01	0.01	0.02	-0.01	-0.04	0.02	-0.04	-0.11	0.03
Queens	50,378	-0	-0.01	0.01	0.03	-0	0.06	-0.02	-0.09	0.05
Bronx	20,346	0	-0	0.01	-0.02	-0.05	0.02	-0.02	-0.1	0.06
Staten Island	280	-0	-0.02	0.02	-0.04	-0.19	0.1	-0.01	-0.26	0.23
All boroughs	248,099	0.01	0.01	0.01	0	-0.02	0.02	-0	-0.04	0.03

The replication package

The last output is the one that makes the rest trustworthy: a replication package that passes its own fresh-clone self-test. make replicate clones the repo into a clean checkout, rebuilds the environment from uv.lock, runs the analysis headlessly (E1’s callable agent — claude -p / codex exec, no human at the keyboard), and checks the regenerated results against the committed manifest hashes. It is the E1 reproducibility test promoted to gate the whole paper: if the result does not rebuild from nothing, the package fails, and a result that cannot be rebuilt is not a result you can defend. The loop produces the numbers; the replication package proves a stranger can reproduce them.

Guided Run — Closing the Loop: the research that improves itself

Field Terminal — session: f1-research-loop Claude Code

claude

The simulator needs JavaScript. The full transcript of this run is described in the lesson text above — nothing below is required reading.

Guided Run — Closing the Loop: the research that improves itself

Field Terminal — session: f1-research-loop Claude Code

claude

The simulator needs JavaScript. The full transcript of this run is described in the lesson text above — nothing below is required reading.

Field Assignment

Artifact make replicate passes from a fresh clone — loop converged, report regenerated, package self-tests green

Close the loop on the weather-mobility result, then prove the whole system rebuilds from nothing.

Wire the iteration. Write the research loop in your primary tool — the /loop prompt that runs fleet → referee → triage → re-run → report, or the Goal Mode / @codex issue loop with the same stages. State the convergence condition (worst severity 0), the human gate on HIGH findings, and the hard budget cap explicitly.
Run it to convergence. Let it iterate. When the referee files the HIGH finding — the endogenous control from D4 — the loop must STOP and route it to you; approve dropping the post-treatment control, then let the affected lane re-run. Auto-fix the LOW finding. Watch the severity fall and the headline coefficient recover from +0.004 to +0.009.
Regenerate the report. make report rebuilds it from the current artifacts — one component, regenerated every iteration. You own the abstract and the claims; the agents draft the rest from the figures and tables. The headline map and the elasticity table are exhibits the build pulls in, not prose you retype.
Build the replication package. make replicate runs the analysis headlessly from a clean checkout and proves it reproduces — the E1 self-test, now gating the whole paper.
Account for it. Total the roster, read the failure log, and print the Lab Charter. The loop’s own record — how many iterations it ran, how far severity fell, how many times it paused for you — is part of the honest accounting, not a victory lap.

make replicate is the capstone milestone: it passes only when the loop converged (worst severity 0), the report regenerated from the artifacts, and the package rebuilds the result from a fresh clone. Passing it means the lasting artifact is no longer the last clean table — it is the system that keeps producing clean tables after you have gone home.

Milestone gate · make replicateadvances F1

The research loop is wired in your primary tool — fleet → referee → triage → re-run → regenerate report → score convergence — with the convergence condition (worst severity 0), the human gate on HIGH findings, and a hard budget cap all stated explicitly
/loop running the whole iteration as a recurring prompt, or Goal Mode / an @codex issue loop with the same stages. The cap is a number you commit to before the loop starts.
The loop ran to convergence: the referee's HIGH finding (the endogenous control from D4) STOPPED the loop and routed to you; you approved dropping the post-treatment control and the affected lane re-ran headlessly
The mechanical LOW finding (a rename) auto-fixed; only the claim-changing HIGH paused for a person. Severity fell 2 → 0 across the iterations.
The headline precipitation coefficient recovered from a biased +0.004 to an honest +0.009 once the post-treatment control came off — the estimate weakened honestly
The tight CI was the bug, not precision: the same-hour demand control absorbed the variance precip works through.
The report regenerated from the current artifacts — `make report`, one component rebuilt every iteration, so the prose can never drift from the numbers it cites; you own the abstract and the claims
The headline map and the elasticity table are exhibits the build pulls in, not prose you retype.
The replication package self-tested: `make replicate` ran the analysis headlessly from a clean checkout and reproduced the result
The E1 fresh-clone self-test, now gating the whole paper.
The lab is accounted for: the roster totaled, the failure log read (including the loop's human gate), and the Lab Charter printed with the loop's improvement record
The honest column is the one that earns trust — a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both] 〜〜

A loop with no budget cap is a grant-shredder. A referee will always find something, and an objective framed as “keep improving” has no natural fixed point — so the loop iterates on a result that stopped moving three passes ago, billing tokens the whole way. The cap is a number you commit to before the loop starts; it is the only stopping rule that holds when convergence does not.
[both] 〜〜

Gating everything is just doing the work yourself; gating nothing lets the agent rewrite your claims. The system earns its keep only at the line: mechanical findings (renames, broken references) proceed unattended, substantive ones (dropping a spec, the abstract’s wording) pause for a person. Mislabel a claim-changing decision as mechanical and the loop will quietly edit what the paper asserts; mislabel a rename as substantive and you have rebuilt the Sunday chain you were trying to escape.
[both] 〜〜

Convergence is not correctness. A loop converges when the referee runs out of findings — which it can reach honestly (the result improved) or by theatre (the inconvenient specification was dropped, the sample was shrunk, the claim was softened). Read why it converged, not just that it did: the loop’s record should show severity falling because findings were fixed, with a human in the loop for every fix that changed a claim.
[CC]

Burying the whole iteration in one /loop prompt makes it un-auditable. If the fleet fan-out, the referee call, and the triage rule are an opaque paragraph, you cannot tell a real convergence from a talked-past one. Keep the loop prompt a readable checklist with the stages named, the gate explicit, and the cap visible — the loop is yours to re-read, the way D4’s workflow script was.
[CX]

An @codex issue loop that opens a PR per iteration but auto-merges them has removed the review surface it exists to provide. The PR is the gate; merge it yourself, especially the iteration that touches a claim. A goal run with abort clauses you never see fire is a goal run you are not actually supervising.

Check Your Bearings

F1 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

(noted in your field journal as an override)

The interactive check needs JavaScript — without it this section shows only the quiz cover. The lesson text above is complete without the quiz; answers and journal recording require JavaScript.

The frontier — and where you go next

The system you built is composed of primitives that are still moving. /loop and Goal Mode are converging on the same shape; @codex issue loops and Routines are both crossing into scheduled, reviewable autonomy; referee skills are getting better at demanding evidence. Treat the loop you wrote as a design you will revise, not a finished tool — the contracts between the parts are the durable thing; the primitives driving them will change under you.

Watch this space

The vendors are converging on the research loop from both ends — Claude’s local /loop plus Routines toward scheduled autonomy, Codex’s Goal Mode plus @codex toward reviewable cloud iteration. The architecture in this lesson — fleet under contract, isolated referee, headless re-run, regenerated report, human gate on claims — outlives whichever primitive wins, because it is project discipline, not a feature.

The system is also domain-agnostic. Nothing in the loop, the contracts, or the gate is specific to taxis and weather. Swap the panel for hospital admissions and air quality, or store traffic and local events, and the same architecture holds: a fleet under a results contract, a referee that argues from evidence, a report regenerated from artifacts, a human at the claims. The domain-swap is the real test of whether you learned a system or a recipe — port it, and the parts that break are the ones you were leaning on a person to hold together.

The cumulative final exam

This is the whole course, both tools, every question type — sampled across Units A through F. It is not gated; take it when you are ready to see what stuck.

Check Your Bearings

F1 · 14 questions · unlimited retries, no timer

Question 1Choose one
An agent's context window has filled with the back-and-forth of a long exploratory session and its answers are drifting. What is the right move before continuing?
Compact or summarize-and-continue — collapse the history to its decisions and keep working with a clean context
Paste the entire transcript back in so the agent does not forget anything
Lower the model's reasoning effort so it runs cheaper
Question 2Match the dialects
Match each tool to the file where its always-on lab manual lives.
Claude Code's project manual
Codex's project manual
Question 3Read the config
A pipeline-profile agent is launched to run an overnight estimation. Given the profile below, what happens when it tries to write outside results/?
```
profile: pipeline
writes: [results/, data/processed/]
network: off
```
The write is refused — the profile scopes writes to results/ and data/processed/, and a write elsewhere is blocked
The write succeeds with a warning — profiles are advisory
The agent asks for approval before writing anywhere
Question 4Choose one
What is the durable value of writing a procedure as a skill rather than re-prompting it each session?
The lab's judgment is codified so it outlives any one session — the same method runs the same way every time it is invoked
Skills run faster than prompts because they skip the model
Skills remove the need for an instruction file
Question 5Put in orderdialect check — Claude Code
A PostToolUse contract hook guards a data write. Order the events from the agent's action to the outcome.
1. The agent runs a tool that writes data/processed/panel.parquet
2. The failure surfaces to the agent, which must fix the write before proceeding
3. The hook exits non-zero on a violation, blocking the step
4. The hook script checks the schema contract (columns, null rates, row delta)
5. The PostToolUse hook fires on the completed write
Question 6Choose onestretch
You want the agent to explore a DuckDB warehouse's schema and build the analysis panel. When is registering an MCP server the right call rather than just handing it a SQL file?
When the agent needs to query the live schema repeatedly and interactively — MCP gives it a durable, typed connection to explore, not a one-shot script
Always — MCP is strictly better than running SQL directly
Never for data work — MCP is only for web browsing
Question 7Choose all that apply
Which statements about giving each parallel workstream its own worktree are correct? (Select all.)
A code-touching variant gets its own worktree so its edits never collide with another lane's
Results-only lanes that obey the results contract can safely share one tree, because the contract guarantees they never write the same path
A worktree makes the estimation itself run faster
A long run that dies mid-write in its own worktree leaves a deletable directory, not corrupted state in your main tree
Question 8Predict the outputdialect check — Claude Code
An estimation log updates every ~20 minutes. You supervise it with a /loop set to a 2-minute cadence. What do you wake up to?
```
/loop 2m Read the tail of results/logs/run.log. If SEs diverge, the
likelihood plateaus, or the process died, stop and report. Otherwise
reply OK and nothing else.
```
A clean single report — frequent checks just mean faster detection at no cost
Dozens of token-burning OKs that bury the one report that matters
Nothing — re-reading an unchanged log is free
Question 9Spot the flaw — mark the suspect linesstretch
A robustness specification reads suspiciously precise. One line on the right-hand side is doing something no honest control should. Which line is the flaw?
```
# spec s13: precipitation elasticity of log demand
y = log_pickups
rhs = [precip_mm]
rhs += fixed_effects(['zone', 'hour_of_day', 'date'])
rhs += [log_city]   # same-hour citywide demand
fit(y, rhs, cluster='zone')
```
Question 10What happens next?
A reviewer asks whether your result reproduces. You run the analysis headlessly from a clean clone and it builds the panel, fits the models, and writes the figures with no human at the keyboard. What does this prove, and what is the next step?
```
$ git clone … weather-mobility && cd weather-mobility
$ make replicate
→ environment rebuilt from uv.lock
→ agent (headless) built panel, fit models, wrote figures
→ results match the committed manifest hashes. PASS.
```
It proves the result reproduces from nothing — the agent ran as a callable function in a fresh checkout — so the replication package is the artifact you ship alongside the paper
It proves the figures look right, but reproduction still needs a human to re-run each step by hand
It proves nothing until the paper is peer-reviewed
Question 11Choose one
You put the lab's monthly data-refresh chore on a schedule. Under the 'report, don't act' discipline, what does the scheduled run do when it finds the new month's file has an unexpected extra column?
It stops, reports the schema change, and waits — it does not silently 'fix' the panel or drop the column on its own
It drops the extra column and continues, since the schema must match the old panel
It deletes the new file so the panel stays unchanged
Question 12Choose one
Why is a hard budget cap non-negotiable in the research loop, even when a clean convergence condition is also set?
Because a referee always finds something — the cap is the stopping rule that holds when convergence does not
Because the cap makes each iteration run faster
Because without it the loop cannot open a pull request
Question 13Choose all that applystretch
The F1 system composes the prior units as contracts between parts. Which of these are tool-neutral project discipline that hold regardless of which primitive drives the loop? (Select all.)
The results contract — each fleet lane writes results/{id}/ and nothing else
The isolated referee that argues from evidence (file, line, number)
The human gate on substantive decisions
The exact slash-command syntax of the loop primitive
Question 14Match the dialects
Match each job in the research loop to how each tool reaches it.
Claude Code drives the iteration by…
Codex drives the iteration by…
Either tool gates a substantive decision by…

The interactive check needs JavaScript — without it this section shows only the quiz cover. The lesson text above is complete without the quiz; answers and journal recording require JavaScript.

The payoff, accounted

A system you cannot account for is a system you cannot trust. The three widgets below are the honest ledger of the whole lab — and “honest” is load-bearing: a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.

First, the roster totaled — every position the lab absorbed, with the human hours it would have cost against the agent hours it took, and your own measured hours where you logged them:

The lab, totaled

Every lesson’s roster row, summed into the lab you didn’t hire. The drawer shows these one at a time; here they ink in together.

Lesson	Role absorbed	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: no interventions logged yet. Your measured hours override these estimates in your own roster.

Then the failure log — every place a human had to step in. This is the column the marketing leaves out and the one that earns trust: the hook that caught schema drift, the referee that caught the endogenous control, the loop iteration that paused for your approval. A lab that hides its interventions is selling something.

The honest column

Every place a human had to step in — what predicted it, which mechanism caught it. The accounting is browsable so the one-person-lab claim stays truthful.

No interventions logged yet. The simulations file these as they happen — and a clean column is itself an honest result.

And finally the Lab Charter — the printable record of what the system absorbed, what it caught, and how far the loop improved the result unattended. Run the research-loop scenario above first; its two iterations fill the loop’s record here, so the charter can state plainly how many iterations ran, how far severity fell, and how many times a person was gated. That line — the loop ran, the result improved, a human held the claims — is the capstone’s whole argument in one sentence:

Name on the charter

Lab Charter

Your One-Person Research Lab

chartered June 12, 2026

A lab of one, doing the work of a hall of specialists — each role absorbed by a feature you installed and can describe. This certificate reports what you have actually done, measured from your own field journal. It is accurate, not aspirational.

The roster — 0 of 9 positions absorbed

the data manageropen position
Unfilled — complete C2 to absorb this role.
the methodologistopen position
Unfilled — complete C1 to absorb this role.
the data engineeropen position
Unfilled — complete C3 to absorb this role.
the RA poolopen position
Unfilled — complete D1 to absorb this role.
the overnight RAopen position
Unfilled — complete D3 to absorb this role.
the adviseropen position
Unfilled — complete D1 to absorb this role.
the refereeopen position
Unfilled — complete D4 to absorb this role.
the lab manageropen position
Unfilled — complete E2 to absorb this role.
the reproducibility checkeropen position
Unfilled — complete E1 to absorb this role.

Milestones reached

No milestones punched yet — the route begins at A1.

Incidents your system caught

schema drift— caught by the hook (not yet reached)
an endogenous control— caught by the referee (not yet reached)

The loop’s improvement record

The loop has not yet run — F1’s research-loop scenario fills this record.

The human moved up a level. You stopped running the regressions and built the system that runs them, sends the results to something whose job is to doubt them, and waits for you at the one place a person must stand — the claims. The last clean table was never going to be the lasting artifact. The system that keeps producing clean tables, on a loop, refereed, after you have gone home — that is.

Field journal

Record the loop’s run: how many iterations it took to converge, which finding it fixed on its own and which it gated for you, and what the headline coefficient was before and after the endogenous control came off. Then write the one sentence that is the capstone’s argument — what the lasting artifact actually is.

as of June 2026

The research loop is a real philosophical split, and this page teaches it as one rather than papering over it: Claude Code composes the iteration from a local /loop you author and can re-read, with Routines as the scheduled analogue; Codex hands the runtime an objective in Goal Mode or runs the cycle as an @codex issue loop that opens a reviewable PR per iteration. Local scripted recurrence versus managed objective-driven delegation — neither tool offers the other’s primitive natively, and the loop-with-an- iteration-shaped-prompt approximates Goal Mode about as well as a checklist approximates a brief. What is tool-neutral is everything that matters most: the results contract, the isolated referee, the headless re-run, the regenerated report, the budget cap, and the human gate on substantive decisions. Those are project discipline, and they would converge the same result no matter which primitive drove the loop.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the the wall — the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
the you, working an order of magnitude faster — but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the the lab manual nobody writes — the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the the whole lab, orchestrated — the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.

The Pain

Why / When

Mechanics

The system assembled

The research loop

loop_iterations

real_anchor_endogenous_control

The shared guardrails

Human-gated versus mechanical

The named failure mode

The report, regenerated

✳ Claude Code

⬡ Codex

zone_mean_demand 265 rows

geometry 263 rows

demand_range

analysis_panel 248,099 rows

elasticities_by_borough

The replication package

Guided Run — Closing the Loop: the research that improves itself

Guided Run — Closing the Loop: the research that improves itself

Pitfalls & Gotchas

The frontier — and where you go next

The cumulative final exam

The payoff, accounted

The lab, totaled

The honest column

The roster — 0 of 9 positions absorbed

Milestones reached

Incidents your system caught

The loop’s improvement record

Parity note

Claude Code

Codex