F1 capstone ~150 min

Automatic Research: Closing the Loop

Absorbs: the whole lab, orchestrated — the PI who designs the system instead of doing the work

Advances F1

The Pain

The result is finished, which is to say it is finished until you touch it again. It is Sunday. The estimate is on disk, the figures are current, the draft reads cleanly, and you know — the way you know the milk is about to turn — that the whole arrangement is true only as of the last time you ran it by hand. A reviewer’s question on Tuesday will mean a new specification, which means a re-estimate, which means a new table, which means a paragraph rewritten to match the table, which means re-reading the abstract to check it still says what the numbers now say. Each of those steps is a thing you do, in order, alone, and the report is correct for exactly as long as the chain holds.

You have done this chain forty times. You are good at it. That is the problem: being good at a chain of manual passes means the quality of the work is capped at the quality of your last manual pass, on a Sunday, tired, with the milk turning. The senior people you trained under did not work this way at the end. They stopped running the regressions. They designed the thing that ran the regressions, sent the results to someone whose job was to doubt them, and read what came back. Their lasting contribution was not the last clean table. It was a process that kept producing clean tables after they had gone home — a machine for being doubted on schedule, which is most of what a result needs and the one thing a tired person on a Sunday cannot reliably supply.

Why / When

Everything in this course was a part. A1–A2 made the agent obey a brief. Unit B gave the lab a written manual and a reproducible home. Unit C made the rules enforce themselves and taught the agent the schema. Unit D dispatched a fleet under a results contract and stood up a referee that argues from evidence. Unit E made the agent a callable function, put the chores on a schedule, and shipped the methodology as a kit. F1 composes all of it into one system that runs the research on a loop and improves its own result.

The thing being absorbed here is not another role. It is the posture of the principal investigator who has stopped doing the research and started designing the system that does it: who writes the loop, sets the budget, decides which questions a machine may answer and which must pause for a person, and then reads what the system brings back. The research loop is the payoff of the whole curriculum because it is the first artifact that keeps improving the result after you leave the keyboard — and because the discipline that makes it safe (report, don’t act; a hard budget cap; a human gate on substantive decisions only) is the same discipline every prior unit was rehearsing.

It runs at the very end of the pipeline — after the panel is built, the specifications are written, and the referee exists — because a loop is only as good as the parts it orchestrates. Closing the loop is the last thing you build, and the first thing that outlives you.

Contrary winds

Not for: a one-shot answer you will never revisit — a back-of-envelope number for a meeting tomorrow does not need a loop, a referee, or a replication package, and standing one up for it is the expensive theatre of building a printing press to write a postcard.

Mechanics

Field note

This is a capstone in orchestration, not a new statistical method — the estimation it loops over can be fit in Python or R without changing a line of the loop, the contracts, or the gate. So this page declares no R variants and spends its budget on the architecture: how the parts compose, and where a person stays in the circuit. The statistics it improves are the ones you already built in Units C and D.

The system assembled

Read the prior five units not as topics but as contracts between parts. Each unit shipped one component and, more importantly, one interface the next part could rely on without asking permission:

Unit | The part | The contract it exposes
A | a directed agent | a brief in, an artifact out
B | the lab manual + reproducible home | one command rebuilds the environment from text
C | self-enforcing rules + a schema-aware agent | every write is checked; the data shape is known
D | a fleet + a results contract + a referee | each lane writes results/{id}/, nothing else; an isolated critic argues from evidence
E | a callable agent + schedules + a kit | claude -p / codex exec runs the analysis headless from a clean clone

The system is what you get when these interfaces line up end to end: the fleet (D4) produces estimates under the results contract; the referee (D4) reads those results plus the draft and files evidenced findings; the headless agent (E1) re-runs only the affected work; the report is rebuilt from the artifacts so it cannot drift; the replication package (E1) proves the whole thing rebuilds from nothing. No part reaches inside another. That is the only reason the next thing — a control loop wrapped around all of them — is even possible.
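The D contract ("each lane writes results/{id}/, nothing else") becomes checkable once the orchestrator can list the files a lane touched. What follows is a minimal sketch under that assumption; the function name `violations` and the example paths are hypothetical illustrations, not part of the course kit.

```python
from pathlib import PurePosixPath


def violations(spec_id: str, written: list[str]) -> list[str]:
    """Return the paths a lane wrote outside its own results/{spec_id}/ slot."""
    slot = PurePosixPath("results") / spec_id
    bad = []
    for raw in written:
        path = PurePosixPath(raw)
        # Absolute paths and ".." components can escape the slot, so reject them outright.
        if path.is_absolute() or ".." in path.parts:
            bad.append(raw)
            continue
        # A compliant file has the slot directory somewhere among its parents.
        if slot not in path.parents:
            bad.append(raw)
    return bad


# Hypothetical write lists for lane s02.
ok = violations("s02", ["results/s02/estimates.json", "results/s02/logs/fit.log"])
leak = violations(
    "s02",
    ["results/s02/estimates.json", "results/s13/estimates.json", "reports/paper.tex"],
)
```

Here `ok` comes back empty and `leak` names the two paths outside the slot. A gate of this shape lets the loop refuse a lane's output without reading anything the lane produced.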

System Player film — The Research Loop
One iteration of the research loop, drawn as a closed cycle: the estimation and robustness fleet runs under the D4 results contract; an isolated referee reviews the results and the report draft; findings are triaged by severity (one high-severity finding routed to a human gate, one low-severity finding auto-fixed); the specifications, cleaning rules and report are revised in D2 worktrees; only the affected work is re-run headless; the report is regenerated; and convergence is scored. If the worst severity is still over threshold, a loop-back edge returns to the top; otherwise the loop stops.
Stage labels: Estimation + robustness fleet → Isolated referee → Triage by severity → Revise in worktrees → Re-run, headless → Regenerate report → Score convergence.

Step 1 of 8.

This is the whole lab as one machine. /loop opens an iteration by running the estimation + robustness fleet — every specification fanned out under the D4 results contract, each writing to its own results/{spec_id}/estimates.json. Twelve clean answers, no write race. That is the input to everything that follows.

The film above is the architecture in one closed cycle. The rest of this lesson is the two ways to drive that cycle, the discipline that keeps it safe, and the accounting that proves it ran.
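The "no write race" property of the fan-out step follows from the directory layout alone: if every lane owns a distinct results/{spec_id}/ directory, concurrent lanes never share an output file. The sketch below illustrates this with stand-in lanes that write placeholder payloads; `run_lane`, `fan_out`, the spec ids, and the JSON fields are all assumptions made for the example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_lane(root: Path, spec_id: str) -> Path:
    """Stand-in for one estimation lane: write only to results/{spec_id}/."""
    slot = root / "results" / spec_id
    slot.mkdir(parents=True, exist_ok=True)
    out = slot / "estimates.json"
    # Placeholder payload; a real lane would write its fitted coefficients here.
    out.write_text(json.dumps({"spec_id": spec_id, "status": "placeholder"}))
    return out


def fan_out(root: Path, spec_ids: list[str]) -> list[Path]:
    """Run every lane concurrently; no two lanes share an output path."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda s: run_lane(root, s), spec_ids))


with tempfile.TemporaryDirectory() as tmp:
    spec_ids = [f"s{i:02d}" for i in range(1, 13)]  # twelve hypothetical spec ids
    written = fan_out(Path(tmp), spec_ids)
    n_written = len(written)
    n_distinct_dirs = len({p.parent for p in written})
```

Twelve lanes produce twelve files in twelve distinct directories, so the order in which the lanes finish does not matter.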

The research loop

The loop is the new primitive, and it is the same check-versus-destination split you met supervising overnight runs in D3 — now wrapped around the whole lab instead of a single estimation. One tool composes the iteration from a recurring, self-paced check you author; the other hands the runtime an objective and stopping rules and lets it drive. Both reach the same closed cycle. Neither hides from the other — the contrast is the lesson, so read both spotlights even on your own tool.

Claude Code Your tool

/loop — you orchestrate the iteration

/loop is a recurring, self-paced prompt. Here the prompt is not a one-line log check (that was D3) — it is the entire research iteration, written as a checklist the agent re-runs until a stopping condition holds:

the research loop, in one recurring prompt
> /loop Run one research iteration:
  1. Fan out the specification fleet (workflow over specifications.csv);
     gate each lane on the results contract.
  2. Run the referee subagent over results/ + the report draft. It must
     cite file, line, and number for every finding.
  3. Triage findings by severity. Auto-fix LOW findings (renames, stale
     maps). For any HIGH finding, STOP and ask me to approve the change
     before acting; do not drop a specification on your own.
  4. Re-run only the affected lanes headlessly; regenerate the report
     with `make report`.
  5. If the worst remaining severity is 0, stop (converged). Otherwise
     loop. Hard stop at $40 of spend regardless.

The design burden is the whole iteration written down: what runs, who reviews it, what severity routes where, what counts as converged, and what caps the spend. Three properties make a research loop survivable, and they are the same three that made an overnight loop survivable in D3 — scaled up to the lab:

  • Checkable convergence — “worst severity is 0” is a condition the referee scores, not a vibe. “Make the result better” is not a stopping rule; it is an invitation to optimize forever.
  • Bounded by a gate — the loop may auto-fix mechanical findings; it may not drop a specification, reword a claim, or change the abstract without you. Substantive decisions pause for a person. (The next section names the line.)
  • Capped — the $40 hard stop is load-bearing. A loop with no budget cap and a referee that always finds something is a machine for spending a grant on diminishing returns.

The composability is the capstone’s whole argument: the same /loop primitive that watched a download in D3 now drives a fleet, a referee, a headless re-run, and a report build. You did not buy a research-loop feature. You wrote the lab’s iteration down and handed it to a loop.
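The three properties above (checkable convergence, a gate, a cap) can be read as three exit conditions of one control loop. The sketch below is an illustration of that shape, not how /loop is implemented: `drive`, `Pass`, the `"none"` label for severity 0, and the dollar figures in the usage lines are all assumptions. At a HIGH finding the sketch simply returns to its caller, standing in for the pause where a person approves or rejects the change.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed ordering; "none" plays the role of "worst severity is 0".
SEVERITY = {"none": 0, "low": 1, "medium": 2, "high": 3}


@dataclass
class Pass:
    worst: str    # worst severity the referee reported this pass
    cost: float   # dollars spent on this pass


def drive(
    run_pass: Callable[[int], Pass], cap: float, max_iters: int = 20
) -> tuple[str, int, float]:
    """Repeat run_pass until converged, capped, or gated; return (reason, passes, spend)."""
    spend = 0.0
    for i in range(1, max_iters + 1):
        result = run_pass(i)
        spend += result.cost
        if SEVERITY[result.worst] == 0:
            return ("converged", i, spend)
        if spend >= cap:
            return ("budget_cap", i, spend)   # hard stop regardless of severity
        if result.worst == "high":
            return ("human_gate", i, spend)   # substantive decision: wait for a person
    return ("max_iters", max_iters, spend)


# Hypothetical scripted run: severity falls over three passes at $6 each.
scripted = ["medium", "low", "none"]
outcome = drive(lambda i: Pass(scripted[i - 1], 6.0), cap=40.0)
```

With the scripted passes, `outcome` is `("converged", 3, 18.0)`. A referee that always reports a LOW finding would instead exit on the cap, which is the case the budget exists for.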

Codex Your tool

Goal Mode + @codex — you set the destination

Goal Mode (GA) drives toward an objective for as long as it takes, choosing its own intermediate steps; the GitHub @codex integration runs the same cycle as an issue loop, where each iteration is a cloud task that opens a reviewable pull request. The research iteration becomes a destination plus abort criteria:

the research loop, as a goal + an issue loop
Goal: drive the weather-mobility result to convergence.
One iteration = fan out the specification fleet under the results
  contract, run the referee over results/ and the report draft, triage its
  findings by severity, fix LOW findings, regenerate the report, and score
  convergence (worst remaining severity).
Stopping rules:
  - Converged when the worst referee severity is 0; then stop and report.
  - For any HIGH finding, STOP and open an issue for my approval before
    acting. Never drop a specification or edit the abstract autonomously.
  - Hard stop at $40 of spend.
@codex: run each iteration as a cloud task; open a PR per iteration so the
  diff is reviewable, and tag me on any HIGH finding.

The design burden is destination-shaped: you write the success criterion (severity 0), the abort criteria (the HIGH-finding gate, the budget cap), and the review surface (a PR per iteration). Three properties make the goal survivable, mirroring the loop’s:

  • A measurable objective — “worst severity is 0” is checkable at review time; “improve the robustness” is creative accounting waiting to happen.
  • Explicit gates and aborts — the HIGH-finding pause and the spend cap are not decoration. An objective-driven run with no abort optimizes through the night, including through the decisions it should have woken you for.
  • A reviewable surface — one PR per iteration makes the morning a diff, not an investigation. The @codex issue loop turns each cycle into a unit of review you can approve, request changes on, or close.

The managed delegation is the point: hours of multi-step iteration from one written brief, with the review surface built in — closer to chairing a lab than running a checklist.

Translation guide
Intent | Claude Code | Codex
drive the research iteration | /loop running the whole iteration as a recurring prompt | Goal Mode toward "severity 0", or an @codex issue loop
review each iteration | the loop reports each pass; you read the journal | one reviewable PR per iteration (the @codex loop)
gate a substantive decision | STOP-and-ask step inside the loop prompt | abort clause → open an issue, tag the human, wait
cap the spend | hard-stop budget in the loop prompt | hard-stop budget in the goal + cloud-task limits
score convergence | the loop checks worst severity each pass | the goal’s stopping rule on worst severity

The loop’s improvement record is the figure the whole capstone builds toward: referee severity falling iteration by iteration as findings are triaged and fixed, and the headline precipitation coefficient recovering from a biased +0.004 to an honest +0.009 once the endogenous control is dropped at the human gate:

The loop, converging
Referee findings by severity across four iterations of the F1 research loop: max severity falls high → high → medium → low as findings are triaged and fixed, and the report's headline precipitation β recovers from +0.004 to +0.009 once the loop drops the endogenous, post-treatment same-hour demand control (the real D4 fix). A stylized teaching exhibit — the severity trajectory is illustrative, the β endpoints are real.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · f1-loop-improvement

loop_iterations

iter | high | med | low | max_severity | report_beta_precip | n_specs | n_surviving | note
1 | 2 | 3 | 4 | high | 0 | 13 | 7 | draft conditions on same-hour citywide demand (s13): post-treatment control; referee files endogeneity (HIGH)
2 | 1 | 3 | 5 | high | 0.01 | 13 | 11 | loop drops the endogenous control; β recovers. One HIGH remains: missing storm-onset pre-trend check
3 | 0 | 2 | 4 | medium | 0.01 | 14 | 12 | event-study pre-trend added (flat); FE tightened to zone+hod+date. No HIGH; mediums on sample filters
4 | 0 | 0 | 2 | low | 0.01 | 14 | 13 | robustness filters reconciled; referee files no HIGH/MED. Loop converges below the severity bar; report regenerated

real_anchor_endogenous_control

endogenous_spec s13 (same-hour citywide demand control; post-treatment)
report_beta_before 0
clean_spec s02 (identical FE, no demand control)
report_beta_after 0.01
source figures_d D4 spec curve (real warehouse estimation)

honesty note STYLIZED LOOP EXHIBIT — a teaching illustration, not a claim about a specific recorded run. The per-iteration severity counts (high/medium/low) are an illustrative convergence trajectory for the F1 loop on the course slice; they are NOT a log of an actual referee session. What IS real: the loop's headline fix and its two β endpoints. Iteration 1's report conditions on same-hour citywide demand (D4 spec s13), an endogenous / post-treatment control that collapses the precipitation coefficient to +0.0038; the loop drops that control and the coefficient recovers to +0.0087 (clean spec s02, identical fixed effects). Both numbers come from the same warehouse estimation that produces the D4 specification curve. The exhibit illustrates how an adversarial referee drives the loop below a severity bar; it does not assert these exact counts occurred.
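The max_severity column in the loop_iterations rows is a function of the three counts beside it, which makes the convergence score something a script can recompute rather than a label to take on trust. A small sketch, with the counts as read from those rows (illustrative, per the honesty note) and `"none"` as an assumed label for an empty findings list:

```python
def max_severity(high: int, med: int, low: int) -> str:
    """Worst severity label implied by one iteration's finding counts."""
    if high > 0:
        return "high"
    if med > 0:
        return "medium"
    if low > 0:
        return "low"
    return "none"


# (iter, high, med, low) as read from the loop_iterations rows.
rows = [(1, 2, 3, 4), (2, 1, 3, 5), (3, 0, 2, 4), (4, 0, 0, 2)]
trajectory = [max_severity(h, m, l) for _, h, m, l in rows]
```

The recomputed trajectory is high, high, medium, low, matching the exhibit's caption.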

The shared guardrails

The two philosophies run under one discipline, and it is not optional — it is what separates an automatic-research system from an unsupervised one. The guardrails are tool-neutral; they are project policy, and every prior unit was rehearsing them.

  • Report, don’t act (unattended default). A loop running while you sleep reads, reasons, fixes the mechanical, and reports. It does not act on anything substantive on its own. This is E2’s lab-manager discipline — the scheduled chore that surfaces a problem rather than silently “correcting” it — turned on the loop itself.
  • A hard budget cap. Tokens are the loop’s fuel and a referee will always find something. Without a ceiling, the loop optimizes a result that stopped improving three iterations ago. The cap is a number you set before you start, visible in the loop, enforced regardless of severity.
  • A human gate on substantive decisions only. This is the line, and it is worth stating precisely below — because a gate on everything is just you doing the work again, and a gate on nothing is an agent quietly rewriting your claims.

Human-gated versus mechanical

The triage rule is the heart of the system. Findings sort into two piles, and the pile decides who acts:

Finding | Example | Who handles it | Why
Mechanical | a stale rename map, a column the schema renamed, a broken cross-reference in an exhibit | the loop, on its own | there is one correct answer and no claim changes
Substantive | dropping a specification, reweighting the sample, the wording of the abstract or a conclusion | a human gate: the loop pauses and asks | the answer is a judgment, and acting on it changes what the paper claims

A mechanical fix has a right answer the machine can reach. A substantive fix changes what the result says, and “what the result says” is the one thing a one-person lab cannot outsource without ceasing to be the author. The loop in the scenario below files exactly two findings — a LOW rename and a HIGH endogenous control — and routes each to its pile in front of you.
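The two-pile rule can be written as a default-deny router: only findings on an explicit mechanical list proceed unattended, and everything else waits. The kind labels, the `Finding` fields, and the placeholder evidence strings below are hypothetical; a real referee would attach its own labels and its cited file, line, and number.

```python
from dataclasses import dataclass

# Kinds the loop may fix on its own; anything not listed is treated as substantive.
MECHANICAL = {"stale_rename_map", "renamed_column", "broken_cross_reference"}


@dataclass(frozen=True)
class Finding:
    kind: str       # hypothetical label the referee attaches
    severity: str   # "low", "medium" or "high"
    evidence: str   # file, line, and number the referee cited


def route(finding: Finding) -> str:
    """Return 'auto_fix' only for known-mechanical, non-HIGH findings."""
    if finding.kind in MECHANICAL and finding.severity != "high":
        return "auto_fix"
    return "human_gate"   # default-deny: unknown kinds wait for a person


rename = Finding("stale_rename_map", "low", "<file>:<line>: <number>")
control = Finding("endogenous_control", "high", "<file>:<line>: <number>")
routes = (route(rename), route(control))
```

Defaulting to the gate means a mislabeled or novel finding costs a human a minute of attention, which is a cheaper error than an unattended edit to a claim.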

The named failure mode

The signature failure of an automatic-research system has a name: convergence theatre. A loop pointed at “make the result look clean” will make it look clean — by dropping the inconvenient specification, shrinking the sample until a pre-trend passes, or softening the abstract until no number contradicts it. Every one of those reads as diligence in a transcript. It is the D3 goals game and D4 metric gaming failure, now operating across the whole lab at once and with the report as its alibi.

The system is built to make convergence theatre hard, not impossible: the referee argues from evidence the loop cannot fabricate (file, line, number); the human gate keeps every claim-changing decision in a person’s hands; and the budget cap stops the loop from grinding the result smooth. A loop that converges because it talked its way past the referee has not improved the result — it has staged a play about improving it, and the guardrails are the audience that does not applaud.

The report, regenerated

The written paper is not the deliverable. It is one of the system’s outputs, rebuilt from the artifacts every iteration so it can never drift from the numbers behind it. make report reads the current results/, the figures, and the tables, and regenerates the document — the same make report the loop calls after every re-run. The lab writes the report in one of two lanes, and the agents draft into whichever you chose:

Claude Code

A LaTeX lane: the paper is reports/paper.tex, the figures and tables are \input{}-ed from the build, and the agent drafts the methods, the results narration, and the figure captions from the artifacts — never the abstract, never a headline claim. You write those. The build is one command and the document is plain text you version, so a regenerated report is a reviewable diff, not a mystery.

Codex

A Quarto lane: the paper is reports/paper.qmd, the figures and tables are produced by code chunks that read the current results/, and the agent drafts the prose around them from the artifacts — again, never the abstract or a headline claim. Rendering is one command; because the exhibits are generated from the live results, a stale number in the prose is impossible by construction.

Either lane, the division of labor is the same and it is the human-gated line again, applied to writing: the agents draft from the artifacts; you own the claims and the abstract. A figure is a fact the build pulls in; a claim is a judgment you sign. The report’s two headline exhibits — the demand map and the elasticity table — are regenerated from the panel every iteration, so the document the loop ships always shows the result the loop currently holds:

Where the city hails: demand by taxi zone
Mean yellow-cab pickups per hour by NYC taxi zone over the course's 2024 slice, traced on the simplified TLC zone geometry. Demand spans 0.00 to 216 pickups/hour across 263 zones; the busiest is Midtown Center (Manhattan, 216/hr). Warm sequential ramp (paper-ochre-ink), consistent with the panel heat map. The F1 report's headline map exhibit. Illustrative run on the course's data slice.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · f1-zone-choropleth

zone_mean_demand 265 rows

SELECT location_id, any_value(borough) AS borough, avg(pickups) AS mean_pickups, count(*) AS n_hours FROM panel_zone_hour GROUP BY location_id

geometry 263 rows

out/geo/taxi_zones_simplified.geojson (simplified TLC taxi_zones; A3 map export; presentation join on LocationID, never a spatial join)

demand_range

vmin_pickups_per_hr 0
vmax_pickups_per_hr 215.96
n_zones_with_demand 263
busiest_zone Midtown Center
busiest_borough Manhattan
busiest_mean_pickups 215.96

honesty note Illustrative run on the course's 2024-02/03/06 slice: per-zone mean pickups/hour from panel_zone_hour (zero cells kept, so a quiet zone reads low, not missing). The map traces the simplified zone geometry exported for the A3 overworld map (out/geo/taxi_zones_simplified.geojson) and joins it to the warehouse on LocationID — a presentation join only; the warehouse never joins on geometry. Airport / outside-NYC lookup rows have no panel demand and draw as empty paper, not zero. The ramp is gamma-compressed (0.5) for legibility, so color encodes rank-ish magnitude, not a linear scale; the legend ticks are the real pickup values.
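To see why a gamma-compressed ramp encodes "rank-ish magnitude, not a linear scale", it helps to compute where a value lands on the ramp. The sketch assumes the compression is a power of 0.5 applied to the value normalized by the maximum; the figure pipeline's exact formula is not shown in this lesson, so treat `ramp_position` as a guess at the idea rather than the pipeline's code.

```python
def ramp_position(value: float, vmax: float, gamma: float = 0.5) -> float:
    """Map a demand value to a 0..1 position on the color ramp with gamma compression."""
    if vmax <= 0:
        raise ValueError("vmax must be positive")
    clipped = min(max(value, 0.0), vmax)
    return (clipped / vmax) ** gamma


vmax = 215.96                                      # vmax_pickups_per_hr from demand_range
quarter = ramp_position(vmax / 4, vmax)            # a zone at 25% of the maximum
linear = ramp_position(vmax / 4, vmax, gamma=1.0)  # the same zone on a linear ramp
```

Under this assumption a zone at a quarter of the busiest zone's demand sits halfway up the ramp instead of a quarter of the way, which is why quiet zones remain distinguishable and why the legend ticks, not the colors, carry the real magnitudes.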

What weather does to demand, by borough
Fixed-effects weather elasticities of log demand by borough (zone + hour-of-day + date FE). Rain's effect is small but positive where identified — Manhattan +0.015 log-points per mm/h [+0.010, +0.020]; temperature and snow are mostly noisy at this slice. The F1 report's results table. Illustrative run on the course's data slice.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · f1-elasticity-table

analysis_panel 248,099 rows

SELECT location_id, borough, ts_local, pickups, precipitation, temperature_2m, snowfall FROM panel_zone_hour WHERE borough IN ('Manhattan','Brooklyn','Queens','Bronx','Staten Island')

elasticities_by_borough

borough | n_zone_hours | precip_beta | precip_ci_lo | precip_ci_hi | temp10_beta | temp10_ci_lo | temp10_ci_hi | snow_beta | snow_ci_lo | snow_ci_hi
Manhattan | 125,165 | 0.01 | 0.01 | 0.02 | 0.02 | -0.01 | 0.04 | 0.06 | 0.02 | 0.11
Brooklyn | 51,930 | 0.01 | 0.01 | 0.02 | -0.01 | -0.04 | 0.02 | -0.04 | -0.11 | 0.03
Queens | 50,378 | -0 | -0.01 | 0.01 | 0.03 | -0 | 0.06 | -0.02 | -0.09 | 0.05
Bronx | 20,346 | 0 | -0 | 0.01 | -0.02 | -0.05 | 0.02 | -0.02 | -0.1 | 0.06
Staten Island | 280 | -0 | -0.02 | 0.02 | -0.04 | -0.19 | 0.1 | -0.01 | -0.26 | 0.23
All boroughs | 248,099 | 0.01 | 0.01 | 0.01 | 0 | -0.02 | 0.02 | -0 | -0.04 | 0.03

honesty note Illustrative run on the course's 2024-02/03/06 slice. Outcome is log(pickups) over nonzero zone-hours so coefficients are log-points; precip is per mm/h, temperature per 10 °C, snow per cm/h. Zone + hour-of-day + date fixed effects absorbed by iterative within-demeaning (FWL), dof-corrected for the absorbed dummies; 95% classical CIs. β is rubric-flagged where the CI excludes zero. Snow has few nonzero hours in this three-month slice, so its CIs are wide — that is honest imprecision, not a measured null. Staten Island carries only 280 nonzero zone-hours (yellow cabs barely serve it); its row is near-uninformative by construction.
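Because the outcome is log(pickups), a coefficient in log-points converts to a percent change through exp(β) − 1, and the "CI excludes zero" flag is a one-line check. The sketch applies both to the Manhattan precipitation estimate quoted in the caption (+0.015, CI [+0.010, +0.020]); the function names are hypothetical.

```python
import math


def pct_change(beta: float) -> float:
    """Convert a log-point coefficient into the implied percent change in demand."""
    return (math.exp(beta) - 1.0) * 100.0


def excludes_zero(ci_lo: float, ci_hi: float) -> bool:
    """True when a confidence interval lies strictly on one side of zero."""
    return ci_lo > 0 or ci_hi < 0


# Manhattan precipitation estimate and 95% CI quoted in the caption.
beta, lo, hi = 0.015, 0.010, 0.020
manhattan_pct = pct_change(beta)
flagged = excludes_zero(lo, hi)
```

For coefficients this small the percent change is nearly the coefficient times 100 (about 1.5% more pickups per additional mm/h of rain), and the interval sits entirely above zero, so the estimate is flagged.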

The replication package

The last output is the one that makes the rest trustworthy: a replication package that passes its own fresh-clone self-test. make replicate clones the repo into a clean checkout, rebuilds the environment from uv.lock, runs the analysis headlessly (E1’s callable agent — claude -p / codex exec, no human at the keyboard), and checks the regenerated results against the committed manifest hashes. It is the E1 reproducibility test promoted to gate the whole paper: if the result does not rebuild from nothing, the package fails, and a result that cannot be rebuilt is not a result you can defend. The loop produces the numbers; the replication package proves a stranger can reproduce them.
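The final step of the self-test, checking regenerated results against committed hashes, reduces to a small comparison. The sketch assumes the manifest is a mapping from relative path to SHA-256 hex digest; that format, the file names, and the helper names are illustrative assumptions rather than the kit's actual manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large artifacts stay cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Relative paths whose regenerated file is missing or hashes differently."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "results").mkdir()
    (root / "results" / "a.json").write_text('{"beta": 1}')
    manifest = {"results/a.json": sha256_of(root / "results" / "a.json")}
    clean = mismatches(root, manifest)      # fresh rebuild matches the manifest
    (root / "results" / "a.json").write_text('{"beta": 2}')
    drifted = mismatches(root, manifest)    # a changed result is caught
```

An empty list is the pass condition. Byte-level hashing is strict by design: it also fails on harmless differences such as float formatting, so a package built this way needs deterministic output writers.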

Guided Run — Closing the Loop: the research that improves itself

Field Terminal — session: f1-research-loop Claude Code
claude

Field Assignment

Artifact make replicate passes from a fresh clone — loop converged, report regenerated, package self-tests green

Close the loop on the weather-mobility result, then prove the whole system rebuilds from nothing.

  1. Wire the iteration. Write the research loop in your primary tool — the /loop prompt that runs fleet → referee → triage → re-run → report, or the Goal Mode / @codex issue loop with the same stages. State the convergence condition (worst severity 0), the human gate on HIGH findings, and the hard budget cap explicitly.
  2. Run it to convergence. Let it iterate. When the referee files the HIGH finding — the endogenous control from D4 — the loop must STOP and route it to you; approve dropping the post-treatment control, then let the affected lane re-run. Auto-fix the LOW finding. Watch the severity fall and the headline coefficient recover from +0.004 to +0.009.
  3. Regenerate the report. make report rebuilds it from the current artifacts — one component, regenerated every iteration. You own the abstract and the claims; the agents draft the rest from the figures and tables. The headline map and the elasticity table are exhibits the build pulls in, not prose you retype.
  4. Build the replication package. make replicate runs the analysis headlessly from a clean checkout and proves it reproduces — the E1 self-test, now gating the whole paper.
  5. Account for it. Total the roster, read the failure log, and print the Lab Charter. The loop’s own record — how many iterations it ran, how far severity fell, how many times it paused for you — is part of the honest accounting, not a victory lap.

make replicate is the capstone milestone: it passes only when the loop converged (worst severity 0), the report regenerated from the artifacts, and the package rebuilds the result from a fresh clone. Passing it means the lasting artifact is no longer the last clean table — it is the system that keeps producing clean tables after you have gone home.

Milestone gate · make replicate · advances F1
  1. /loop running the whole iteration as a recurring prompt, or Goal Mode / an @codex issue loop with the same stages. The cap is a number you commit to before the loop starts.

  2. The mechanical LOW finding (a rename) auto-fixed; only the claim-changing HIGH paused for a person. Severity fell 2 → 0 across the iterations.

  3. The tight CI was the bug, not precision: the same-hour demand control absorbed the variance precip works through.

  4. The headline map and the elasticity table are exhibits the build pulls in, not prose you retype.

  5. The E1 fresh-clone self-test, now gating the whole paper.

  6. The honest column is the one that earns trust — a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    A loop with no budget cap is a grant-shredder. A referee will always find something, and an objective framed as “keep improving” has no natural fixed point — so the loop iterates on a result that stopped moving three passes ago, billing tokens the whole way. The cap is a number you commit to before the loop starts; it is the only stopping rule that holds when convergence does not.

  • [both]

    Gating everything is just doing the work yourself; gating nothing lets the agent rewrite your claims. The system earns its keep only at the line: mechanical findings (renames, broken references) proceed unattended, substantive ones (dropping a spec, the abstract’s wording) pause for a person. Mislabel a claim-changing decision as mechanical and the loop will quietly edit what the paper asserts; mislabel a rename as substantive and you have rebuilt the Sunday chain you were trying to escape.

  • [both]

    Convergence is not correctness. A loop converges when the referee runs out of findings — which it can reach honestly (the result improved) or by theatre (the inconvenient specification was dropped, the sample was shrunk, the claim was softened). Read why it converged, not just that it did: the loop’s record should show severity falling because findings were fixed, with a human in the loop for every fix that changed a claim.

  • [CC]

    Burying the whole iteration in one /loop prompt makes it un-auditable. If the fleet fan-out, the referee call, and the triage rule are an opaque paragraph, you cannot tell a real convergence from a talked-past one. Keep the loop prompt a readable checklist with the stages named, the gate explicit, and the cap visible — the loop is yours to re-read, the way D4’s workflow script was.

  • [CX]

    An @codex issue loop that opens a PR per iteration but auto-merges them has removed the review surface it exists to provide. The PR is the gate; merge it yourself, especially the iteration that touches a claim. A goal run with abort clauses you never see fire is a goal run you are not actually supervising.

Check Your Bearings

F1 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

The frontier — and where you go next

The system you built is composed of primitives that are still moving. /loop and Goal Mode are converging on the same shape; @codex issue loops and Routines are both crossing into scheduled, reviewable autonomy; referee skills are getting better at demanding evidence. Treat the loop you wrote as a design you will revise, not a finished tool — the contracts between the parts are the durable thing; the primitives driving them will change under you.

Watch this space

The vendors are converging on the research loop from both ends — Claude’s local /loop plus Routines toward scheduled autonomy, Codex’s Goal Mode plus @codex toward reviewable cloud iteration. The architecture in this lesson — fleet under contract, isolated referee, headless re-run, regenerated report, human gate on claims — outlives whichever primitive wins, because it is project discipline, not a feature.

The system is also domain-agnostic. Nothing in the loop, the contracts, or the gate is specific to taxis and weather. Swap the panel for hospital admissions and air quality, or store traffic and local events, and the same architecture holds: a fleet under a results contract, a referee that argues from evidence, a report regenerated from artifacts, a human at the claims. The domain-swap is the real test of whether you learned a system or a recipe — port it, and the parts that break are the ones you were leaning on a person to hold together.

The cumulative final exam

This is the whole course, both tools, every question type — sampled across Units A through F. It is not gated; take it when you are ready to see what stuck.

Check Your Bearings

F1 · 14 questions · unlimited retries, no timer
  1. Question 1 · Choose one

    An agent's context window has filled with the back-and-forth of a long exploratory session and its answers are drifting. What is the right move before continuing?

  2. Question 2 · Match the dialects

    Match each tool to the file where its always-on lab manual lives.

    Claude Code's project manual
    Codex's project manual
  3. Question 3 · Read the config

    A pipeline-profile agent is launched to run an overnight estimation. Given the profile below, what happens when it tries to write outside results/?

    profile: pipeline
    writes: [results/, data/processed/]
    network: off
  4. Question 4 · Choose one

    What is the durable value of writing a procedure as a skill rather than re-prompting it each session?

  5. Question 5 · Put in order · dialect check — Claude Code

    A PostToolUse contract hook guards a data write. Order the events from the agent's action to the outcome.

    1. The agent runs a tool that writes data/processed/panel.parquet
    2. The failure surfaces to the agent, which must fix the write before proceeding
    3. The hook exits non-zero on a violation, blocking the step
    4. The hook script checks the schema contract (columns, null rates, row delta)
    5. The PostToolUse hook fires on the completed write
  6. Question 6 · Choose one · stretch

    You want the agent to explore a DuckDB warehouse's schema and build the analysis panel. When is registering an MCP server the right call rather than just handing it a SQL file?

  7. Question 7 · Choose all that apply

    Which statements about giving each parallel workstream its own worktree are correct? (Select all.)

  8. Question 8 · Predict the output · dialect check — Claude Code

    An estimation log updates every ~20 minutes. You supervise it with a /loop set to a 2-minute cadence. What do you wake up to?

    /loop 2m Read the tail of results/logs/run.log. If SEs diverge, the
    likelihood plateaus, or the process died, stop and report. Otherwise
    reply OK and nothing else.
  9. Question 9 · Spot the flaw — mark the suspect lines · stretch

    A robustness specification reads suspiciously precise. One line on the right-hand side is doing something no honest control should. Which line is the flaw?

    # spec s13: precipitation elasticity of log demand
    y = log_pickups
    rhs = [precip_mm]
    rhs += fixed_effects(['zone', 'hour_of_day', 'date'])
    rhs += [log_city]   # same-hour citywide demand
    fit(y, rhs, cluster='zone')
  10. Question 10 · What happens next?

    A reviewer asks whether your result reproduces. You run the analysis headlessly from a clean clone and it builds the panel, fits the models, and writes the figures with no human at the keyboard. What does this prove, and what is the next step?

    $ git clone … weather-mobility && cd weather-mobility
    $ make replicate
    → environment rebuilt from uv.lock
    → agent (headless) built panel, fit models, wrote figures
    → results match the committed manifest hashes. PASS.
  11. Question 11 · Choose one

    You put the lab's monthly data-refresh chore on a schedule. Under the 'report, don't act' discipline, what does the scheduled run do when it finds the new month's file has an unexpected extra column?

  12. Question 12 · Choose one

    Why is a hard budget cap non-negotiable in the research loop, even when a clean convergence condition is also set?

  13. Question 13 · Choose all that apply · stretch

    The F1 system composes the prior units as contracts between parts. Which of these are tool-neutral project discipline that hold regardless of which primitive drives the loop? (Select all.)

  14. Question 14 · Match the dialects

    Match each job in the research loop to how each tool reaches it.

    Claude Code drives the iteration by…
    Codex drives the iteration by…
    Either tool gates a substantive decision by…

The payoff, accounted

A system you cannot account for is a system you cannot trust. The three widgets below are the honest ledger of the whole lab — and “honest” is load-bearing: a role whose lesson you have not finished prints as an OPEN POSITION, not a filled seat.

First, the roster totaled — every position the lab absorbed, with the human hours it would have cost against the agent hours it took, and your own measured hours where you logged them:

The lab, totaled

Every lesson’s roster row, summed into the lab you didn’t hire. The drawer shows these one at a time; here they ink in together.

Lesson | Role absorbed | Est. human-RA | Agent (yours when measured)
A1 | the wall — the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster — but only if you direct the work | the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes — the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon — most of it download wall-clock, not attention
B3 | the data manager who guards the raw files — the person who says no near the master copies | permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist — the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance — est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool — and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch — and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated — the PI who designs the system instead of doing the work | each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Positions absorbed: 0 of 16

The honest column: no interventions logged yet. Your measured hours override these estimates in your own roster.

Then the failure log — every place a human had to step in. This is the column the marketing leaves out and the one that earns trust: the hook that caught schema drift, the referee that caught the endogenous control, the loop iteration that paused for your approval. A lab that hides its interventions is selling something.

The honest column

Every place a human had to step in — what predicted it, which mechanism caught it. The accounting is browsable so the one-person-lab claim stays truthful.

No interventions logged yet. The simulations file these as they happen — and a clean column is itself an honest result.
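A failure log of this kind can be as plain as a list of records. The sketch below assumes a hypothetical `Intervention` record carrying the three things the text names (what happened, what predicted it, which mechanism caught it). The field names, the `summarize` helper, and the two sample entries are illustrative only and are not the course's schema or a log of real runs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Intervention:
    incident: str       # what went wrong
    predicted_by: str   # the lesson or rule that anticipated it
    caught_by: str      # the mechanism that stopped it


def summarize(log: List[Intervention]) -> str:
    """Render the log as one ledger line; an empty log is reported as such."""
    if not log:
        return "No interventions logged yet."
    by_mechanism = Counter(entry.caught_by for entry in log)
    parts = [f"{count} caught by {name}" for name, count in sorted(by_mechanism.items())]
    return f"{len(log)} interventions: " + ", ".join(parts)


# Sample entries, invented for illustration.
log = [
    Intervention("schema drift", "C2", "the hook"),
    Intervention("an endogenous control", "D4", "the referee"),
]
print(summarize([]))
print(summarize(log))
```

Keeping the empty case explicit matters here: a log with no entries should print as an empty log, not as silence.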

And finally the Lab Charter — the printable record of what the system absorbed, what it caught, and how far the loop improved the result unattended. Run the research-loop scenario above first; its two iterations fill the loop’s record here, so the charter can state plainly how many iterations ran, how far severity fell, and how many times a person was gated. That line — the loop ran, the result improved, a human held the claims — is the capstone’s whole argument in one sentence:

Lab Charter

Your One-Person Research Lab

chartered June 12, 2026

A lab of one, doing the work of a hall of specialists — each role absorbed by a feature you installed and can describe. This certificate reports what you have actually done, measured from your own field journal. It is accurate, not aspirational.

The roster: 0 of 9 positions absorbed

  • the data manager: open position

    Unfilled — complete C2 to absorb this role.

  • the methodologist: open position

    Unfilled — complete C1 to absorb this role.

  • the data engineer: open position

    Unfilled — complete C3 to absorb this role.

  • the RA pool: open position

    Unfilled — complete D1 to absorb this role.

  • the overnight RA: open position

    Unfilled — complete D3 to absorb this role.

  • the adviser: open position

    Unfilled — complete D1 to absorb this role.

  • the referee: open position

    Unfilled — complete D4 to absorb this role.

  • the lab manager: open position

    Unfilled — complete E2 to absorb this role.

  • the reproducibility checker: open position

    Unfilled — complete E1 to absorb this role.

Milestones reached

No milestones punched yet — the route begins at A1.

Incidents your system caught

  • schema drift: caught by the hook (not yet reached)
  • an endogenous control: caught by the referee (not yet reached)

The loop’s improvement record

The loop has not yet run — F1’s research-loop scenario fills this record.

This lab did its work with 0 interventions where a human had to step in. The honest column lives in the field journal’s failure log; the one-person-lab claim is the measured output of an automatic system, not a boast.

The human moved up a level. You stopped running the regressions and built the system that runs them, sends the results to something whose job is to doubt them, and waits for you at the one place a person must stand — the claims. The last clean table was never going to be the lasting artifact. The system that keeps producing clean tables, on a loop, refereed, after you have gone home — that is.

Field journal

as of June 2026

Parity note

The research loop is a real philosophical split, and this page teaches it as one rather than papering over it: Claude Code composes the iteration from a local /loop you author and can re-read, with Routines as the scheduled analogue; Codex hands the runtime an objective in Goal Mode or runs the cycle as an @codex issue loop that opens a reviewable PR per iteration. Local scripted recurrence versus managed objective-driven delegation — neither tool offers the other’s primitive natively, and the loop-with-an-iteration-shaped-prompt approximates Goal Mode about as well as a checklist approximates a brief. What is tool-neutral is everything that matters most: the results contract, the isolated referee, the headless re-run, the regenerated report, the budget cap, and the human gate on substantive decisions. Those are project discipline, and they would converge the same result no matter which primitive drove the loop.
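To make "tool-neutral" concrete, here is a minimal sketch of a results-contract check that neither tool needs to know about. Everything in it is an assumption for illustration: the field list in `RESULTS_CONTRACT`, the `contract_violations` helper, and the two sample results (including their numbers) are invented, not the course's actual contract.

```python
from typing import Dict, List

# Hypothetical contract: the fields every lane must report, with their types.
RESULTS_CONTRACT: Dict[str, type] = {
    "spec_id": str,
    "estimate": float,
    "std_error": float,
    "n_obs": int,
    "data_hash": str,   # provenance: which panel the lane actually read
}


def contract_violations(result: dict) -> List[str]:
    """Return human-readable violations; an empty list means the result conforms."""
    problems = []
    for name, expected in RESULTS_CONTRACT.items():
        if name not in result:
            problems.append(f"missing field: {name}")
        elif not isinstance(result[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


# Made-up lane outputs: one conforming, one not.
good = {"spec_id": "s13", "estimate": -0.12, "std_error": 0.03,
        "n_obs": 1000, "data_hash": "abc123"}
bad = {"spec_id": "s13", "estimate": "n/a", "n_obs": 1000}

print(contract_violations(good))
print(contract_violations(bad))
```

Whichever primitive drives the iteration, a check like this sits between the fleet and the referee, so a malformed lane result is rejected before anyone argues from it.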

Ledger — F1

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall — the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster — but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes — the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated — the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.