D3 advanced ~45 min

Overnight Work: Loops, Goals & Background Runs

Absorbs: the overnight RA

Advances D3

The Pain

It is 1:14 a.m. and you are watching a log file scroll. The event-study with the full fixed-effects structure has been running since dinner; the bootstrap after it will run until breakfast. Every twenty minutes you alt-tab from the paper you are failing to read, tail the last hundred lines, and perform the same three checks: standard errors not exploding, likelihood still moving, disk not full. Every twenty minutes the answer is the same — fine, fine, fine — and you set another silent alarm in your head and fail to read the next page.

You cannot leave, because the one time you left, the run died at the 2 a.m. mark on a malformed storm episode and you found it at nine, seven hours of cluster time gone and the day’s plan with it. So you stay. The watching requires no judgment beyond a threshold and no skill beyond patience, which is precisely what makes it intern work — the kind a lab hands to whoever is newest, with a phone number to call if a number looks wrong. There is no intern. There is a phone number, and it is yours, and the person who would answer it is also the person whose eyes are closing now, at 1:36 a.m., as the log scrolls on: fine, fine, fine.

Why / When

The long runs in any empirical project — the estimation with the serious fixed-effects structure, the bootstrap, the overnight robustness batch — need a watcher, not a researcher. What the watcher does is checkable: read the log, compare against thresholds, escalate on divergence, stay quiet otherwise. This lesson hands that role to the machine in two philosophically different ways, and the difference is the lesson. One tool gives you a recurring check: you define what “fine” looks like, and the agent re-verifies it on a cadence, escalating when it stops being true. The other gives you a destination: you define what “done” looks like, and the agent drives toward it for hours, making its own intermediate decisions. Check versus destination; supervision versus delegation. Both absorb the overnight RA; they fail differently, and choosing between them is a design decision you will make again at every scale in this unit.

This lesson accelerates the estimation and robustness stages of the pipeline: the parts that take wall-clock hours, not thought-hours.

Contrary winds

Not for: anything that finishes inside a coffee break. Supervision has overhead, and a five-minute job is cheaper to watch yourself.

Mechanics

Field note

This is an orchestration lesson, not a statistics lesson — there is nothing language-specific in it. The estimation scripts being supervised can be Python or R; the supervision patterns are identical, which is why this page declares no R variants.

This is the run you are leaving overnight — the event-study whose log scrolled past you in the Pain vignette. The supervision exists to bring back exactly this, intact, by morning:

Storm onset, hour by hour
Event-study of citywide log demand around 7 heavy-rain onsets, −6 to +12 hours (k = −1 the reference). Pre-onset coefficients sit on zero — no pre-trend, the identification check the adviser/referee demands — before a modest dip through the storm hours. Illustrative run on the course's data slice.
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · d3-event-study

citywide_hourly 2,159 rows

SELECT ts_local, sum(pickups) AS p, any_value(precipitation) AS pr FROM panel_zone_hour GROUP BY 1 ORDER BY 1

event_window

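k | beta | se | ci_lo | ci_hi
-6 | -0.17 | 0.31 | -0.79 | 0.44
-5 | 0.01 | 0.29 | -0.57 | 0.59
-4 | -0 | 0.33 | -0.66 | 0.65
-3 | -0.01 | 0.3 | -0.61 | 0.58
-2 | -0.02 | 0.33 | -0.68 | 0.63
-1 | 0 | 0 | 0 | 0
0 | -0.07 | 0.33 | -0.72 | 0.58
1 | -0.07 | 0.3 | -0.66 | 0.52
2 | -0.02 | 0.32 | -0.65 | 0.61
3 | -0.06 | 0.29 | -0.63 | 0.52
4 | -0.03 | 0.31 | -0.64 | 0.57
5 | -0.11 | 0.31 | -0.73 | 0.5
6 | -0.08 | 0.3 | -0.66 | 0.51
7 | -0.14 | 0.33 | -0.77 | 0.5
8 | -0.15 | 0.3 | -0.74 | 0.44
9 | -0.17 | 0.33 | -0.83 | 0.48
10 | -0.04 | 0.29 | -0.61 | 0.54
11 | 0.21 | 0.33 | -0.44 | 0.86
12 | 0.3 | 0.29 | -0.27 | 0.86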
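
The flat pre-trend can be sanity-checked from the pre-onset rows alone. The sketch below sums squared t-ratios for k = -6 to -2 and compares the total with the 5% chi-square critical value for five degrees of freedom. It assumes the five estimates are independent, which the real joint test does not: a proper Wald test needs the full covariance matrix from the estimation. Treat it as a back-of-envelope check, not the pipeline's test. The function and constant names are hypothetical.

```python
# Pre-onset rows copied from the event_window table: (k, beta, se).
PRE_ONSET = [
    (-6, -0.17, 0.31),
    (-5, 0.01, 0.29),
    (-4, -0.0, 0.33),
    (-3, -0.01, 0.30),
    (-2, -0.02, 0.33),
]

# 95th percentile of a chi-square distribution with 5 degrees of freedom.
CHI2_CRIT_5DF_5PCT = 11.07


def naive_pretrend_stat(rows):
    """Sum of squared t-ratios: a Wald statistic under ASSUMED independence."""
    return sum((beta / se) ** 2 for _, beta, se in rows)


stat = naive_pretrend_stat(PRE_ONSET)
pretrend_flag = stat > CHI2_CRIT_5DF_5PCT
print(f"stat={stat:.3f} flagged={pretrend_flag}")
```

On these rows the statistic is far below the critical value, which is the numeric version of "pre-onset coefficients sit on zero".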

onsets

onset_1 2024-02-13 01:00:00
onset_2 2024-02-26 10:00:00
onset_3 2024-02-27 19:00:00
onset_4 2024-06-05 21:00:00
onset_5 2024-06-14 14:00:00
onset_6 2024-06-22 14:00:00
onset_7 2024-06-22 21:00:00

honesty note Illustrative run on the course's data slice. Onset = precipitation ≥ 1 mm/h after three dry hours (<0.2 mm), with a full −6..+12 window inside the sample: 7 events. Event-time dummies on stacked windows, absorbing event and hour-of-day fixed effects; k = −1 is the omitted reference. CIs are wide — honest at seven events; the point of the figure is the FLAT pre-trend, not the precision of the post estimates.
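
The onset rule in the honesty note is mechanical enough to write down. What follows is a minimal sketch, assuming an hourly precipitation series with no gaps; `find_onsets` and its parameter names are hypothetical and are not taken from figures-pipeline.

```python
def find_onsets(precip, wet=1.0, dry=0.2, dry_hours=3, pre=6, post=12):
    """Return indices of hours that qualify as heavy-rain onsets.

    precip: hourly precipitation in mm, one value per hour, no gaps (assumed).
    """
    onsets = []
    for t, mm in enumerate(precip):
        if mm < wet:
            continue
        # The full -pre..+post window must lie inside the sample.
        if t - pre < 0 or t + post >= len(precip):
            continue
        # Guard the slice below against wrapping around the start of the list.
        if t < dry_hours:
            continue
        # Require dry_hours consecutive dry hours immediately before the onset.
        if all(p < dry for p in precip[t - dry_hours:t]):
            onsets.append(t)
    return onsets


# Illustrative series: only hour 10 satisfies every condition.
series = [0.0] * 30
series[2] = 3.0    # wet, but the -6 window falls before the sample starts
series[10] = 2.0   # wet after three dry hours, full window inside the sample
series[11] = 1.5   # wet, but the previous hour was not dry
series[25] = 4.0   # wet, but the +12 window runs past the sample end
print(find_onsets(series))
```

Under this rule a second wet hour inside the same storm is never a new onset, because the three preceding hours are not all dry.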

The two tools do not implement the same primitive here, and this page will not pretend they do. Claude Code composes small local pieces — a recurring prompt plus background tasks. Codex offers a managed, objective-driven run. Each gets its full native treatment below; the translation guide afterward maps intents, not syntax.

Claude Code Your tool

/loop — you define the check

/loop is a recurring, self-paced prompt: the agent re-runs your check on an interval (or paces itself when you omit one), acts on what it finds, and goes back to sleep. The estimation itself runs as a background task; the loop is the night nurse who reads its chart.

the night shift, in two lines
> Run scripts/estimate_event_study.py --all-episodes as a background task,
logging to results/logs/event_study.log.
> /loop 15m Read the tail of results/logs/event_study.log. If standard
errors diverge, the likelihood plateaus for more than 45 minutes, or
the process has died, stop and report what you saw with the relevant
log lines. Otherwise reply OK and nothing else.

The whole design burden is in that second prompt, and it is your-check-shaped: you must be able to write down what “fine” looks like before you go to bed. Three properties make a loop prompt work:

  • Checkable — thresholds the agent can verify from the log, not vibes (“looks healthy”) it must hallucinate an opinion about.
  • Bounded — the loop may read logs and report; it does not get to fix the estimation at 3 a.m. without you. Repairs are a morning decision.
  • Quiet by default — “reply OK and nothing else” is load-bearing; a supervisor that narrates every check trains you to ignore it, which re-creates the 2 a.m. warning problem C2 solved.
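
To make "checkable" concrete, here is a minimal sketch of the three checks as a pure function over log lines. The log line format (`t_min=… loglik=… max_se=…`), the `se_limit` threshold, and the function name are all assumptions for illustration; the real log written by scripts/estimate_event_study.py may look nothing like this. The agent performs this check by reading the log, not by running this code.

```python
import re

# Hypothetical log line format, assumed for this sketch:
#   t_min=120 loglik=-1523.41 max_se=0.33
PROGRESS = re.compile(
    r"t_min=(?P<t>\d+)\s+loglik=(?P<ll>-?\d+(?:\.\d+)?)\s+max_se=(?P<se>\d+(?:\.\d+)?)"
)


def check_tail(lines, process_alive, se_limit=10.0, plateau_min=45, tol=1e-4):
    """Return "OK" or a short STOP report, mirroring the loop prompt."""
    if not process_alive:
        return "STOP: process has died"

    records = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            records.append(
                (int(match["t"]), float(match["ll"]), float(match["se"]))
            )
    if not records:
        return "STOP: no progress lines found in the log tail"

    t_now, ll_now, se_now = records[-1]
    if se_now > se_limit:
        return f"STOP: standard errors diverging (max_se={se_now})"

    # Walk backward to find how long the likelihood has sat at its current value.
    stalled_since = t_now
    for t, ll, _ in reversed(records):
        if abs(ll - ll_now) > tol:
            break
        stalled_since = t
    if t_now - stalled_since > plateau_min:
        return f"STOP: likelihood flat for {t_now - stalled_since} minutes"

    return "OK"
```

Every branch returns either the single word OK or a STOP line with the number that triggered it, which is the quiet-by-default property in code form.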

The composability is the point: the same loop pattern watches a download, a fleet (D4), or a CI queue. You are not buying an overnight feature; you are writing the night nurse’s checklist yourself.

Codex Your tool

Goal Mode — you define the destination

Goal Mode (GA) is an objective-driven run: you hand the agent a destination and stopping rules, and it drives toward them for hours — choosing intermediate steps, recovering from failures, deciding for itself what to try next.

the overnight objective
Goal: estimate the event-study across all storm episodes in the panel.
For each episode, fit the main specification and the three robustness
variants. Flag any specification whose pre-period coefficients are
jointly significant at the 5% level (a failed pre-trend test). Write
per-episode results to
results/event_study/ under the results contract. Stop and report
immediately if anything smells like leakage — a regressor dated after
the outcome, a filter that references the treatment window. Hard stop
at 6 hours.

The design burden here is destination-shaped: you are not writing the checks, you are writing the success criteria and — more importantly — the abort criteria. Three properties make a goal survivable:

  • A measurable objective — “all episodes estimated, pre-trends flagged” is verifiable at 7 a.m.; “improve the robustness” is an invitation to creative accounting.
  • Explicit stopping rules — the leakage clause and the hard stop are not decoration. An objective-driven agent without abort criteria optimizes through the night, including through things it should have woken you for.
  • A contained blast radius — run it in a D2 worktree under the B3 pipeline profile, writing only under results/. The morning review is then a diff, not an investigation.
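
The two stopping rules in the brief can be stated as predicates, which is a useful test of whether they are explicit enough. This is a sketch under stated assumptions: the row fields `regressor_ts` and `outcome_ts` and both function names are hypothetical, and Goal Mode evaluates your written brief, not this code.

```python
from datetime import datetime, timedelta


def leakage_violations(rows):
    """Return rows whose regressor is dated after the outcome it explains."""
    return [row for row in rows if row["regressor_ts"] > row["outcome_ts"]]


def abort_reason(started_at, now, rows, hard_stop=timedelta(hours=6)):
    """Return a reason to stop, or None if the run may continue."""
    leaks = leakage_violations(rows)
    if leaks:
        return f"possible leakage in {len(leaks)} row(s)"
    if now - started_at >= hard_stop:
        return "hard stop reached"
    return None


start = datetime(2024, 6, 22, 22, 0)
clean = [{"regressor_ts": datetime(2024, 6, 22, 13), "outcome_ts": datetime(2024, 6, 22, 14)}]
leaky = [{"regressor_ts": datetime(2024, 6, 22, 15), "outcome_ts": datetime(2024, 6, 22, 14)}]

print(abort_reason(start, start + timedelta(hours=2), clean))
print(abort_reason(start, start + timedelta(hours=2), leaky))
print(abort_reason(start, start + timedelta(hours=6), clean))
```

If a rule in your brief cannot be reduced to a predicate like these, it is probably not yet a stopping rule.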

The managed delegation is the point: hours of unattended, multi-step progress with one written brief — closer to handing a project to a senior RA than a checklist to a junior one.

Translation guide
Intent | Claude Code | Codex
supervise a long-running job | /loop (recurring self-paced check) + a background task running the job | Goal Mode (objective-driven multi-hour run)
scheduled autonomous work | Routines (cloud; cron / GitHub / API triggers) | Codex cloud tasks + GitHub integration
background runs | local background tasks, composable with loops and workflows | cloud tasks in isolated containers, each returning a reviewable diff

The decision rubric

The decision rule is shorter than the agonizing: if you can write the check, write a loop — a bounded, checkable condition wants recurring verification. If you can only write the objective, write a goal — an open-ended optimization with a measurable endpoint wants delegation. If you can write neither, you are not ready to run it overnight, and no tool fixes that.

You can write down… | The run is | Reach for | Because the burden is
what “fine” looks like, checkable from a log | one fixed job to babysit | a loop watching a background task | a recurring check you author once
what “done” looks like, plus what must abort it | open-ended, multi-step | an objective-driven goal run | a destination with explicit stopping rules
neither, honestly | not ready | nothing — write the check or the objective first | judgment you have not yet externalized
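
The rubric is small enough to write as a function, which makes its ordering explicit: the check is tried first, the objective second. The function name and return labels are hypothetical.

```python
def choose_supervision(can_write_check, can_write_objective):
    """Apply the decision rubric; the check takes precedence over the objective."""
    if can_write_check:
        return "loop"
    if can_write_objective:
        return "goal"
    return "not ready"


print(choose_supervision(True, True))
print(choose_supervision(False, True))
print(choose_supervision(False, False))
```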

The two transcripts in the scenario below are the same night run under each philosophy. Read them side by side once before you run your own: watch where the loop chooses to speak and where the goal run chooses to decide alone, because those two moments are where each philosophy’s signature failure lives.

Both philosophies have a signature failure mode, and each is the other’s mirror:

  • Loops spam. A check every two minutes on a log that updates every twenty produces thirty token-burning “OK”s an hour, nine in ten of them re-reading an unchanged log, and a supervisor you stop reading. Match cadence to how fast the watched thing actually changes.
  • Goals game. An objective-framed prompt invites metric gaming — the specification that “passes” pre-trends because the sample quietly shrank. This is not a hypothetical; it is why D4 pairs every goal-shaped run with an adversarial referee.
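
The cadence arithmetic behind the first failure mode can be made explicit. The sketch below counts checks against log updates over a night; the eight-hour night is an assumed figure, and the function name is hypothetical.

```python
def cadence_report(check_every_min, log_updates_every_min, night_hours=8):
    """Return (checks, log updates, checks that saw nothing new) for one night."""
    minutes = night_hours * 60
    checks = minutes // check_every_min
    updates = minutes // log_updates_every_min
    redundant = max(checks - updates, 0)
    return checks, updates, redundant


print(cadence_report(2, 20))   # (240, 24, 216)
print(cadence_report(15, 20))  # (32, 24, 8)
```

At a two-minute cadence roughly nine checks in ten re-read an unchanged log; at fifteen minutes the overlap mostly disappears.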

And both run under the same discipline regardless of tool: the B3 pipeline profile (writes scoped to results/ and data/processed/), inside a D2 worktree, so that whatever happens at 3 a.m. happens in a sandbox your morning self can diff, merge, or delete.
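
The write scope is easy to state as a predicate. This sketch only illustrates the rule "writes land under results/ or data/processed/ and nowhere else". It is not how the B3 profile enforces the rule, and the function name is hypothetical. Paths are treated as relative to the worktree root.

```python
import posixpath
from pathlib import PurePosixPath

ALLOWED_ROOTS = ("results", "data/processed")


def is_write_allowed(path, allowed_roots=ALLOWED_ROOTS):
    """True if a worktree-relative path stays inside an allowed directory."""
    # Collapse "." and ".." segments so "results/../data/raw" cannot sneak through.
    normalized = PurePosixPath(posixpath.normpath(path))
    if normalized.is_absolute():
        return False
    for root in allowed_roots:
        root_path = PurePosixPath(root)
        if normalized == root_path or root_path in normalized.parents:
            return True
    return False


print(is_write_allowed("results/event_study/episode_1.json"))
print(is_write_allowed("data/raw/trips.parquet"))
print(is_write_allowed("results/../data/raw/trips.parquet"))
```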

Guided Run — The Night Shift: a loop you can write down


Guided Run — The Night Shift: a destination you can write down


Field Assignment

Artifact make check-d3 passes — both overnight transcripts filed and compared

Run the project’s two long jobs overnight, one under each philosophy, and bring back the transcripts. This is deliberately a both-tools exercise: the comparison is the deliverable.

  1. [CC] Start the main event-study estimation as a background task in a fresh worktree under the pipeline profile; supervise it with a loop prompt that checks divergence, plateau, and process death on a 15-minute cadence, quiet otherwise.
  2. [CX] The same evening, hand the robustness batch to Goal Mode with a measurable objective, the leakage abort clause, and a hard time stop, in its own worktree.
  3. In the morning, read both transcripts before touching results: when did the loop speak and was it right to; what did the goal run decide alone and would you have decided the same.
  4. File both transcripts and a one-page comparison in journal/ — which philosophy you’d trust with what, and why. Then make check-d3.

The comparison memo feeds D4 directly: the fleet you are about to run is goal-shaped work at scale, and the referee exists because of what you noticed in step 3.

Milestone gate · make check-d3 · advances D3
  1. The loop prompt must be checkable, bounded, and quiet by default.

  2. Writes scoped to results/ and data/processed/ — the 3 a.m. failure must be a deletable directory.

  3. When did the loop speak; what did the goal run decide alone; which interventions were yours.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Objective-framed prompts invite metric gaming. “Make the pre-trends pass” can be satisfied by shrinking the sample, re-binning the event window, or dropping the inconvenient episodes — all of which look like diligence in a transcript. Pair every goal-shaped run with adversarial review (D4’s referee), and write objectives about what to estimate, never about what the answer should look like.

  • [CC]

    Over-frequent loop checks burn tokens for nothing and bury the one report that matters under a night of identical OKs. The cadence question is empirical: how fast does the watched thing change? Estimation logs move in tens of minutes, not seconds.

  • [both]

    Long runs outside a worktree leave half-written state in your main tree when they die at 3 a.m. — and they will, eventually, die at 3 a.m. The worktree is not optional hygiene; it is what makes the overnight failure a deletable directory instead of a forensic morning.
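
One cheap guard against the first pitfall is to compare sample sizes across specifications in the morning, before reading any coefficient. This is a minimal sketch: the specification names, the observation counts, and the 2% tolerance are all illustrative assumptions, and a clean result does not replace the referee's review.

```python
def shrinkage_flags(n_obs_by_spec, baseline="main", tolerance=0.02):
    """Return specs whose sample is smaller than the baseline by more than tolerance."""
    base = n_obs_by_spec[baseline]
    floor = base * (1 - tolerance)
    return {spec: n for spec, n in n_obs_by_spec.items() if n < floor}


# Illustrative counts only; not results from the course's run.
counts = {"main": 133, "robust_a": 133, "robust_b": 95}
print(shrinkage_flags(counts))
```

A specification that "passes" pre-trends while appearing in this dictionary is the first one to hand to the referee.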

Check Your Bearings

D3 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

as of June 2026

Parity note

There is no isomorphism here and this page hasn’t pretended otherwise: Claude Code composes local primitives — a recurring loop, background tasks, cloud Routines for scheduled work — while Codex delegates to a managed objective-driven run, with cloud tasks as the scheduled analogue. Neither tool offers the other’s primitive natively; the loop-with-an-objective-framed-prompt approximates Goal Mode about as well as a checklist approximates a brief, which is to say usefully and incompletely. The asymmetry is a design philosophy, not a feature gap, and it is taught as such above.

Ledger — D3

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.


Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall — the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster — but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes — the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated — the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall — the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster — but only if you direct the work | the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes — the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon — most of it download wall-clock, not attention
B3 | the data manager who guards the raw files — the person who says no near the master copies | permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist — the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance — est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool — and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch — and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated — the PI who designs the system instead of doing the work | each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.