D3 · advanced · ~45 min

Overnight Work: Loops, Goals & Background Runs

Absorbs: the overnight RA

Advances D3

The Pain

It is 1:14 a.m. and you are watching a log file scroll. The event-study with the full fixed-effects structure has been running since dinner; the bootstrap after it will run until breakfast. Every twenty minutes you alt-tab from the paper you are failing to read, tail the last hundred lines, and perform the same three checks: standard errors not exploding, likelihood still moving, disk not full. Every twenty minutes the answer is the same — fine, fine, fine — and you set another silent alarm in your head and fail to read the next page.

You cannot leave, because the one time you left, the run died at the 2 a.m. mark on a malformed storm episode and you found it at nine, seven hours of cluster time gone and the day’s plan with it. So you stay. The watching requires no judgment beyond a threshold and no skill beyond patience, which is precisely what makes it intern work — the kind a lab hands to whoever is newest, with a phone number to call if a number looks wrong. There is no intern. There is a phone number, and it is yours, and the person who would answer it is also the person whose eyes are closing now, at 1:36 a.m., as the log scrolls on: fine, fine, fine.

Why / When

The long runs in any empirical project — the estimation with the serious fixed-effects structure, the bootstrap, the overnight robustness batch — need a watcher, not a researcher. What the watcher does is checkable: read the log, compare against thresholds, escalate on divergence, stay quiet otherwise. This lesson hands that role to the machine in two philosophically different ways, and the difference is the lesson. One tool gives you a recurring check: you define what “fine” looks like, and the agent re-verifies it on a cadence, escalating when it stops being true. The other gives you a destination: you define what “done” looks like, and the agent drives toward it for hours, making its own intermediate decisions. Check versus destination; supervision versus delegation. Both absorb the overnight RA; they fail differently, and choosing between them is a design decision you will make again at every scale in this unit.

It accelerates the estimation and robustness stages of the pipeline — the parts that take wall-clock hours, not thought-hours.

Contrary winds

Not for: anything that finishes inside a coffee — supervision has overhead, and a five-minute job is cheaper to watch yourself.

Mechanics

Field note

This is an orchestration lesson, not a statistics lesson — there is nothing language-specific in it. The estimation scripts being supervised can be Python or R; the supervision patterns are identical, which is why this page declares no R variants.

This is the run you are leaving overnight — the event-study whose log scrolled past you in the Pain vignette. The supervision exists to bring back exactly this, intact, by morning:

Storm onset, hour by hour: event-study of citywide log demand around 7 heavy-rain onsets, −6 to +12 hours (k = −1 is the reference). Pre-onset coefficients are small and every interval covers zero, so there is no visible pre-trend (the identification check the adviser or referee demands). Point estimates dip modestly through the storm hours, though those intervals also cover zero. Illustrative run on the course's data slice.

k	beta	se	ci_lo	ci_hi
-6	-0.17	0.31	-0.79	0.44
-5	0.01	0.29	-0.57	0.59
-4	-0.00	0.33	-0.66	0.65
-3	-0.01	0.30	-0.61	0.58
-2	-0.02	0.33	-0.68	0.63
-1	0.00	0.00	0.00	0.00
0	-0.07	0.33	-0.72	0.58
1	-0.07	0.30	-0.66	0.52
2	-0.02	0.32	-0.65	0.61
3	-0.06	0.29	-0.63	0.52
4	-0.03	0.31	-0.64	0.57
5	-0.11	0.31	-0.73	0.50
6	-0.08	0.30	-0.66	0.51
7	-0.14	0.33	-0.77	0.50
8	-0.15	0.30	-0.74	0.44
9	-0.17	0.33	-0.83	0.48
10	-0.04	0.29	-0.61	0.54
11	0.21	0.33	-0.44	0.86
12	0.30	0.29	-0.27	0.86

onset_1	2024-02-13 01:00:00
onset_2	2024-02-26 10:00:00
onset_3	2024-02-27 19:00:00
onset_4	2024-06-05 21:00:00
onset_5	2024-06-14 14:00:00
onset_6	2024-06-22 14:00:00
onset_7	2024-06-22 21:00:00
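One way to read the "no pre-trend" claim off the table is a joint test on the pre-onset coefficients. The sketch below is a rough check, not the course's estimator. It assumes the five pre-period estimates are uncorrelated, because the table reports no covariance matrix, so the Wald statistic collapses to a sum of squared t-ratios. The function name and the hard-coded critical value are illustrative choices.

```python
# Rough joint pre-trend check using the table above.
# ASSUMPTION: coefficients are treated as uncorrelated because no covariance
# matrix is reported; a real test would use the full matrix.

PRE_PERIOD = {  # k: (beta, se), copied from the table for k = -6..-2
    -6: (-0.17, 0.31),
    -5: (0.01, 0.29),
    -4: (-0.00, 0.33),
    -3: (-0.01, 0.30),
    -2: (-0.02, 0.33),
}

# 95th percentile of a chi-square distribution with 5 degrees of freedom.
CHI2_CRIT_5DF_5PCT = 11.070


def pretrend_wald(coefs):
    """Sum of squared t-ratios: the Wald statistic under independence."""
    return sum((beta / se) ** 2 for beta, se in coefs.values())


stat = pretrend_wald(PRE_PERIOD)
flag = stat > CHI2_CRIT_5DF_5PCT  # True would mean "pre-trend detected"
print(f"Wald = {stat:.3f}, flag pre-trend: {flag}")
```

On these numbers the statistic sits far below the critical value, which is what "no pre-trend" means in testable form.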

The two tools do not implement the same primitive here, and this page will not pretend they do. Claude Code composes small local pieces — a recurring prompt plus background tasks. Codex offers a managed, objective-driven run. Each gets its full native treatment below; the translation guide afterward maps intents, not syntax.

Claude Code

/loop — you define the check

/loop is a recurring, self-paced prompt: the agent re-runs your check on an interval (or paces itself when you omit one), acts on what it finds, and goes back to sleep. The estimation itself runs as a background task; the loop is the night nurse who reads its chart.

> Run scripts/estimate_event_study.py --all-episodes as a background task,
  logging to results/logs/event_study.log.

> /loop 15m Read the tail of results/logs/event_study.log. If standard
  errors diverge, the likelihood plateaus for more than 45 minutes, or
  the process has died, stop and report what you saw with the relevant
  log lines. Otherwise reply OK and nothing else.

The whole design burden is in that second prompt, and it is your-check-shaped: you must be able to write down what “fine” looks like before you go to bed. Three properties make a loop prompt work:

Checkable — thresholds the agent can verify from the log, not vibes (“looks healthy”) it must hallucinate an opinion about.
Bounded — the loop may read logs and report; it does not get to fix the estimation at 3 a.m. without you. Repairs are a morning decision.
Quiet by default — “reply OK and nothing else” is load-bearing; a supervisor that narrates every check trains you to ignore it, which re-creates the 2 a.m. warning problem C2 solved.
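To make "checkable" concrete, here is a minimal sketch of the three checks as plain code. Everything about the log is an assumption: the line format, the field names loglik and max_se, and the divergence limit of 5.0 are hypothetical stand-ins for whatever your estimation script actually writes. The agent does not need this script to run a loop; it is only a way to confirm that your thresholds can be verified from the log at all.

```python
import re
from datetime import datetime, timedelta

# HYPOTHETICAL log line format, for illustration only:
#   2024-06-22T01:20:00 loglik=-10432.51 max_se=0.33
LINE = re.compile(r"^(\S+) loglik=(-?\d+(?:\.\d+)?) max_se=(\d+(?:\.\d+)?)$")


def check_log(lines, process_alive, se_limit=5.0,
              plateau=timedelta(minutes=45), tol=1e-4):
    """Return 'OK' or a short report that quotes the relevant log line."""
    if not process_alive:
        last = lines[-1].strip() if lines else "<empty log>"
        return f"ALERT process died; last line: {last}"

    rows = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:  # skip lines that are not progress lines
            rows.append((datetime.fromisoformat(m.group(1)),
                         float(m.group(2)), float(m.group(3)), line.strip()))
    if not rows:
        return "ALERT no parseable progress lines in the log tail"

    ts, loglik, max_se, raw = rows[-1]
    if max_se > se_limit:
        return f"ALERT standard errors diverging: {raw}"

    # Walk backwards to find how long the likelihood has been flat.
    stalled_since = ts
    for row_ts, row_ll, _, _ in reversed(rows[:-1]):
        if abs(row_ll - loglik) > tol:
            break
        stalled_since = row_ts
    if ts - stalled_since > plateau:
        return f"ALERT likelihood flat since {stalled_since.isoformat()}: {raw}"
    return "OK"


tail = ["2024-06-22T01:00:00 loglik=-100.0 max_se=0.30",
        "2024-06-22T01:20:00 loglik=-90.0 max_se=0.31"]
print(check_log(tail, process_alive=True))
```

Two design notes. The plateau is measured against timestamps inside the log, so the tail you pass must span more than 45 minutes for that check to fire. And the function only reports: it never touches the run, which keeps it bounded in the sense above.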

The composability is the point: the same loop pattern watches a download, a fleet (D4), or a CI queue. You are not buying an overnight feature; you are writing the night nurse’s checklist yourself.

Codex

Goal Mode — you define the destination

Goal Mode (GA) is an objective-driven run: you hand the agent a destination and stopping rules, and it drives toward them for hours — choosing intermediate steps, recovering from failures, deciding for itself what to try next.

Goal: estimate the event-study across all storm episodes in the panel.
For each episode, fit the main specification and the three robustness
variants. Flag any specification where the pre-onset coefficients are
jointly significant at the 5% level (a failed pre-trend check). Write
per-episode results to results/event_study/ under the results contract.
Stop and report immediately if anything smells like leakage: a regressor
dated after the outcome, or a filter that references the treatment
window. Hard stop at 6 hours.

The design burden here is destination-shaped: you are not writing the checks, you are writing the success criteria and — more importantly — the abort criteria. Three properties make a goal survivable:

A measurable objective — “all episodes estimated, pre-trends flagged” is verifiable at 7 a.m.; “improve the robustness” is an invitation to creative accounting.
Explicit stopping rules — the leakage clause and the hard stop are not decoration. An objective-driven agent without abort criteria optimizes through the night, including through things it should have woken you for.
A contained blast radius — run it in a D2 worktree under the B3 pipeline profile, writing only under results/. The morning review is then a diff, not an investigation.
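The two abort clauses in the brief can be written down as code, which is a useful test of whether they are explicit enough. This is a sketch of what the clauses mean, not a description of how Goal Mode enforces them. The exception names, the function names, and the (name, regressor time, outcome time) row shape are all hypothetical.

```python
import time
from datetime import datetime


class HardStop(Exception):
    """Raised when the run exceeds its wall-clock budget."""


class LeakageSuspected(Exception):
    """Raised when a regressor is dated after the outcome it explains."""


def check_deadline(started, budget_hours=6.0, now=None):
    """Abort once the budget is spent (6 hours in the brief above)."""
    now = time.monotonic() if now is None else now
    if now - started > budget_hours * 3600:
        raise HardStop(f"exceeded the {budget_hours} h budget")


def check_leakage(rows):
    """rows: iterable of (regressor_name, regressor_time, outcome_time)."""
    for name, reg_time, out_time in rows:
        if reg_time > out_time:
            raise LeakageSuspected(
                f"{name} dated {reg_time.isoformat()} is after the "
                f"outcome at {out_time.isoformat()}"
            )


# Made-up example row: a regressor observed one hour before the outcome.
started = time.monotonic()
check_deadline(started)
check_leakage([("rain_mm", datetime(2024, 6, 22, 13), datetime(2024, 6, 22, 14))])
print("no abort condition met")
```

If you cannot express an abort clause at roughly this level of precision, the agent is unlikely to apply it consistently at 3 a.m.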

The managed delegation is the point: hours of unattended, multi-step progress with one written brief — closer to handing a project to a senior RA than a checklist to a junior one.

Translation guide
Intent	Claude Code	Codex
supervise a long-running job	/loop (recurring self-paced check) + a background task running the job	Goal Mode (objective-driven multi-hour run)
scheduled autonomous work	Routines (cloud; cron / GitHub / API triggers)	Codex cloud tasks + GitHub integration
background runs	local background tasks, composable with loops and workflows	cloud tasks in isolated containers, each returning a reviewable diff

The decision rubric

The decision rule is shorter than the agonizing: if you can write the check, write a loop — a bounded, checkable condition wants recurring verification. If you can only write the objective, write a goal — an open-ended optimization with a measurable endpoint wants delegation. If you can write neither, you are not ready to run it overnight, and no tool fixes that.

You can write down…	The run is	Reach for	Because the burden is
what “fine” looks like, checkable from a log	one fixed job to babysit	a loop watching a background task	a recurring check you author once
what “done” looks like, plus what must abort it	open-ended, multi-step	an objective-driven goal run	a destination with explicit stopping rules
neither, honestly	not ready	nothing — write the check or the objective first	judgment you have not yet externalized

The two transcripts in the scenario below are the same night run under each philosophy. Read them side by side once before you run your own: watch where the loop chooses to speak and where the goal run chooses to decide alone, because those two moments are where each philosophy’s signature failure lives.

Both philosophies have a signature failure mode, and each is the other’s mirror:

Loops spam. A check every two minutes on a log that updates every twenty runs ten checks per new log line and piles up hundreds of token-burning “OK”s over a night, which leaves you with a supervisor you stop reading. Match cadence to how fast the watched thing actually changes.
Goals game. An objective-framed prompt invites metric gaming — the specification that “passes” pre-trends because the sample quietly shrank. This is not a hypothetical; it is why D4 pairs every goal-shaped run with an adversarial referee.
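The cadence point is simple arithmetic, sketched below. The eight-hour night is an assumption (the text does not fix the length of the run), and the function name is illustrative.

```python
def loop_overhead(check_every_min, log_update_min, night_hours=8):
    """Return (checks per night, checks per log update)."""
    checks = int(night_hours * 60 // check_every_min)
    per_update = log_update_min / check_every_min
    return checks, per_update


print(loop_overhead(2, 20))   # two-minute cadence on a twenty-minute log
print(loop_overhead(15, 20))  # fifteen-minute cadence on the same log
```

A ratio near one check per log update is the target. Anything much higher means the loop is re-reading lines it has already seen.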

And both run under the same discipline regardless of tool: the B3 pipeline profile (writes scoped to results/ and data/processed/), inside a D2 worktree, so that whatever happens at 3 a.m. happens in a sandbox your morning self can diff, merge, or delete.
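The write scope is easy to verify in the morning before you merge anything. The sketch below is a review-time sanity check under stated assumptions, not the mechanism by which the B3 profile enforces the scope. It takes repo-relative paths as strings (for example, the output of git diff --name-only) and reports any that fall outside the two allowed roots. The function name is hypothetical.

```python
from pathlib import PurePosixPath

ALLOWED = ("results", "data/processed")  # the two scopes named in the text


def out_of_scope(changed_paths, allowed=ALLOWED):
    """Return the repo-relative paths that fall outside the allowed roots."""
    roots = [PurePosixPath(a) for a in allowed]
    bad = []
    for p in changed_paths:
        path = PurePosixPath(p)
        # Absolute paths and parent-directory hops are never in scope.
        if path.is_absolute() or ".." in path.parts:
            bad.append(p)
            continue
        if not any(path == r or r in path.parents for r in roots):
            bad.append(p)
    return bad


# Made-up paths for illustration.
changed = [
    "results/event_study/episode_1.csv",
    "data/processed/panel.parquet",
    "data/raw/trips.parquet",
    "../escape.txt",
]
print(out_of_scope(changed))
```

An empty list means the night stayed inside the fence. Anything else is the first thing to read, before any result.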

Guided Run — The Night Shift: a loop you can write down


Guided Run — The Night Shift: a destination you can write down


Field Assignment

Artifact: make check-d3 passes, with both overnight transcripts filed and compared

Run the project’s two long jobs overnight, one under each philosophy, and bring back the transcripts. This is deliberately a both-tools exercise: the comparison is the deliverable.

1. [CC] Start the main event-study estimation as a background task in a fresh worktree under the pipeline profile; supervise it with a loop prompt that checks divergence, plateau, and process death on a 15-minute cadence, quiet otherwise.
2. [CX] The same evening, hand the robustness batch to Goal Mode with a measurable objective, the leakage abort clause, and a hard time stop, in its own worktree.
3. In the morning, read both transcripts before touching results: when did the loop speak and was it right to; what did the goal run decide alone and would you have decided the same.
4. File both transcripts and a one-page comparison in journal/: which philosophy you’d trust with what, and why. Then make check-d3.

The comparison memo feeds D4 directly: the fleet you are about to run is goal-shaped work at scale, and the referee exists because of what you noticed in step 3.

Milestone gate · make check-d3 · advances D3

Main event-study estimation completed overnight as a background run under /loop supervision (CC)
The loop prompt must be checkable, bounded, and quiet by default.
One Goal Mode robustness batch completed overnight (CX) with abort criteria and a hard time stop
Both runs executed inside D2 worktrees under the B3 pipeline profile
Writes scoped to results/ and data/processed/ — the 3 a.m. failure must be a deletable directory.
journal/ holds both transcripts plus the one-page comparison memo
When did the loop speak; what did the goal run decide alone; which interventions were yours.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both] Objective-framed prompts invite metric gaming. “Make the pre-trends pass” can be satisfied by shrinking the sample, re-binning the event window, or dropping the inconvenient episodes, all of which look like diligence in a transcript. Pair every goal-shaped run with adversarial review (D4’s referee), and write objectives about what to estimate, never about what the answer should look like.

[CC] Over-frequent loop checks burn tokens for nothing and bury the one report that matters under sixty OKs. The cadence question is empirical: how fast does the watched thing change? Estimation logs move in tens of minutes, not seconds.

[both] Long runs outside a worktree leave half-written state in your main tree when they die at 3 a.m. (and they will, eventually, die at 3 a.m.). The worktree is not optional hygiene; it is what makes the overnight failure a deletable directory instead of a forensic morning.
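The metric-gaming pitfall has one cheap mechanical guard: compare each specification's sample size against the main specification before believing any result that "passes". The sketch below assumes each result records an observation count. The dictionary shape, the spec names, the 2% tolerance, and the counts are all made up for illustration and are not part of the results contract.

```python
def shrinkage_flags(n_obs_by_spec, baseline="main", max_drop=0.02):
    """Flag specs whose sample is more than max_drop smaller than baseline."""
    base = n_obs_by_spec[baseline]
    flags = {}
    for spec, n in n_obs_by_spec.items():
        drop = (base - n) / base  # fraction of the baseline sample lost
        if spec != baseline and drop > max_drop:
            flags[spec] = round(drop, 3)
    return flags


# Made-up counts: robust_b has quietly lost a quarter of the sample.
counts = {"main": 2000, "robust_a": 1990, "robust_b": 1500}
print(shrinkage_flags(counts))
```

A flagged specification is not necessarily wrong, since some robustness variants legitimately trim the sample. It is a question for the morning review and, later, for the referee.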

Check Your Bearings

D3 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

File both overnight transcripts: every time the loop spoke, every decision the goal run made alone, and which interventions were yours.

as of June 2026

There is no isomorphism here and this page hasn’t pretended otherwise: Claude Code composes local primitives — a recurring loop, background tasks, cloud Routines for scheduled work — while Codex delegates to a managed objective-driven run, with cloud tasks as the scheduled analogue. Neither tool offers the other’s primitive natively; the loop-with-an-objective-framed-prompt approximates Goal Mode about as well as a checklist approximates a brief, which is to say usefully and incompletely. The asymmetry is a design philosophy, not a feature gap, and it is taught as such above.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall: the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster, but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes: the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated: the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.
