The Pain
It is 1:14 a.m. and you are watching a log file scroll. The event-study with the full fixed-effects structure has been running since dinner; the bootstrap after it will run until breakfast. Every twenty minutes you alt-tab from the paper you are failing to read, tail the last hundred lines, and perform the same three checks: standard errors not exploding, likelihood still moving, disk not full. Every twenty minutes the answer is the same — fine, fine, fine — and you set another silent alarm in your head and fail to read the next page.
You cannot leave, because the one time you left, the run died at the 2 a.m. mark on a malformed storm episode and you found it at nine, seven hours of cluster time gone and the day’s plan with it. So you stay. The watching requires no judgment beyond a threshold and no skill beyond patience, which is precisely what makes it intern work — the kind a lab hands to whoever is newest, with a phone number to call if a number looks wrong. There is no intern. There is a phone number, and it is yours, and the person who would answer it is also the person whose eyes are closing now, at 1:36 a.m., as the log scrolls on: fine, fine, fine.
Why / When
The long runs in any empirical project — the estimation with the serious fixed-effects structure, the bootstrap, the overnight robustness batch — need a watcher, not a researcher. What the watcher does is checkable: read the log, compare against thresholds, escalate on divergence, stay quiet otherwise. This lesson hands that role to the machine in two philosophically different ways, and the difference is the lesson. One tool gives you a recurring check: you define what “fine” looks like, and the agent re-verifies it on a cadence, escalating when it stops being true. The other gives you a destination: you define what “done” looks like, and the agent drives toward it for hours, making its own intermediate decisions. Check versus destination; supervision versus delegation. Both absorb the overnight RA; they fail differently, and choosing between them is a design decision you will make again at every scale in this unit.
It accelerates the estimation and robustness stages of the pipeline — the parts that take wall-clock hours, not thought-hours.
Contrary winds
Not for: anything that finishes inside a coffee — supervision has overhead, and a five-minute job is cheaper to watch yourself.
Mechanics
Field note
This is an orchestration lesson, not a statistics lesson — there is nothing language-specific in it. The estimation scripts being supervised can be Python or R; the supervision patterns are identical, which is why this page declares no R variants.
This is the run you are leaving overnight — the event-study whose log scrolled past you in the Pain vignette. The supervision exists to bring back exactly this, intact, by morning:
the numbers behind this figure
citywide_hourly 2,159 rows
SELECT ts_local, sum(pickups) AS p, any_value(precipitation) AS pr FROM panel_zone_hour GROUP BY 1 ORDER BY 1 event_window
| k | beta | se | ci_lo | ci_hi |
|---|---|---|---|---|
| -6 | -0.17 | 0.31 | -0.79 | 0.44 |
| -5 | 0.01 | 0.29 | -0.57 | 0.59 |
| -4 | -0 | 0.33 | -0.66 | 0.65 |
| -3 | -0.01 | 0.3 | -0.61 | 0.58 |
| -2 | -0.02 | 0.33 | -0.68 | 0.63 |
| -1 | 0 | 0 | 0 | 0 |
| 0 | -0.07 | 0.33 | -0.72 | 0.58 |
| 1 | -0.07 | 0.3 | -0.66 | 0.52 |
| 2 | -0.02 | 0.32 | -0.65 | 0.61 |
| 3 | -0.06 | 0.29 | -0.63 | 0.52 |
| 4 | -0.03 | 0.31 | -0.64 | 0.57 |
| 5 | -0.11 | 0.31 | -0.73 | 0.5 |
| 6 | -0.08 | 0.3 | -0.66 | 0.51 |
| 7 | -0.14 | 0.33 | -0.77 | 0.5 |
| 8 | -0.15 | 0.3 | -0.74 | 0.44 |
| 9 | -0.17 | 0.33 | -0.83 | 0.48 |
| 10 | -0.04 | 0.29 | -0.61 | 0.54 |
| 11 | 0.21 | 0.33 | -0.44 | 0.86 |
| 12 | 0.3 | 0.29 | -0.27 | 0.86 |
onsets
| onset_1 | 2024-02-13 01:00:00 |
|---|---|
| onset_2 | 2024-02-26 10:00:00 |
| onset_3 | 2024-02-27 19:00:00 |
| onset_4 | 2024-06-05 21:00:00 |
| onset_5 | 2024-06-14 14:00:00 |
| onset_6 | 2024-06-22 14:00:00 |
| onset_7 | 2024-06-22 21:00:00 |
honesty note Illustrative run on the course's data slice. Onset = precipitation ≥ 1 mm/h after three dry hours (<0.2 mm), with a full −6..+12 window inside the sample: 7 events. Event-time dummies on stacked windows, absorbing event and hour-of-day fixed effects; k = −1 is the omitted reference. CIs are wide — honest at seven events; the point of the figure is the FLAT pre-trend, not the precision of the post estimates.
The two tools do not implement the same primitive here, and this page will not pretend they do. Claude Code composes small local pieces — a recurring prompt plus background tasks. Codex offers a managed, objective-driven run. Each gets its full native treatment below; the translation guide afterward maps intents, not syntax.
Claude Code Your tool
/loop — you define the check
/loop is a recurring, self-paced prompt: the agent re-runs your
check on an interval (or paces itself when you omit one), acts on what
it finds, and goes back to sleep. The estimation itself runs as a
background task; the loop is the night nurse who reads its chart.
> Run scripts/estimate_event_study.py --all-episodes as a background task, logging to results/logs/event_study.log.
> /loop 15m Read the tail of results/logs/event_study.log. If standard errors diverge, the likelihood plateaus for more than 45 minutes, or the process has died, stop and report what you saw with the relevant log lines. Otherwise reply OK and nothing else.The whole design burden is in that second prompt, and it is your-check-shaped: you must be able to write down what “fine” looks like before you go to bed. Three properties make a loop prompt work:
- Checkable — thresholds the agent can verify from the log, not vibes (“looks healthy”) it must hallucinate an opinion about.
- Bounded — the loop may read logs and report; it does not get to fix the estimation at 3 a.m. without you. Repairs are a morning decision.
- Quiet by default — “reply OK and nothing else” is load-bearing; a supervisor that narrates every check trains you to ignore it, which re-creates the 2 a.m. warning problem C2 solved.
The composability is the point: the same loop pattern watches a download, a fleet (D4), or a CI queue. You are not buying an overnight feature; you are writing the night nurse’s checklist yourself.
Codex Your tool
Goal Mode — you define the destination
Goal Mode (GA) is an objective-driven run: you hand the agent a destination and stopping rules, and it drives toward them for hours — choosing intermediate steps, recovering from failures, deciding for itself what to try next.
Goal: estimate the event-study across all storm episodes in the panel.For each episode, fit the main specification and the three robustnessvariants. Flag any specification where pre-trends fail jointsignificance at the 5% level. Write per-episode results toresults/event_study/ under the results contract. Stop and reportimmediately if anything smells like leakage — a regressor dated afterthe outcome, a filter that references the treatment window. Hard stopat 6 hours.The design burden here is destination-shaped: you are not writing the checks, you are writing the success criteria and — more importantly — the abort criteria. Three properties make a goal survivable:
- A measurable objective — “all episodes estimated, pre-trends flagged” is verifiable at 7 a.m.; “improve the robustness” is an invitation to creative accounting.
- Explicit stopping rules — the leakage clause and the hard stop are not decoration. An objective-driven agent without abort criteria optimizes through the night, including through things it should have woken you for.
- A contained blast radius — run it in a D2 worktree under the B3
pipeline profile, writing only under
results/. The morning review is then a diff, not an investigation.
The managed delegation is the point: hours of unattended, multi-step progress with one written brief — closer to handing a project to a senior RA than a checklist to a junior one.
| Intent | Claude Code | Codex |
|---|---|---|
| supervise a long-running job | /loop (recurring self-paced check) + a background task running the job | Goal Mode (objective-driven multi-hour run) |
| scheduled autonomous work | Routines (cloud; cron / GitHub / API triggers) | Codex cloud tasks + GitHub integration |
| background runs | local background tasks, composable with loops and workflows | cloud tasks in isolated containers, each returning a reviewable diff |
The decision rubric
The decision rule is shorter than the agonizing: if you can write the check, write a loop — a bounded, checkable condition wants recurring verification. If you can only write the objective, write a goal — an open-ended optimization with a measurable endpoint wants delegation. If you can write neither, you are not ready to run it overnight, and no tool fixes that.
| You can write down… | The run is | Reach for | Because the burden is |
|---|---|---|---|
| what “fine” looks like, checkable from a log | one fixed job to babysit | a loop watching a background task | a recurring check you author once |
| what “done” looks like, plus what must abort it | open-ended, multi-step | an objective-driven goal run | a destination with explicit stopping rules |
| neither, honestly | not ready | nothing — write the check or the objective first | judgment you have not yet externalized |
The two transcripts in the scenario below are the same night run under each philosophy. Read them side by side once before you run your own: watch where the loop chooses to speak and where the goal run chooses to decide alone, because those two moments are where each philosophy’s signature failure lives.
Both philosophies have a signature failure mode, and each is the other’s mirror:
- Loops spam. A check every two minutes on a log that updates every twenty produces sixty token-burning “OK”s a night and a supervisor you stop reading. Match cadence to how fast the watched thing actually changes.
- Goals game. An objective-framed prompt invites metric gaming — the specification that “passes” pre-trends because the sample quietly shrank. This is not a hypothetical; it is why D4 pairs every goal-shaped run with an adversarial referee.
And both run under the same discipline regardless of tool: the B3
pipeline profile (writes scoped to results/ and data/processed/),
inside a D2 worktree, so that whatever happens at 3 a.m. happens in a
sandbox your morning self can diff, merge, or delete.
Guided Run — The Night Shift: a loop you can write down
claudeGuided Run — The Night Shift: a destination you can write down
claudeField Assignment
Artifact make check-d3 passes — both overnight transcripts filed and compared
Run the project’s two long jobs overnight, one under each philosophy, and bring back the transcripts. This is deliberately a both-tools exercise: the comparison is the deliverable.
- [CC] Start the main event-study estimation as a background task in a fresh worktree under the pipeline profile; supervise it with a loop prompt that checks divergence, plateau, and process death on a 15-minute cadence, quiet otherwise.
- [CX] The same evening, hand the robustness batch to Goal Mode with a measurable objective, the leakage abort clause, and a hard time stop, in its own worktree.
- In the morning, read both transcripts before touching results: when did the loop speak and was it right to; what did the goal run decide alone and would you have decided the same.
- File both transcripts and a one-page comparison in
journal/— which philosophy you’d trust with what, and why. Thenmake check-d3.
The comparison memo feeds D4 directly: the fleet you are about to run is goal-shaped work at scale, and the referee exists because of what you noticed in step 3.
make check-d3advances D3The loop prompt must be checkable, bounded, and quiet by default.
Writes scoped to results/ and data/processed/ — the 3 a.m. failure must be a deletable directory.
When did the loop speak; what did the goal run decide alone; which interventions were yours.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
〜〜
Objective-framed prompts invite metric gaming. “Make the pre-trends pass” can be satisfied by shrinking the sample, re-binning the event window, or dropping the inconvenient episodes — all of which look like diligence in a transcript. Pair every goal-shaped run with adversarial review (D4’s referee), and write objectives about what to estimate, never about what the answer should look like.
- [CC]
Over-frequent loop checks burn tokens for nothing and bury the one report that matters under sixty OKs. The cadence question is empirical: how fast does the watched thing change? Estimation logs move in tens of minutes, not seconds.
- [both]
Long runs outside a worktree leave half-written state in your main tree when they die at 3 a.m. — and they will, eventually, die at 3 a.m. The worktree is not optional hygiene; it is what makes the overnight failure a deletable directory instead of a forensic morning.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)Field journal
Parity note
There is no isomorphism here and this page hasn’t pretended otherwise: Claude Code composes local primitives — a recurring loop, background tasks, cloud Routines for scheduled work — while Codex delegates to a managed objective-driven run, with cloud tasks as the scheduled analogue. Neither tool offers the other’s primitive natively; the loop-with-an-objective-framed-prompt approximates Goal Mode about as well as a checklist approximates a brief, which is to say usefully and incompletely. The asymmetry is a design philosophy, not a feature gap, and it is taught as such above.