The Pain
The TLC publishes its taxi data on a calendar you do not control: a new month, every month, on a roughly two-month lag, dropped onto a CDN without an announcement that reaches you. Your analysis was current the day you ran it and has been quietly rotting ever since. There are three fresh months on the server right now that your panel does not know exist, and the elasticities you are about to present rest on a window that closed in the winter.
Keeping a study current is the least glamorous job in any lab and the one most reliably skipped. Someone has to remember the cadence, check the source on a schedule, pull the new file when it lands, run it through the same gauntlet every prior month passed, fold it into the warehouse, and re-estimate — and then, crucially, not just publish the new numbers, but bring them to a person who decides whether they belong. It is standing-chore work: low-judgment to perform, high-stakes to skip, and impossible to do on a cadence when the person responsible is also writing the paper, supervising the RAs, and teaching two sections. The data keeps arriving. Your attention does not keep pace. The gap between the latest drop and your latest estimate is the half-life of your study’s relevance, and right now nobody is watching the calendar.
Why / When
The standing chores of a lab share a shape: they recur on a cadence nobody owns, each instance is mechanical, and the cost is not in any one instance but in the forgetting. This lesson hands those chores to an agent that runs without your laptop open — scheduled in the cloud, triggered by a clock or an event or a person filing a ticket, doing the mechanical work and stopping at the one decision that needs a human.
The two tools reach this from opposite directions, and the difference is the lesson — the same split this unit keeps returning to. One composes a scheduled cloud agent you define declaratively: a trigger, a narrow profile, a task, on a cron. The other delegates to a cloud environment you assign work to like an RA — you open a ticket, the agent picks it up, investigates, and reports back. Schedule versus delegate; a clock that fires versus a colleague you hand a task. Both absorb the lab manager’s standing chores; both run somewhere other than your machine, which is the source of both their power and their distinct risk class. This serves the ingestion and maintenance stage — the work of keeping a live study live.
Contrary winds
Not for: a one-time backfill you'll run once and never again — scheduling has standing overhead and a standing risk surface, so don't put a single errand on a cron.
Mechanics
The chore made visible — the monthly cadence the rest of this lesson automates. Watch where it stops:
Step 1 of 7.
The lab is no longer something you sit at. TLC publishes new taxi data every month — with a ~2-month lag — so a static analysis rots on contact with the next drop. The fix is a cadence: a routine set on a cron, waiting for a date that hasn't come yet.
The payoff is the stop. The routine does every mechanical step — checks the source, downloads, runs the C2 contracts, appends, re-estimates — and then opens a pull request and waits. It reports the updated estimates; it never publishes them. “Report, don’t act” on anything irreversible is the whole safety posture of an unattended agent, and the two tools below implement the same cadence through different primitives.
Claude Code Your tool
Routines — a scheduled cloud agent
A Routine is a cloud-hosted agent on a trigger: a cron schedule, a GitHub event, or an API call wakes it, it runs its task in a managed environment with no laptop involved, and it goes back to sleep. You define it declaratively — when it fires, what profile it runs under, what it does — and the cloud keeps the clock.
The project’s routine is the monthly ingest — it keeps current the same
fixed slice you first fetched with the kit’s python3 get_data.py
(Get the data), extending it as new months land. It fires
after the TLC’s usual drop date and walks the cadence end to end:
Trigger: cron, the 5th of each month, 06:00 ET.Profile: ingest-only (write data/raw/ and results/, nothing else).
1. Check the TLC CDN for a yellow-taxi month newer than the latest in data/raw/. If none, exit quietly — no PR, no noise.2. Download it to a temp path, verify the checksum, move into data/raw/.3. Run scripts/validate_contracts.py over the new month (the C2 gate). If it fails, STOP and open an issue with the contract output — do not append a month that breaks its contract.4. Append to the warehouse; re-estimate the headline specs.5. Open a PR titled "ingest: <month>" with the updated estimates and a diff of the elasticity table — and STOP. A human merges, or doesn't.The structure is the safety. The routine’s world is exactly the cloud environment it runs in — it cannot reach a local MCP server or a file on your laptop, because your laptop is asleep, so the task is designed for a self-contained world: public CDN in, repository out. And the cron is not the interesting part; the stop is. Step 5 produces a reviewable diff and hands it to you. The estimate-changing decision — does this month’s data belong in the published result — is the one thing the routine is forbidden to make.
The same Routine machinery runs the E1 reproducibility self-test on a weekly cron, or fires the referee on a GitHub event. The monthly ingest is one instance of a general pattern: standing work, on a schedule, ending at a gate.
Codex Your tool
Cloud delegation — assign the chore like an RA
The other model is delegation to a cloud environment: a managed, sandboxed container where the agent does work you assign, asynchronously, without your machine. You do not write a cron; you hand it a task the way you would hand one to a research assistant — by filing it — and the agent picks it up, works in its isolated environment, and posts its findings back where you filed it.
The refresh investigation is assigned rather than scheduled. You open a GitHub issue and mention the cloud agent in it:
Title: Monthly refresh — is there a new TLC month?
@reviewer check the TLC CDN for a yellow-taxi month newer than thelatest in data/raw/. If there is one: download it, runscripts/validate_contracts.py over it, and report back here whether itpasses its contract and what the headline elasticities would become ifwe appended it. Do NOT append it or open a PR yet — just report.The agent spins up its environment, does the investigation, and posts a comment on the issue: the new month’s number, the contract verdict, the estimates it would produce. The chore becomes a conversation in the issue tracker — auditable, assignable, and stopping by default at a report rather than an action. If the report looks right, you ask it to open the PR in a follow-up; the irreversible step stays yours.
For teams that live in a project tracker rather than GitHub, the same delegation flows from the tracker’s sidebar — file the refresh as a tracker task, the agent picks it up in its cloud environment and reports back on the task. The surface changes; the model does not: assign, investigate, report, await your word on anything that writes.
| Intent | Claude Code | Codex |
|---|---|---|
| scheduled autonomous work | Routines (cloud-hosted, cron / GitHub-event / API triggers) | cloud tasks delegated via the issue tracker + GitHub integration |
| kick off the monthly refresh | a cron trigger fires the routine unattended | you (or a teammate) file an issue assigning it to the cloud agent |
| where the unattended work runs | a managed cloud environment — no local files or MCP servers reachable | an isolated cloud container — local-only resources are likewise out of reach |
| the irreversible step (publish the estimate-changing PR) | routine opens the PR and STOPS; a human merges | agent reports; a human asks for the PR; a human merges |
Guardrails for unattended agents
This is the sober section, and it is shared because the discipline is identical regardless of which primitive runs the chore. An unattended agent is a different risk class from an interactive one: there is no human watching the step it is about to take, so every guardrail you lean on interactively — I’ll just glance at what it’s doing — is gone. Four non-negotiables:
- A dedicated, narrower profile — never the interactive one. The
routine that ingests data needs to write
data/raw/andresults/and nothing else. It does not need your full permission set, and the blast radius of an unattended agent is whatever you granted it while no one was looking. Build the profile for the chore, not for your convenience. - Spend caps. A scheduled agent with a loop and no budget ceiling is a bill with no upper bound. Cap the tokens and the wall-clock per run; a routine that blows its cap should stop and report, not push through.
- “Report, don’t act” on anything irreversible. The default for an unattended agent facing a one-way door — publishing, deleting, merging, sending — is to describe what it would do and stop. The monthly ingest does every reversible step and halts at the PR. The irreversible step is a human’s.
- Human approval on any estimate-changing PR. This is the specific case the whole lesson protects. A month that quietly shifts the published elasticities is exactly the month a person must look at. The PR is the gate; auto-merging it is how one bad TLC drop becomes your published result.
These are not paranoia; they are the price of the laptop being closed. The C2 hook protected a write you were present for; these protect a write made while you were asleep, which is strictly the more dangerous one.
Guided Run — The Standing Chore
Define a monthly-ingest routine on a cron with an ingest-only profileGuided Run — The Standing Chore
Define a monthly-ingest routine on a cron with an ingest-only profileField Assignment
Artifact make check-e2 passes — the monthly refresh runs unattended, dry-run against the latest real TLC drop, ending at a PR/report a human approves
Stand up the monthly refresh under each tool and dry-run it against the latest real TLC drop. This is a both-tools exercise: the contrast in how the chore is triggered and where it stops is the deliverable.
- [CC] Define the monthly-ingest Routine with a dedicated ingest-only profile, a spend cap, and a cron trigger. Dry-run it against the latest real TLC month: it must check the CDN, download, pass the C2 contracts, re-estimate, and stop at a PR — never merge.
- [CX] File the refresh as an issue assigned to the cloud agent. Confirm it investigates in its isolated environment and reports back on the issue — the new month, the contract verdict, the would-be estimates — without writing.
- For both: confirm the guardrails actually bind. Try to make each one take the irreversible step (merge / append) unattended and verify it refuses and reports instead.
- Log behavior and cost for both runs in
journal/: what each did autonomously, where each stopped, the token and wall-clock spend, and which guardrail you were most glad you set. Thenmake check-e2.
make check-e2 verifies the refresh ran end-to-end against a real
month, that it stopped at a human-approval gate rather than publishing,
and that the run cost and behavior are logged. This is the cadence E3
packages so a new lab inherits it in one command.
make check-e2advances E2Designed for a self-contained world: public CDN in, repository out — no local files or MCP servers reachable.
Auto-merging an estimate-changing PR is how one bad month becomes the published result.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
〜〜
Unattended agents are a different risk class — dedicated profile, always. The permission set that is fine when you are watching every step is a standing liability when no one is. A routine that runs at 3 a.m. under your full interactive profile is a key under the mat; the chore needs exactly the access the chore requires and not one scope more.
- [both]
〜〜
Auto-merging an estimate-changing PR is how one bad month becomes the published result. The entire value of the monthly refresh is that it stops at a human before the numbers ship; wire it to merge itself and you have automated the one decision that needed judgment, turning a schema drift or a half-published TLC file into your headline elasticity with no one in the loop.
- [CC]
A routine runs in the cloud: local-only MCP servers and files on your laptop are not reachable, because your laptop is closed. Design the routine’s world to be self-contained — public source in, repository out — or it will fail at 6 a.m. reaching for a server that only exists where you are asleep.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)Field journal
Parity note
There is no isomorphism here and this page has not pretended one. Claude Code composes a scheduled cloud agent you define declaratively — a trigger, a profile, a task on a cron — while Codex delegates to a cloud environment you assign work to through the issue tracker, with a project tracker as an alternate intake. Neither is the other’s primitive: a cron that fires unattended is not the same shape as a colleague you hand a ticket, even when both produce the monthly refresh and both stop at the same human-approval gate. The guardrails — a dedicated narrow profile, spend caps, report-don’t-act on the irreversible, human approval on the estimate-changing PR — are identical across both, because they are a discipline the laptop being closed demands, not a feature either vendor sells. See the parity matrix for the dated comparison.