E2 advanced ~45 min

The Lab That Runs Itself: Scheduled & Cloud Agents

Absorbs: the lab manager's standing chores

Advances E2

The Pain

The TLC publishes its taxi data on a calendar you do not control: a new month, every month, on a roughly two-month lag, dropped onto a CDN without an announcement that reaches you. Your analysis was current the day you ran it and has been quietly rotting ever since. There are three fresh months on the server right now that your panel does not know exist, and the elasticities you are about to present rest on a window that closed in the winter.

Keeping a study current is the least glamorous job in any lab and the one most reliably skipped. Someone has to remember the cadence, check the source on a schedule, pull the new file when it lands, run it through the same gauntlet every prior month passed, fold it into the warehouse, and re-estimate — and then, crucially, not just publish the new numbers, but bring them to a person who decides whether they belong. It is standing-chore work: low-judgment to perform, high-stakes to skip, and impossible to do on a cadence when the person responsible is also writing the paper, supervising the RAs, and teaching two sections. The data keeps arriving. Your attention does not keep pace. The gap between the latest drop and your latest estimate is the half-life of your study’s relevance, and right now nobody is watching the calendar.

Why / When

The standing chores of a lab share a shape: they recur on a cadence nobody owns, each instance is mechanical, and the cost is not in any one instance but in the forgetting. This lesson hands those chores to an agent that runs without your laptop open — scheduled in the cloud, triggered by a clock or an event or a person filing a ticket, doing the mechanical work and stopping at the one decision that needs a human.

The two tools reach this from opposite directions, and the difference is the lesson — the same split this unit keeps returning to. One composes a scheduled cloud agent you define declaratively: a trigger, a narrow profile, a task, on a cron. The other delegates to a cloud environment you assign work to like an RA — you open a ticket, the agent picks it up, investigates, and reports back. Schedule versus delegate; a clock that fires versus a colleague you hand a task. Both absorb the lab manager’s standing chores; both run somewhere other than your machine, which is the source of both their power and their distinct risk class. This serves the ingestion and maintenance stage — the work of keeping a live study live.

Contrary winds

Not for: a one-time backfill you'll run once and never again — scheduling has standing overhead and a standing risk surface, so don't put a single errand on a cron.

Mechanics

The chore made visible — the monthly cadence the rest of this lesson automates. Watch where it stops:

System Player film — A Month in the Life
[Film: a month in the life of a self-maintaining lab. A TLC monthly data drop lands on the calendar (monthly, on a roughly two-month lag); the scheduled routine wakes unattended (cloud cron, no laptop, narrow profile, spend cap); it runs the C2 contracts on the new month and appends it to the warehouse, re-estimates the headline specifications, and opens a pull request that stops at a human-approval gate. It reports the updated estimates but never merges them itself.]

The lab is no longer something you sit at. TLC publishes new taxi data every month — with a ~2-month lag — so a static analysis rots on contact with the next drop. The fix is a cadence: a routine set on a cron, waiting for a date that hasn't come yet.

The payoff is the stop. The routine does every mechanical step — checks the source, downloads, runs the C2 contracts, appends, re-estimates — and then opens a pull request and waits. It reports the updated estimates; it never publishes them. “Report, don’t act” on anything irreversible is the whole safety posture of an unattended agent, and the two tools below implement the same cadence through different primitives.

Claude Code

Routines — a scheduled cloud agent

A Routine is a cloud-hosted agent on a trigger: a cron schedule, a GitHub event, or an API call wakes it, it runs its task in a managed environment with no laptop involved, and it goes back to sleep. You define it declaratively — when it fires, what profile it runs under, what it does — and the cloud keeps the clock.

The project’s routine is the monthly ingest — it keeps current the same fixed slice you first fetched with the kit’s python3 get_data.py (Get the data), extending it as new months land. It fires after the TLC’s usual drop date and walks the cadence end to end:

the monthly-ingest routine, in its own brief
Trigger: cron, the 5th of each month, 06:00 ET.
Profile: ingest-only (write data/raw/ and results/, nothing else).
1. Check the TLC CDN for a yellow-taxi month newer than the latest in
   data/raw/. If none, exit quietly: no PR, no noise.
2. Download it to a temp path, verify the checksum, move into data/raw/.
3. Run scripts/validate_contracts.py over the new month (the C2 gate).
   If it fails, STOP and open an issue with the contract output; do
   not append a month that breaks its contract.
4. Append to the warehouse; re-estimate the headline specs.
5. Open a PR titled "ingest: <month>" with the updated estimates and a
   diff of the elasticity table, and STOP. A human merges, or doesn't.
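
Step 1 of the brief, "is there a month newer than the latest in data/raw/", can be sketched as a pure function over file names. This is a minimal sketch, not the kit's implementation: it assumes the raw files are named yellow_tripdata_YYYY-MM.parquet, and the function names are hypothetical. It only works out which file the routine should look for next; probing the CDN for that file is left out.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Assumed naming convention for files in data/raw/ (an assumption of this sketch).
MONTH_FILE = re.compile(r"^yellow_tripdata_(\d{4})-(\d{2})\.parquet$")


def latest_month(filenames: Iterable[str]) -> Optional[date]:
    """Return the newest month present, as the first day of that month."""
    months = []
    for name in filenames:
        match = MONTH_FILE.match(name)
        if match:
            months.append(date(int(match.group(1)), int(match.group(2)), 1))
    return max(months) if months else None


def next_month(month: date) -> date:
    """Return the first day of the month after `month`."""
    if month.month == 12:
        return date(month.year + 1, 1, 1)
    return date(month.year, month.month + 1, 1)


def candidate_filename(filenames: Iterable[str]) -> Optional[str]:
    """Name of the next file to look for, or None if no month is on disk yet."""
    newest = latest_month(filenames)
    if newest is None:
        return None
    target = next_month(newest)
    return f"yellow_tripdata_{target.year:04d}-{target.month:02d}.parquet"
```

In a routine this might be called as candidate_filename(p.name for p in Path("data/raw").iterdir()); if the candidate is not yet on the CDN, the routine exits quietly, as the brief requires.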

The structure is the safety. The routine’s world is exactly the cloud environment it runs in — it cannot reach a local MCP server or a file on your laptop, because your laptop is asleep, so the task is designed for a self-contained world: public CDN in, repository out. And the cron is not the interesting part; the stop is. Step 5 produces a reviewable diff and hands it to you. The estimate-changing decision — does this month’s data belong in the published result — is the one thing the routine is forbidden to make.
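
The "structure is the safety" point can be made concrete with a small control-flow sketch. It is an illustration under stated assumptions, not the Routine's real definition: the four callables are hypothetical stand-ins for the brief's steps, and the only thing the sketch shows is where the flow is allowed to end. There are three exits (nothing to do, an issue, a drafted pull request) and no code path that merges.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Outcome:
    action: str  # "none", "issue", or "pull_request"; there is no "merge"
    detail: str


def run_monthly_ingest(
    find_new_month: Callable[[], Optional[str]],
    download_and_verify: Callable[[str], str],
    validate_contracts: Callable[[str], Tuple[bool, str]],
    append_and_estimate: Callable[[str], str],
) -> Outcome:
    """Walk the cadence and stop at the first exit that applies."""
    month = find_new_month()
    if month is None:
        # Step 1: nothing new, so exit quietly.
        return Outcome("none", "no new month")

    path = download_and_verify(month)  # step 2

    passed, contract_output = validate_contracts(path)  # step 3, the C2 gate
    if not passed:
        # A failing month is reported, never appended.
        return Outcome("issue", contract_output)

    summary = append_and_estimate(path)  # step 4

    # Step 5: describe the pull request and stop. Merging is a human's call.
    return Outcome("pull_request", f"ingest: {month}\n{summary}")
```

Passing the steps in as functions keeps the sketch runnable without a network or a warehouse, and makes it easy to dry-run the stop conditions with stubs.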

The same Routine machinery runs the E1 reproducibility self-test on a weekly cron, or fires the referee on a GitHub event. The monthly ingest is one instance of a general pattern: standing work, on a schedule, ending at a gate.

Codex

Cloud delegation — assign the chore like an RA

The other model is delegation to a cloud environment: a managed, sandboxed container where the agent does work you assign, asynchronously, without your machine. You do not write a cron; you hand it a task the way you would hand one to a research assistant — by filing it — and the agent picks it up, works in its isolated environment, and posts its findings back where you filed it.

The refresh investigation is assigned rather than scheduled. You open a GitHub issue and mention the cloud agent in it:

the refresh, filed as an issue
Title: Monthly refresh — is there a new TLC month?
@reviewer check the TLC CDN for a yellow-taxi month newer than the
latest in data/raw/. If there is one: download it, run
scripts/validate_contracts.py over it, and report back here whether it
passes its contract and what the headline elasticities would become if
we appended it. Do NOT append it or open a PR yet — just report.

The agent spins up its environment, does the investigation, and posts a comment on the issue: the new month’s number, the contract verdict, the estimates it would produce. The chore becomes a conversation in the issue tracker — auditable, assignable, and stopping by default at a report rather than an action. If the report looks right, you ask it to open the PR in a follow-up; the irreversible step stays yours.
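
The report described above (the new month, the contract verdict, the would-be estimates) is easy to picture as plain text. The following is a hedged sketch of how such a comment could be assembled; the function names and the dictionary layout of spec name to elasticity are assumptions made for illustration, not the agent's actual output format.

```python
from typing import Dict


def elasticity_diff(current: Dict[str, float], proposed: Dict[str, float]) -> str:
    """Render a before/after table covering every spec found in either dict."""
    lines = [f"{'spec':<20}{'current':>10}{'proposed':>10}{'change':>10}"]
    for spec in sorted(set(current) | set(proposed)):
        old = current.get(spec)
        new = proposed.get(spec)
        old_txt = f"{old:.3f}" if old is not None else "n/a"
        new_txt = f"{new:.3f}" if new is not None else "n/a"
        if old is not None and new is not None:
            change = f"{new - old:+.3f}"
        else:
            change = "n/a"  # spec exists on only one side
        lines.append(f"{spec:<20}{old_txt:>10}{new_txt:>10}{change:>10}")
    return "\n".join(lines)


def build_report(
    month: str,
    contract_passed: bool,
    current: Dict[str, float],
    proposed: Dict[str, float],
) -> str:
    """Assemble the text of a report-only comment; nothing is written to the repo."""
    verdict = "PASS" if contract_passed else "FAIL"
    return "\n".join(
        [
            f"New month found: {month}",
            f"Contract verdict: {verdict}",
            "Estimates if appended (nothing has been written):",
            elasticity_diff(current, proposed),
        ]
    )
```

The same table would serve as the "diff of the elasticity table" in a pull request body; only the destination differs.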

For teams that live in a project tracker rather than GitHub, the same delegation flows from the tracker’s sidebar — file the refresh as a tracker task, the agent picks it up in its cloud environment and reports back on the task. The surface changes; the model does not: assign, investigate, report, await your word on anything that writes.

Translation guide
Intent | Claude Code | Codex
scheduled autonomous work | Routines (cloud-hosted, cron / GitHub-event / API triggers) | cloud tasks delegated via the issue tracker + GitHub integration
kick off the monthly refresh | a cron trigger fires the routine unattended | you (or a teammate) file an issue assigning it to the cloud agent
where the unattended work runs | a managed cloud environment; no local files or MCP servers reachable | an isolated cloud container; local-only resources are likewise out of reach
the irreversible step (publish the estimate-changing PR) | routine opens the PR and STOPS; a human merges | agent reports; a human asks for the PR; a human merges

Guardrails for unattended agents

This is the sober section, and it is shared because the discipline is identical regardless of which primitive runs the chore. An unattended agent is a different risk class from an interactive one: there is no human watching the step it is about to take, so every guardrail you lean on interactively (“I’ll just glance at what it’s doing”) is gone. Four non-negotiables:

  • A dedicated, narrower profile — never the interactive one. The routine that ingests data needs to write data/raw/ and results/ and nothing else. It does not need your full permission set, and the blast radius of an unattended agent is whatever you granted it while no one was looking. Build the profile for the chore, not for your convenience.
  • Spend caps. A scheduled agent with a loop and no budget ceiling is a bill with no upper bound. Cap the tokens and the wall-clock per run; a routine that blows its cap should stop and report, not push through.
  • “Report, don’t act” on anything irreversible. The default for an unattended agent facing a one-way door — publishing, deleting, merging, sending — is to describe what it would do and stop. The monthly ingest does every reversible step and halts at the PR. The irreversible step is a human’s.
  • Human approval on any estimate-changing PR. This is the specific case the whole lesson protects. A month that quietly shifts the published elasticities is exactly the month a person must look at. The PR is the gate; auto-merging it is how one bad TLC drop becomes your published result.
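
The first two guardrails can be sketched in a few lines. This is an illustration only: real profiles and spend caps are configured in whichever tool runs the chore, and the names below (ALLOWED_WRITE_ROOTS, write_allowed, SpendCap) are hypothetical. The sketch shows the shape of the checks: a write is allowed only inside data/raw/ or results/, and a run that exceeds its token or wall-clock budget raises instead of pushing through.

```python
import time
from pathlib import PurePosixPath

# The ingest-only profile from the brief: write data/raw/ and results/, nothing else.
ALLOWED_WRITE_ROOTS = ("data/raw", "results")


def write_allowed(relative_path: str) -> bool:
    """True only for repo-relative paths strictly inside an allowed root."""
    path = PurePosixPath(relative_path)
    if path.is_absolute() or ".." in path.parts:
        return False  # no absolute paths, no climbing out of the root
    for root in ALLOWED_WRITE_ROOTS:
        root_parts = PurePosixPath(root).parts
        inside = path.parts[: len(root_parts)] == root_parts
        if inside and len(path.parts) > len(root_parts):
            return True
    return False


class SpendCapExceeded(RuntimeError):
    """Raised so the caller can stop and report rather than continue."""


class SpendCap:
    """Tracks tokens and wall-clock for one run against fixed ceilings."""

    def __init__(self, max_tokens: int, max_seconds: float, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.tokens_used = 0

    def charge(self, tokens: int) -> None:
        """Record usage and raise as soon as either ceiling is crossed."""
        self.tokens_used += tokens
        elapsed = self._clock() - self._start
        if self.tokens_used > self.max_tokens:
            raise SpendCapExceeded(f"token cap exceeded: {self.tokens_used} > {self.max_tokens}")
        if elapsed > self.max_seconds:
            raise SpendCapExceeded(f"time cap exceeded: {elapsed:.0f}s > {self.max_seconds:.0f}s")
```

Comparing path components rather than string prefixes is deliberate: a prefix test would accept data/rawish/ as if it were data/raw/.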

These are not paranoia; they are the price of the laptop being closed. The C2 hook protected a write you were present for; these protect a write made while you were asleep, which is strictly the more dangerous one.

Guided Run — The Standing Chore

Field Terminal — session: e2-routine Claude Code
Define a monthly-ingest routine on a cron with an ingest-only profile

Field Assignment

Artifact: make check-e2 passes. The monthly refresh runs unattended, dry-run against the latest real TLC drop, ending at a PR/report a human approves.

Stand up the monthly refresh under each tool and dry-run it against the latest real TLC drop. This is a both-tools exercise: the contrast in how the chore is triggered and where it stops is the deliverable.

  1. [CC] Define the monthly-ingest Routine with a dedicated ingest-only profile, a spend cap, and a cron trigger. Dry-run it against the latest real TLC month: it must check the CDN, download, pass the C2 contracts, re-estimate, and stop at a PR — never merge.
  2. [CX] File the refresh as an issue assigned to the cloud agent. Confirm it investigates in its isolated environment and reports back on the issue — the new month, the contract verdict, the would-be estimates — without writing.
  3. For both: confirm the guardrails actually bind. Try to make each one take the irreversible step (merge / append) unattended and verify it refuses and reports instead.
  4. Log behavior and cost for both runs in journal/: what each did autonomously, where each stopped, the token and wall-clock spend, and which guardrail you were most glad you set. Then make check-e2.
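
For step 4, one possible way to keep the two runs comparable is an append-only log with one JSON line per run. This is a sketch under assumptions: the file name e2_runs.jsonl and the field names are hypothetical, and nothing here is known about what make check-e2 expects to find in journal/.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def log_run(
    journal_dir: str,
    tool: str,
    autonomous_steps: List[str],
    stopped_at: str,
    tokens: int,
    wall_clock_s: float,
    guardrail_note: str,
) -> Path:
    """Append one run record to the journal and return the log file's path."""
    entry = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "tool": tool,                          # e.g. "CC" or "CX"
        "autonomous_steps": autonomous_steps,  # what it did without you
        "stopped_at": stopped_at,              # where it handed back to a human
        "tokens": tokens,
        "wall_clock_s": wall_clock_s,
        "guardrail_note": guardrail_note,      # the guardrail you were glad you set
    }
    journal = Path(journal_dir)
    journal.mkdir(parents=True, exist_ok=True)
    log_path = journal / "e2_runs.jsonl"  # hypothetical file name
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return log_path
```

One record per tool, written in the same shape, makes the contrast the assignment asks for (how each was triggered, where each stopped, what each cost) readable side by side.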

make check-e2 verifies the refresh ran end-to-end against a real month, that it stopped at a human-approval gate rather than publishing, and that the run cost and behavior are logged. This is the cadence E3 packages so a new lab inherits it in one command.

Milestone gate · make check-e2 · advances E2
  1. Designed for a self-contained world: public CDN in, repository out — no local files or MCP servers reachable.

  2. Auto-merging an estimate-changing PR is how one bad month becomes the published result.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Unattended agents are a different risk class — dedicated profile, always. The permission set that is fine when you are watching every step is a standing liability when no one is. A routine that runs at 3 a.m. under your full interactive profile is a key under the mat; the chore needs exactly the access the chore requires and not one scope more.

  • [both]

    Auto-merging an estimate-changing PR is how one bad month becomes the published result. The entire value of the monthly refresh is that it stops at a human before the numbers ship; wire it to merge itself and you have automated the one decision that needed judgment, turning a schema drift or a half-published TLC file into your headline elasticity with no one in the loop.

  • [CC]

    A routine runs in the cloud: local-only MCP servers and files on your laptop are not reachable, because your laptop is closed. Design the routine’s world to be self-contained — public source in, repository out — or it will fail at 6 a.m. reaching for a server that only exists where you are asleep.

Check Your Bearings

E2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

Field journal

as of June 2026

Parity note

There is no isomorphism here and this page has not pretended one. Claude Code composes a scheduled cloud agent you define declaratively — a trigger, a profile, a task on a cron — while Codex delegates to a cloud environment you assign work to through the issue tracker, with a project tracker as an alternate intake. Neither is the other’s primitive: a cron that fires unattended is not the same shape as a colleague you hand a ticket, even when both produce the monthly refresh and both stop at the same human-approval gate. The guardrails — a dedicated narrow profile, spend caps, report-don’t-act on the irreversible, human approval on the estimate-changing PR — are identical across both, because they are a discipline the laptop being closed demands, not a feature either vendor sells. See the parity matrix for the dated comparison.

Ledger — E2

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall: the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster, but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes: the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated: the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall: the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file: defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster, but only if you direct the work | the slow tax of an undriven session: drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes: the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time, plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon, most of it download wall-clock, not attention
B3 | the data manager who guards the raw files: the person who says no near the master copies | permanent vigilance you cannot staff: one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist: the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance, est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source (credentials, retries, schema spelunking, timezone forensics), re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool, and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits, and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch, and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks: dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns (check the CDN, pull, contract, append, re-estimate), reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated: the PI who designs the system instead of doing the work | each revision is a serialized chain (re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract), correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.