B2 · beginner · ~45 min

A Reproducible Home

Absorbs: the careful senior who plans before touching data

Advances B2

The Pain

Every research project that dies of disorganization dies the same way, and it starts innocently: analysis_v2_final.py next to analysis_v2_final_REAL.py, a temp/ directory that is four months old, raw files edited “only that once” to fix an encoding, and a notebook whose cells run correctly in exactly one order that nobody wrote down. You know this repo. You may, somewhere on a backup drive, own this repo.

The second act is quieter and more expensive. Week two, eager for a first result, you joined trips to zones on a column that was almost, but not exactly, the key — LocationID against a lookup that had been deduplicated differently — and the join was wrong by a few hundred rows out of millions. Nothing crashed; the panel built; results accumulated on top. By the time the discrepancy surfaced, fourteen downstream files assumed the bad join, and “fix it” meant “find everything that ever touched it.” A senior colleague would have stopped you on day one: structure first, then a written plan of every join and aggregation, reviewed before anything is computed — because aggregations are lossy, queries cost money, and a wrong early join poisons everything downstream. That careful senior is a luxury hire. You are about to add a collaborator who works at machine speed and amplifies whatever structure exists — including none.

Why / When

This lesson sets up the three disciplines that make agent mistakes cheap, in rising order of subtlety. Structure: a repository layout where every path means something, so an assistant — human or machine — can navigate it without folklore. Plan-first: a read-only reconnaissance of the data, producing a written audit plan that you approve before any transform runs; the cheapest moment to catch the wrong join is before it executes. Undo: sessions, checkpoints, and cheap rollback, so that experiments cost minutes instead of forensics. Together they absorb the careful senior who plans before touching data, and they accelerate the project-setup and data-acquisition stages — the unglamorous fifth of the project that determines whether the other four-fifths reproduce.

An agent multiplies whatever it finds: in a structured repo it re-runs pipelines and respects fences; in temp/-and-versions soup it generates more soup, faster than you ever could alone.

Contrary winds

Not for: a one-afternoon scratch analysis of data you may never touch again — though count, honestly, how many of those became Chapter 2.

Mechanics

Three disciplines, in order: a navigable home, the plan-first habit, and cheap undo.

A repo agents can navigate

The scaffold is the contract every later lesson assumes — B3 fences it, C2 enforces it, D4 writes results into it:

weather-mobility/ — the scaffold
weather-mobility/
├── data/
│   ├── raw/         # append-only; never edited (convention now, law in B3/C2)
│   └── processed/   # rebuilt from raw by scripts — always disposable
├── src/             # transforms and estimation — the only code that merges
├── scripts/         # download, checksum, validation — plain shell lives here
├── notebooks/       # exploration; nothing load-bearing may live here
├── docs/            # the data-audit plan, design memos
├── results/         # written by runs, owned by contracts (D4)
├── report/          # the report skeleton
├── journal/         # incidents, decisions, costs — F1's raw material
└── Makefile         # make check-b2 … make check-f1

Two conventions carry most of the weight. data/raw/ is append-only: files arrive by script, are never edited, and are the only thing that cannot be regenerated — everything in data/processed/ and results/ must be rebuildable from raw by code in the repo. And notebooks/ is quarantine: the moment an exploration matters, it graduates to src/.
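A minimal Python sketch of the scaffold step, assuming the layout above and one README per directory that states its contract. The function name `scaffold`, the `README.md` filename, and the contract wording are illustrative choices, not part of the course kit.

```python
from pathlib import Path

# Directory -> one-line contract, paraphrased from the scaffold listing.
CONTRACTS = {
    "data/raw": "Append-only; files arrive by script and are never edited.",
    "data/processed": "Rebuilt from raw by scripts; always disposable.",
    "src": "Transforms and estimation; the only code that merges.",
    "scripts": "Download, checksum, validation; plain shell lives here.",
    "notebooks": "Exploration only; nothing load-bearing lives here.",
    "docs": "The data-audit plan and design memos.",
    "results": "Written by runs, owned by contracts.",
    "report": "The report skeleton.",
    "journal": "Incidents, decisions, costs.",
}


def scaffold(root: Path) -> list[Path]:
    """Create each directory plus a one-line README; return the READMEs written."""
    written = []
    for rel, contract in CONTRACTS.items():
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        readme = directory / "README.md"
        if not readme.exists():  # never overwrite a contract someone has edited
            readme.write_text(f"{rel}/: {contract}\n", encoding="utf-8")
            written.append(readme)
    return written
```

Calling `scaffold(Path("weather-mobility"))` is safe to repeat: a second call writes nothing. The README files also solve a practical problem, since git does not track empty directories and the structure could not otherwise be committed before any data arrives.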

Pin the environment, because agents re-run everything, often — an unpinned dependency is a result that changes when nobody changed anything:

Python

pinned environment — uv
uv init --python 3.12
uv add duckdb pyarrow pandas
# uv.lock now pins the exact tree; commit it.
# Any rerun — yours, the agent's, a referee's — goes through:
uv run python src/build_panel.py

This block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).

R

pinned environment — renv
renv::init()
renv::install(c("duckdb", "arrow", "data.table"))
renv::snapshot() # renv.lock now pins the exact tree; commit it.
# Any rerun — yours, the agent's, a referee's — starts from:
renv::restore()

Raw data by plain shell

Twenty-four months of yellow and green trips arrive by the simplest tool that works — a download script and a checksum manifest. (To just have the fixed course slice on disk, run the kit’s python3 get_data.py — see Get the data; here we build the fetch ourselves to learn the pattern.) This is a deliberate lesson: fetching files needs no protocol, no server, no integration; richer plumbing earns its keep in C3, when the agent needs live systems, not files.
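The download script reads its months from scripts/months.txt, one YYYY-MM per line. A small sketch of generating that file rather than typing it, assuming the range 2023-01 through 2024-12 noted in the script's comment; `month_range` and `write_months` are hypothetical helper names.

```python
from pathlib import Path


def month_range(start: str, end: str) -> list[str]:
    """Return every YYYY-MM from start to end, inclusive."""
    year, month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def write_months(path: Path, start: str = "2023-01", end: str = "2024-12") -> int:
    """Write one month per line, newline-terminated; return the line count."""
    lines = month_range(start, end)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

The trailing newline is deliberate: a `while read -r` loop in bash skips a final line that lacks one, which would silently drop the last month.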

scripts/download_trips.sh
#!/usr/bin/env bash
set -euo pipefail
BASE=https://d37ci6vzurychx.cloudfront.net/trip-data
while read -r ym; do                 # months.txt: 2023-01 … 2024-12
  for color in yellow green; do
    out="data/raw/${color}_${ym}.parquet"
    # Download to a temp name and move only on success: an interrupted
    # curl must never leave a truncated file for the checksum manifest
    # to canonize.
    [ -f "$out" ] || {
      curl -fsSL -o "${out}.tmp" \
        "${BASE}/${color}_tripdata_${ym}.parquet"
      mv "${out}.tmp" "$out"
    }
  done
done < scripts/months.txt
shasum -a 256 data/raw/*.parquet > data/raw/SHA256SUMS

The manifest is the point, not the download. TLC re-publishes months — silently, with revised rows — and shasum -a 256 -c data/raw/SHA256SUMS (from the repo root, where the manifest’s paths resolve) is how you find out on your terms instead of mid-revision. The checksums are also the data-version hash that C2’s session briefing will print at the start of every session.
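For pipelines that want the same check from inside Python, here is a minimal sketch of what `shasum -a 256 -c` does, assuming the manifest format that `shasum` writes (hash, whitespace, path). The names `sha256_of` and `verify_manifest` are hypothetical, and the three status labels are an illustrative choice.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large parquet never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path = Path(".")) -> dict[str, str]:
    """Return {path: status} per manifest line: 'ok', 'changed', or 'missing'."""
    report = {}
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary-mode entries with '*'
        target = root / name     # manifest paths resolve from the repo root
        if not target.exists():
            report[name] = "missing"
        elif sha256_of(target) == expected:
            report[name] = "ok"
        else:
            report[name] = "changed"
    return report
```

Run it from the repo root, as with the shell command, and treat any status other than `ok` as a reason to stop before building anything on top of the file.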

Plan before you touch data

Now the discipline this page is named for. The first pass over raw data should be read-only reconnaissance ending in a written plan: which joins, on which keys, validated how; which aggregations, and what each one loses; what gets checked before anything is computed. One tool makes this a first-class mode.

Claude Code

Plan mode + the Explore and Plan agents

Plan mode is a switchable agent state in which nothing is written: the agent may read files, run read-only commands, and think, but every mutating action is off the table until you approve a written plan. Engage it and hand over the reconnaissance:

the data-audit brief
> (plan mode) Survey data/raw/: schemas of all 48 parquet files and the
zone lookup. Propose docs/data-audit-plan.md covering: every join in
the panel build (keys, expected cardinality, how validated), every
aggregation (what is lost at each), null-rate and range checks per
column, and the timezone policy for all timestamp joins. Flag the
three joins most likely to go wrong.

The built-in Explore agent does the breadth work — schema by schema, in a separate context so the survey’s noise never pollutes the main session — and the Plan agent assembles the plan for your approval. The output is a document, and that is the point: the wrong early join from the Pain vignette is caught at the cost of reading two pages, not at the cost of fourteen downstream files. You approve the plan; only then does anything execute.
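To make "keys, expected cardinality, how validated" concrete, here is a sketch of the kind of check a plan might specify for the trips-to-zones join. It only reads the two key columns and computes nothing else. The function name, the report fields, and the toy IDs are all hypothetical.

```python
from collections import Counter


def audit_many_to_one(left_keys, right_keys):
    """Check that a left-to-lookup join is many-to-one and fully matched.

    Reports lookup keys that appear more than once (each would fan out the
    left rows that match it) and left keys with no match in the lookup
    (dropped by an inner join, null-filled by a left join).
    """
    right_counts = Counter(right_keys)
    left_counts = Counter(left_keys)
    duplicated = sorted(k for k, n in right_counts.items() if n > 1)
    unmatched = {k: n for k, n in left_counts.items() if k not in right_counts}
    return {
        "duplicated_lookup_keys": duplicated,
        "unmatched_left_keys": unmatched,
        "unmatched_left_rows": sum(unmatched.values()),
    }


# Hypothetical toy data: zone IDs on trips, zone IDs in the lookup.
trip_zone_ids = [4, 4, 7, 7, 7, 12, 265, 265]
lookup_zone_ids = [4, 7, 7, 12]  # 7 appears twice; 265 is absent

report = audit_many_to_one(trip_zone_ids, lookup_zone_ids)
print(report)
```

At full scale the same assertion is available in pandas, which is already in the pinned environment: `DataFrame.merge(..., validate="many_to_one", indicator=True)` raises on duplicated lookup keys and labels unmatched rows in the `_merge` column.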

Codex has no Plan mode — the nearest equivalent is below.

Nearest equivalent — Codex

The same discipline assembles from two pieces: switch the session to the read-only approval mode (the sandbox refuses writes, so the reconnaissance physically cannot mutate anything) and dispatch the built-in explorer subagent to do the survey, with the same brief and the same mandated deliverable, docs/data-audit-plan.md. What you lose against the native mode is the seam: leaving read-only is a settings change you make rather than an approval gate the tool holds you to, so the “nothing executes before the plan is approved” rule is your habit, not the machine’s. Write the plan-approval step into the project checklist and the habit holds.

Watch this space: Read-only planning is the most-requested convergence point on both roadmaps; recheck quarterly.

Sessions, checkpoints, undo

The third discipline is making mistakes cheap after they happen. Both tools persist sessions; the rollback primitives differ.

Claude Code

Checkpoints + /rewind

Every agent action lands on a checkpoint timeline. When the zone join goes sideways — wrong key, mangled output — /rewind rolls the working tree and the conversation back to the checkpoint before the damage, and you retry from there at the cost of a keystroke. Sessions resume (claude --resume) and fork, so yesterday’s reconnaissance session can branch into today’s two competing approaches without either contaminating the other. Treat rewinding as a normal movement, not an emergency brake: cheap experiments require cheap undo.

Codex

Conversation forking + exec resume

Conversations fork: any point in a session can branch into an alternative line of attack while the original stays intact, which is the natural way to try two join strategies against each other. Headless runs continue with codex exec resume, so a long reconnaissance survives your laptop lid. File-state rollback, though, belongs to git — there is no working-tree time machine, so the commit cadence below is carrying more of the load here.

Underneath either tool, git is the shared substrate and the only undo that survives a power cut: commit at every working state, and let the agent write the commit messages — it was there, it knows what changed, and it is not embarrassed to write “fix the zone join, again.”

Guided Run — Plan Before You Touch

Field Terminal — session: b2-plan-audit Claude Code
claude --permission-mode plan

Field Assignment

Artifact: make check-b2 passes

Build the home, fill it with verified raw data, and produce the approved audit plan — in that order, with nothing computed until step 4 signs off.

  1. Scaffold the repository as above; commit the empty structure with a README line per directory stating its contract.
  2. Pin the environment (uv or renv) and commit the lockfile.
  3. Run scripts/download_trips.sh for all 24 months; verify shasum -a 256 -c data/raw/SHA256SUMS passes on a second run.
  4. Produce docs/data-audit-plan.md by read-only reconnaissance — per your tool, below — and approve it deliberately, as the senior colleague you are standing in for.

Claude Code

Engage Plan mode and dispatch the data-audit brief from the Mechanics section. Read the proposed plan critically — does it name the join keys? does it say what each aggregation loses? — request one revision (there is always one), then approve. Only after approval, exit Plan mode.

Codex

Switch to the read-only approval mode, dispatch the explorer subagent with the data-audit brief from the Mechanics section, and hold the seam yourself: the plan gets your written sign-off in the document header before the session leaves read-only. Request one revision (there is always one), then sign.

make check-b2 verifies the scaffold, the lockfile, the checksum manifest, and the existence of the approved plan. The artifact feeds the whole midgame: C2’s contracts enforce what the plan promised, and C3 builds the panel the plan specified.
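The kit's own Makefile target is not reproduced here. As a rough sketch of what a gate like this might check, assuming the scaffold paths from this lesson and an approval recorded as a line beginning with "Approved" in the plan (an assumed convention, as are the names `check_b2`, `REQUIRED_DIRS`, and `LOCKFILES`):

```python
from pathlib import Path

REQUIRED_DIRS = ["data/raw", "data/processed", "src", "scripts",
                 "notebooks", "docs", "results", "report", "journal"]
LOCKFILES = ["uv.lock", "renv.lock"]  # either pinned environment counts


def check_b2(root: Path) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = [f"missing directory: {d}" for d in REQUIRED_DIRS
                if not (root / d).is_dir()]
    if not any((root / name).is_file() for name in LOCKFILES):
        failures.append("no lockfile: expected uv.lock or renv.lock")
    manifest = root / "data/raw/SHA256SUMS"
    if not manifest.is_file() or manifest.stat().st_size == 0:
        failures.append("missing or empty data/raw/SHA256SUMS")
    plan = root / "docs/data-audit-plan.md"
    if not plan.is_file():
        failures.append("missing docs/data-audit-plan.md")
    else:
        lines = plan.read_text(encoding="utf-8").splitlines()
        # Assumed convention: sign-off is a line that starts with "Approved".
        if not any(line.strip().lower().startswith("approved") for line in lines):
            failures.append("docs/data-audit-plan.md carries no approval note")
    return failures
```

A check of this shape confirms only that the artifacts exist. Whether the plan names the right join keys is still a judgment that belongs to the person who approves it.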

Milestone gate · make check-b2 · advances B2
  1. data/processed/ and results/ must be rebuildable from raw by code in the repo.

  2. The manifest is how you learn when TLC silently re-publishes a month.

  3. Joins with keys and expected cardinality; aggregations with what each loses; timezone policy.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Letting the agent “organize” a messy folder without a plan. It will — at speed, with confidence, and with its own opinions about what final_v2 meant. Reconnaissance and a written plan first; file moves are transforms too.

  • [both]

    Unpinned environments mean nothing reproduces next month — a dependency’s minor release changes a default, and an estimate moves with no code change anywhere. For a report, that is not an inconvenience; it is a correction waiting to be written. Commit the lockfile and rebuild from it, every time.

  • [CC]

    Treating rollback as emergency-only. If undoing costs a keystroke, risky experiments cost nothing and you will run more of them; a rollback you are reluctant to use is a tax on exactly the exploration this stage needs.

  • [both]

    Downloads without checksums. TLC re-publishes months with revised rows, quietly; without a manifest, your collaborator’s “same data” and yours can differ and neither of you will know until the estimates disagree. The checksum line is one command and it is the difference between “the data changed” being a fact and a suspicion.

Check Your Bearings

B2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

as of June 2026

Parity note

Plan mode is genuinely one-sided: Claude Code holds the read-only reconnaissance and the plan-approval gate as a first-class mode with dedicated Explore and Plan agents, while Codex reaches the same discipline by configuration — a read-only approval mode plus its explorer subagent — with the approval seam held by your checklist rather than the tool. On undo, the asymmetry runs the same direction at smaller scale: working-tree checkpoints with rewind on one side, conversation forking plus git on the other. Sessions persist and resume on both.

Ledger — B2

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading
    agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do
    agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes
    agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will
    agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches
    agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will
    agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for
    agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped
    agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission
    agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall — the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work
    agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster — but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong
    agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes — the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down
    agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots
    agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time
    agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over
    agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated — the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits
    agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall — the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster — but only if you direct the work | the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes — the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon — most of it download wall-clock, not attention
B3 | the data manager who guards the raw files — the person who says no near the master copies | permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist — the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance — est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool — and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch — and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated — the PI who designs the system instead of doing the work | each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.