B2 beginner ~45 min

A Reproducible Home

Absorbs: the careful senior who plans before touching data

Advances B2

The Pain

Every research project that dies of disorganization dies the same way, and it starts innocently: analysis_v2_final.py next to analysis_v2_final_REAL.py, a temp/ directory that is four months old, raw files edited “only that once” to fix an encoding, and a notebook whose cells run correctly in exactly one order that nobody wrote down. You know this repo. You may, somewhere on a backup drive, own this repo.

The second act is quieter and more expensive. Week two, eager for a first result, you joined trips to zones on a column that was almost, but not exactly, the key — LocationID against a lookup that had been deduplicated differently — and the join was wrong by a few hundred rows out of millions. Nothing crashed; the panel built; results accumulated on top. By the time the discrepancy surfaced, fourteen downstream files assumed the bad join, and “fix it” meant “find everything that ever touched it.” A senior colleague would have stopped you on day one: structure first, then a written plan of every join and aggregation, reviewed before anything is computed — because aggregations are lossy, queries cost money, and a wrong early join poisons everything downstream. That careful senior is a luxury hire. You are about to add a collaborator who works at machine speed and amplifies whatever structure exists — including none.

Why / When

This lesson sets up the three disciplines that make agent mistakes cheap, in rising order of subtlety. Structure: a repository layout where every path means something, so an assistant — human or machine — can navigate it without folklore. Plan-first: a read-only reconnaissance of the data, producing a written audit plan that you approve before any transform runs; the cheapest moment to catch the wrong join is before it executes. Undo: sessions, checkpoints, and cheap rollback, so that experiments cost minutes instead of forensics. Together they absorb the careful senior who plans before touching data, and they accelerate the project-setup and data-acquisition stages — the unglamorous fifth of the project that determines whether the other four-fifths reproduce.

An agent multiplies whatever it finds: in a structured repo it re-runs pipelines and respects fences; in temp/-and-versions soup it generates more soup, faster than you ever could alone.

Contrary winds

Not for: a one-afternoon scratch analysis of data you may never touch again — though count, honestly, how many of those became Chapter 2.

Mechanics

Three disciplines, in order: a navigable home, the plan-first habit, and cheap undo.

A repo agents can navigate

The scaffold is the contract every later lesson assumes — B3 fences it, C2 enforces it, D4 writes results into it:

weather-mobility/
├── data/
│   ├── raw/            # append-only; never edited (convention now, law in B3/C2)
│   └── processed/      # rebuilt from raw by scripts — always disposable
├── src/                # transforms and estimation — the only code that merges
├── scripts/            # download, checksum, validation — plain shell lives here
├── notebooks/          # exploration; nothing load-bearing may live here
├── docs/               # the data-audit plan, design memos
├── results/            # written by runs, owned by contracts (D4)
├── report/             # the report skeleton
├── journal/            # incidents, decisions, costs — F1's raw material
└── Makefile            # make check-b2 … make check-f1

Two conventions carry most of the weight. data/raw/ is append-only: files arrive by script, are never edited, and are the only thing that cannot be regenerated — everything in data/processed/ and results/ must be rebuildable from raw by code in the repo. And notebooks/ is quarantine: the moment an exploration matters, it graduates to src/.
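The layout can also be created and checked mechanically, which is useful once an agent starts moving files around. The following is a minimal sketch: the directory names are copied from the tree above, but `missing_dirs` and `create_scaffold` are hypothetical helpers, not part of the course kit.

```python
from pathlib import Path

# Directory names copied from the layout above. The helpers themselves are
# illustrative assumptions, not something the course kit ships.
EXPECTED_DIRS = [
    "data/raw", "data/processed", "src", "scripts",
    "notebooks", "docs", "results", "report", "journal",
]

def missing_dirs(root: str) -> list[str]:
    """Return the expected directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]

def create_scaffold(root: str) -> None:
    """Create every expected directory; existing ones are left untouched."""
    for d in EXPECTED_DIRS:
        (Path(root) / d).mkdir(parents=True, exist_ok=True)
```

Git does not track empty directories, so a placeholder file in each one (for example the README contract line the Field Assignment asks for) is what makes the structure survive a clone.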

Pin the environment, because agents re-run everything, often — an unpinned dependency is a result that changes when nobody changed anything:

Python

uv init --python 3.12
uv add duckdb pyarrow pandas
# uv.lock now pins the exact tree; commit it.
# Any rerun — yours, the agent's, a referee's — goes through:
uv run python src/build_panel.py

This block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).

R

renv::init()
renv::install(c("duckdb", "arrow", "data.table"))
renv::snapshot()   # renv.lock now pins the exact tree; commit it.
# Any rerun — yours, the agent's, a referee's — starts from:
renv::restore()

Raw data by plain shell

Twenty-four months of yellow and green trips arrive by the simplest tool that works — a download script and a checksum manifest. (To just have the fixed course slice on disk, run the kit’s python3 get_data.py — see Get the data; here we build the fetch ourselves to learn the pattern.) This is a deliberate lesson: fetching files needs no protocol, no server, no integration; richer plumbing earns its keep in C3, when the agent needs live systems, not files.
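The download script reads its months from scripts/months.txt, one YYYY-MM label per line, covering 2023-01 through 2024-12. The lesson does not show how that file is produced, so here is one possible way, sketched in Python; the function name and the script name in the usage note are assumptions.

```python
def month_labels(start_year: int, start_month: int, count: int) -> list[str]:
    """Return `count` consecutive YYYY-MM labels starting at the given month."""
    labels = []
    for i in range(count):
        # Count months from zero so the year rolls over cleanly.
        total = start_year * 12 + (start_month - 1) + i
        labels.append(f"{total // 12:04d}-{total % 12 + 1:02d}")
    return labels

if __name__ == "__main__":
    # 24 labels, 2023-01 through 2024-12. print() ends the output with a
    # newline, so the shell's `read` loop also sees the last line.
    print("\n".join(month_labels(2023, 1, 24)))
```

Saved under a hypothetical name such as scripts/make_months.py, it would be run once as `uv run python scripts/make_months.py > scripts/months.txt`, and the resulting file committed.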

#!/usr/bin/env bash
set -euo pipefail
BASE=https://d37ci6vzurychx.cloudfront.net/trip-data

while read -r ym; do                       # months.txt: 2023-01 … 2024-12
  for color in yellow green; do
    out="data/raw/${color}_${ym}.parquet"
    # Download to a temp name and move only on success: an interrupted
    # curl must never leave a truncated file for the checksum manifest
    # to canonize.
    [ -f "$out" ] || {
      curl -fsSL -o "${out}.tmp" \
        "${BASE}/${color}_tripdata_${ym}.parquet"
      mv "${out}.tmp" "$out"
    }
  done
done < scripts/months.txt

shasum -a 256 data/raw/*.parquet > data/raw/SHA256SUMS

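The manifest is the point, not the download. TLC re-publishes months silently, with revised rows, and the script never re-fetches a file that already exists, so your local copy stays frozen at the version you hashed. Running shasum -a 256 -c data/raw/SHA256SUMS from the repo root (where the manifest’s paths resolve) confirms that nothing on disk has drifted, and checking any fresh download against the same manifest is how you find out about a revision on your terms instead of mid-revision. The checksums are also the data-version hash that C2’s session briefing will print at the start of every session.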
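The same verification can be done from Python, for example inside a validation script that runs before a panel build. This is a sketch that assumes the manifest uses the usual shasum layout (hex digest, whitespace, path relative to the repo root); `sha256_of` and `verify_manifest` are hypothetical names.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large parquet file never sits in memory whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: Path, root: Path) -> list[str]:
    """Return manifest paths that are missing or whose hash no longer matches."""
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum prefixes binary-mode entries with '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(name)
    return failures
```

An empty list means every file still matches the manifest; any entry in the list is a file to investigate before anything downstream is rebuilt.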

Plan before you touch data

Now the discipline this page is named for. The first pass over raw data should be read-only reconnaissance ending in a written plan: which joins, on which keys, validated how; which aggregations, and what each one loses; what gets checked before anything is computed. One tool makes this a first-class mode.

Claude Code Your tool

Plan mode + the Explore and Plan agents

Plan mode is a switchable agent state in which nothing is written: the agent may read files, run read-only commands, and think, but every mutating action is off the table until you approve a written plan. Engage it and hand over the reconnaissance:

> (plan mode) Survey data/raw/: schemas of all 48 parquet files and the
  zone lookup. Propose docs/data-audit-plan.md covering: every join in
  the panel build (keys, expected cardinality, how validated), every
  aggregation (what is lost at each), null-rate and range checks per
  column, and the timezone policy for all timestamp joins. Flag the
  three joins most likely to go wrong.

The built-in Explore agent does the breadth work — schema by schema, in a separate context so the survey’s noise never pollutes the main session — and the Plan agent assembles the plan for your approval. The output is a document, and that is the point: the wrong early join from the Pain vignette is caught at the cost of reading two pages, not at the cost of fourteen downstream files. You approve the plan; only then does anything execute.
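To make "keys, expected cardinality, how validated" concrete, here is the kind of check an audit plan might specify for a trips-to-zones join. It is a sketch under stated assumptions: the function name is hypothetical, the sample key values are made up, and a real audit would read the keys from the parquet files rather than from Python lists.

```python
from collections import Counter

def audit_join_keys(left_keys: list, right_keys: list) -> dict:
    """Summarize what a many-to-one join from left onto right would do.

    left_keys:  key values on the fact side (one per trip, say)
    right_keys: key values on the lookup side (expected to be unique)
    """
    right_counts = Counter(right_keys)
    duplicated = sorted(k for k, n in right_counts.items() if n > 1)
    unmatched = sum(1 for k in left_keys if k not in right_counts)
    # Rows an inner join would emit: each left row once per matching right row.
    inner_rows = sum(right_counts[k] for k in left_keys if k in right_counts)
    return {
        "left_rows": len(left_keys),
        "inner_join_rows": inner_rows,
        "duplicated_lookup_keys": duplicated,
        "unmatched_left_rows": unmatched,
    }

# Made-up keys: the lookup repeats key 2 and has no entry for key 99.
report = audit_join_keys([1, 1, 2, 3, 99], [1, 2, 2, 3])
```

In this made-up case the inner join emits five rows from five input rows, because one dropped row and one duplicated row cancel out. A row-count check alone would pass, which is why the plan should name key uniqueness and unmatched rows as separate checks.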

Nearest equivalent — Codex

The same discipline assembles from two pieces: switch the session to the read-only approval mode (the sandbox refuses writes, so the reconnaissance physically cannot mutate anything) and dispatch the built-in explorer subagent to do the survey, with the same brief and the same mandated deliverable, docs/data-audit-plan.md. What you lose against the native mode is the seam: leaving read-only is a settings change you make rather than an approval gate the tool holds you to, so the “nothing executes before the plan is approved” rule is your habit, not the machine’s. Write the plan-approval step into the project checklist and the habit holds.

Watch this space (as of 2026-06): read-only planning is the most-requested convergence point on both roadmaps; recheck quarterly.

Sessions, checkpoints, undo

The third discipline is making mistakes cheap after they happen. Both tools persist sessions; the rollback primitives differ.

Claude Code Your tool

Checkpoints + /rewind

Every agent action lands on a checkpoint timeline. When the zone join goes sideways — wrong key, mangled output — /rewind rolls the working tree and the conversation back to the checkpoint before the damage, and you retry from there at the cost of a keystroke. Sessions resume (claude --resume) and fork, so yesterday’s reconnaissance session can branch into today’s two competing approaches without either contaminating the other. Treat rewinding as a normal movement, not an emergency brake: cheap experiments require cheap undo.

Codex Your tool

Conversation forking + exec resume

Conversations fork: any point in a session can branch into an alternative line of attack while the original stays intact, which is the natural way to try two join strategies against each other. Headless runs continue with codex exec resume, so a long reconnaissance survives your laptop lid. File-state rollback, though, belongs to git — there is no working-tree time machine, so the commit cadence below is carrying more of the load here.

Underneath either tool, git is the shared substrate and the only undo that survives a power cut: commit at every working state, and let the agent write the commit messages — it was there, it knows what changed, and it is not embarrassed to write “fix the zone join, again.”

Guided Run — Plan Before You Touch

Field Terminal — session: b2-plan-audit Claude Code

claude --permission-mode plan

Field Assignment

Artifact make check-b2 passes

Build the home, fill it with verified raw data, and produce the approved audit plan — in that order, with nothing computed until step 4 signs off.

1. Scaffold the repository as above; commit the empty structure with a README line per directory stating its contract.
2. Pin the environment (uv or renv) and commit the lockfile.
3. Run scripts/download_trips.sh for all 24 months; verify shasum -a 256 -c data/raw/SHA256SUMS passes on a second run.
4. Produce docs/data-audit-plan.md by read-only reconnaissance (per your tool, below) and approve it deliberately, as the senior colleague you are standing in for.

Claude Code

Engage Plan mode and dispatch the data-audit brief from the Mechanics section. Read the proposed plan critically — does it name the join keys? does it say what each aggregation loses? — request one revision (there is always one), then approve. Only after approval, exit Plan mode.

Codex

Switch to the read-only approval mode, dispatch the explorer subagent with the data-audit brief from the Mechanics section, and hold the seam yourself: the plan gets your written sign-off in the document header before the session leaves read-only. Request one revision (there is always one), then sign.

make check-b2 verifies the scaffold, the lockfile, the checksum manifest, and the existence of the approved plan. The artifact feeds the whole midgame: C2’s contracts enforce what the plan promised, and C3 builds the panel the plan specified.

Milestone gate · make check-b2 · advances B2

Repo scaffold committed — data/raw append-only by convention, one contract line per directory
data/processed and results/ must be rebuildable from raw by code in the repo.
Environment pinned and the lockfile committed (uv.lock or renv.lock)
All 24 months of yellow + green trips downloaded by plain shell
shasum -a 256 -c data/raw/SHA256SUMS passes on a re-run
The manifest is how you learn when TLC silently re-publishes a month.
docs/data-audit-plan.md written from read-only reconnaissance and approved before any transform ran
Joins with keys and expected cardinality; aggregations with what each loses; timezone policy.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both]

Letting the agent “organize” a messy folder without a plan. It will — at speed, with confidence, and with its own opinions about what final_v2 meant. Reconnaissance and a written plan first; file moves are transforms too.
[both]

Unpinned environments mean nothing reproduces next month — a dependency’s minor release changes a default, and an estimate moves with no code change anywhere. For a report, that is not an inconvenience; it is a correction waiting to be written. Commit the lockfile and rebuild from it, every time.
[CC]

Treating rollback as emergency-only. If undoing costs a keystroke, risky experiments cost nothing and you will run more of them; a rollback you are reluctant to use is a tax on exactly the exploration this stage needs.
[both]

Downloads without checksums. TLC re-publishes months with revised rows, quietly; without a manifest, your collaborator’s “same data” and yours can differ and neither of you will know until the estimates disagree. The checksum line is one command and it is the difference between “the data changed” being a fact and a suspicion.

Check Your Bearings

B2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

Record the riskiest join named in your audit plan, the revision you requested, and the moment you approved it — before any data was touched.

Parity note (as of June 2026)

Plan mode is genuinely one-sided: Claude Code holds the read-only reconnaissance and the plan-approval gate as a first-class mode with dedicated Explore and Plan agents, while Codex reaches the same discipline by configuration — a read-only approval mode plus its explorer subagent — with the approval seam held by your checklist rather than the tool. On undo, the asymmetry runs the same direction at smaller scale: working-tree checkpoints with rewind on one side, conversation forking plus git on the other. Sessions persist and resume on both.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall: the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster, but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes: the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated: the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.
