The Pain
Every research project that dies of disorganization dies the same way,
and it starts innocently: analysis_v2_final.py next to
analysis_v2_final_REAL.py, a temp/ directory that is four months
old, raw files edited “only that once” to fix an encoding, and a
notebook whose cells run correctly in exactly one order that nobody
wrote down. You know this repo. You may, somewhere on a backup drive,
own this repo.
The second act is quieter and more expensive. Week two, eager for a
first result, you joined trips to zones on a column that was almost,
but not exactly, the key — LocationID against a lookup that had been
deduplicated differently — and the join was wrong by a few hundred
rows out of millions. Nothing crashed; the panel built; results
accumulated on top. By the time the discrepancy surfaced, fourteen
downstream files assumed the bad join, and “fix it” meant “find
everything that ever touched it.” A senior colleague would have
stopped you on day one: structure first, then a written plan of every
join and aggregation, reviewed before anything is computed —
because aggregations are lossy, queries cost money, and a wrong early
join poisons everything downstream. That careful senior is a luxury
hire. You are about to add a collaborator who works at machine speed
and amplifies whatever structure exists — including none.
Why / When
This lesson sets up the three disciplines that make agent mistakes cheap, in rising order of subtlety. Structure: a repository layout where every path means something, so an assistant — human or machine — can navigate it without folklore. Plan-first: a read-only reconnaissance of the data, producing a written audit plan that you approve before any transform runs; the cheapest moment to catch the wrong join is before it executes. Undo: sessions, checkpoints, and cheap rollback, so that experiments cost minutes instead of forensics. Together they absorb the careful senior who plans before touching data, and they accelerate the project-setup and data-acquisition stages — the unglamorous fifth of the project that determines whether the other four-fifths reproduce.
An agent multiplies whatever it finds: in a structured repo it re-runs
pipelines and respects fences; in temp/-and-versions soup it
generates more soup, faster than you ever could alone.
Contrary winds
Not for: a one-afternoon scratch analysis of data you may never touch again — though count, honestly, how many of those became Chapter 2.
Mechanics
Three disciplines, in order: a navigable home, the plan-first habit, and cheap undo.
A repo agents can navigate
The scaffold is the contract every later lesson assumes — B3 fences it, C2 enforces it, D4 writes results into it:
weather-mobility/├── data/│ ├── raw/ # append-only; never edited (convention now, law in B3/C2)│ └── processed/ # rebuilt from raw by scripts — always disposable├── src/ # transforms and estimation — the only code that merges├── scripts/ # download, checksum, validation — plain shell lives here├── notebooks/ # exploration; nothing load-bearing may live here├── docs/ # the data-audit plan, design memos├── results/ # written by runs, owned by contracts (D4)├── report/ # the report skeleton├── journal/ # incidents, decisions, costs — F1's raw material└── Makefile # make check-b2 … make check-f1Two conventions carry most of the weight. data/raw/ is append-only:
files arrive by script, are never edited, and are the only thing that
cannot be regenerated — everything in data/processed/ and results/
must be rebuildable from raw by code in the repo. And notebooks/ is
quarantine: the moment an exploration matters, it graduates to src/.
Pin the environment, because agents re-run everything, often — an unpinned dependency is a result that changes when nobody changed anything:
Python
uv init --python 3.12uv add duckdb pyarrow pandas# uv.lock now pins the exact tree; commit it.# Any rerun — yours, the agent's, a referee's — goes through:uv run python src/build_panel.pyThis block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).
R
renv::init()renv::install(c("duckdb", "arrow", "data.table"))renv::snapshot() # renv.lock now pins the exact tree; commit it.# Any rerun — yours, the agent's, a referee's — starts from:renv::restore()Raw data by plain shell
Twenty-four months of yellow and green trips arrive by the simplest
tool that works — a download script and a checksum manifest. (To just
have the fixed course slice on disk, run the kit’s python3 get_data.py
— see Get the data; here we build the fetch ourselves to
learn the pattern.) This is a deliberate lesson: fetching files needs no
protocol, no server, no integration; richer plumbing earns its keep in
C3, when the agent needs live systems, not files.
#!/usr/bin/env bashset -euo pipefailBASE=https://d37ci6vzurychx.cloudfront.net/trip-data
while read -r ym; do # months.txt: 2023-01 … 2024-12 for color in yellow green; do out="data/raw/${color}_${ym}.parquet" # Download to a temp name and move only on success: an interrupted # curl must never leave a truncated file for the checksum manifest # to canonize. [ -f "$out" ] || { curl -fsSL -o "${out}.tmp" \ "${BASE}/${color}_tripdata_${ym}.parquet" mv "${out}.tmp" "$out" } donedone < scripts/months.txt
shasum -a 256 data/raw/*.parquet > data/raw/SHA256SUMSThe manifest is the point, not the download. TLC re-publishes months —
silently, with revised rows — and shasum -a 256 -c data/raw/SHA256SUMS
(from the repo root, where the manifest’s paths resolve) is how you find
out on your terms instead of mid-revision. The checksums are
also the data-version hash that C2’s session briefing will print at
the start of every session.
Plan before you touch data
Now the discipline this page is named for. The first pass over raw data should be read-only reconnaissance ending in a written plan — which joins, on which keys, validated how; which aggregations, with what lost; what gets checked before anything is computed. One tool makes this a first-class mode.
Claude Code Your tool
Plan mode + the Explore and Plan agents
Plan mode is a switchable agent state in which nothing is written: the agent may read files, run read-only commands, and think, but every mutating action is off the table until you approve a written plan. Engage it and hand over the reconnaissance:
> (plan mode) Survey data/raw/: schemas of all 48 parquet files and the zone lookup. Propose docs/data-audit-plan.md covering: every join in the panel build (keys, expected cardinality, how validated), every aggregation (what is lost at each), null-rate and range checks per column, and the timezone policy for all timestamp joins. Flag the three joins most likely to go wrong.The built-in Explore agent does the breadth work — schema by schema, in a separate context so the survey’s noise never pollutes the main session — and the Plan agent assembles the plan for your approval. The output is a document, and that is the point: the wrong early join from the Pain vignette is caught at the cost of reading two pages, not at the cost of fourteen downstream files. You approve the plan; only then does anything execute.
Nearest equivalent — Codex
The same discipline assembles from two pieces: switch the session to
the read-only approval mode (the sandbox refuses writes, so the
reconnaissance physically cannot mutate anything) and dispatch the
built-in explorer subagent to do the survey, with the same brief
and the same mandated deliverable, docs/data-audit-plan.md. What you
lose against the native mode is the seam: leaving read-only is a
settings change you make rather than an approval gate the tool holds
you to, so the “nothing executes before the plan is approved” rule is
your habit, not the machine’s. Write the plan-approval step into the
project checklist and the habit holds.
Sessions, checkpoints, undo
The third discipline is making mistakes cheap after they happen. Both tools persist sessions; the rollback primitives differ.
Claude Code Your tool
Checkpoints + /rewind
Every agent action lands on a checkpoint timeline. When the zone join
goes sideways — wrong key, mangled output — /rewind rolls the
working tree and the conversation back to the checkpoint before the
damage, and you retry from there at the cost of a keystroke. Sessions
resume (claude --resume) and fork, so yesterday’s reconnaissance
session can branch into today’s two competing approaches without
either contaminating the other. Treat rewinding as a normal movement,
not an emergency brake: cheap experiments require cheap undo.
Codex Your tool
Conversation forking + exec resume
Conversations fork: any point in a session can branch into an
alternative line of attack while the original stays intact, which is
the natural way to try two join strategies against each other.
Headless runs continue with codex exec resume, so a long
reconnaissance survives your laptop lid. File-state rollback, though,
belongs to git — there is no working-tree time machine, so the commit
cadence below is carrying more of the load here.
Underneath either tool, git is the shared substrate and the only undo that survives a power cut: commit at every working state, and let the agent write the commit messages — it was there, it knows what changed, and it is not embarrassed to write “fix the zone join, again.”
Guided Run — Plan Before You Touch
claude --permission-mode planField Assignment
Artifact make check-b2 passes
Build the home, fill it with verified raw data, and produce the approved audit plan — in that order, with nothing computed until step 4 signs off.
- Scaffold the repository as above; commit the empty structure with a README line per directory stating its contract.
- Pin the environment (
uvorrenv) and commit the lockfile. - Run
scripts/download_trips.shfor all 24 months; verifyshasum -a 256 -c data/raw/SHA256SUMSpasses on a second run. - Produce
docs/data-audit-plan.mdby read-only reconnaissance — per your tool, below — and approve it deliberately, as the senior colleague you are standing in for.
Claude Code
Engage Plan mode and dispatch the data-audit brief from the Mechanics section. Read the proposed plan critically — does it name the join keys? does it say what each aggregation loses? — request one revision (there is always one), then approve. Only after approval, exit Plan mode.
Codex
Switch to the read-only approval mode, dispatch the explorer subagent with the data-audit brief from the Mechanics section, and hold the seam yourself: the plan gets your written sign-off in the document header before the session leaves read-only. Request one revision (there is always one), then sign.
make check-b2 verifies the scaffold, the lockfile, the checksum
manifest, and the existence of the approved plan. The artifact feeds
the whole midgame: C2’s contracts enforce what the plan promised, and
C3 builds the panel the plan specified.
make check-b2advances B2data/processed and results/ must be rebuildable from raw by code in the repo.
The manifest is how you learn when TLC silently re-publishes a month.
Joins with keys and expected cardinality; aggregations with what each loses; timezone policy.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
Letting the agent “organize” a messy folder without a plan. It will — at speed, with confidence, and with its own opinions about what
final_v2meant. Reconnaissance and a written plan first; file moves are transforms too. - [both]
〜〜
Unpinned environments mean nothing reproduces next month — a dependency’s minor release changes a default, and an estimate moves with no code change anywhere. For a report, that is not an inconvenience; it is a correction waiting to be written. Commit the lockfile and rebuild from it, every time.
- [CC]
Treating rollback as emergency-only. If undoing costs a keystroke, risky experiments cost nothing and you will run more of them; a rollback you are reluctant to use is a tax on exactly the exploration this stage needs.
- [both]
〜〜
Downloads without checksums. TLC re-publishes months with revised rows, quietly; without a manifest, your collaborator’s “same data” and yours can differ and neither of you will know until the estimates disagree. The checksum line is one command and it is the difference between “the data changed” being a fact and a suspicion.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)Field journal
Parity note
Plan mode is genuinely one-sided: Claude Code holds the read-only reconnaissance and the plan-approval gate as a first-class mode with dedicated Explore and Plan agents, while Codex reaches the same discipline by configuration — a read-only approval mode plus its explorer subagent — with the approval seam held by your checklist rather than the tool. On undo, the asymmetry runs the same direction at smaller scale: working-tree checkpoints with rewind on one side, conversation forking plus git on the other. Sessions persist and resume on both.