A2 beginner ~30 min

Directing the Work

Advances A2

The Pain

The first hour was electric. You asked a question, the assistant ran a query, you asked another, and for a while it felt like the work was doing itself. Then, somewhere around the fortieth exchange, the answers started to drift. You asked for the count of zero-fare trips and got a number that did not match the one from twenty minutes earlier — same file, same month, different answer, and no error anywhere to explain it. You scrolled up and could not find where the session had gone wrong, because nothing had gone wrong, exactly; it had quietly accumulated. Half-finished tangents about timezones, a plot you abandoned, three competing definitions of “valid trip” that you had floated and never resolved, all of it still sitting in the conversation, all of it still being weighed.

This is the failure that has no traceback. An assistant that can do anything will, if you let it, do a little of everything, and a long unstructured session degrades the way a tired person does — slowly, plausibly, and without announcing it. The thing nobody tells you on day one is that the agent is only as good as the way you drive it. The work that came out of that fortieth exchange was not wrong because the model was weak. It was wrong because you had given it forty things to remember and no discipline about which of them still mattered, and then you trusted a number it produced without ever asking to see the query underneath. You were a passenger in your own analysis.

Why / When

Before you configure anything, you learn to drive. Three skills, none of them about the data and all of them about the session. The command surface: the handful of session controls you reach for daily — switching models, clearing or compacting the conversation, previewing an undo, seeing what the agent may do unasked. The prompting patterns: five habits that turn a vague request into a defensible one, named once here and used for the rest of the course. Context hygiene: understanding that the conversation is a finite working memory, and that a long session left untended degrades — so you reset it deliberately at the right seams.

These accelerate every stage that follows, because every stage is driven from a session. The lab role they absorb is none in particular — it is you, working an order of magnitude faster, but only if you direct the work instead of narrating at it. The skill is cheap to learn and expensive to skip: the researcher who never learns to compact a session pays for it in silently degraded answers on exactly the long investigations that matter most.

Contrary winds

Not for: a single self-contained question you can ask, answer, and close in one breath — driving discipline is for sessions that live long enough to drift, not for a one-line lookup.

Mechanics

Three things to learn, in order: the controls, the patterns, the hygiene. The controls and the image-input surface are dialect, so they live in tabs; the five patterns and the hygiene rule are shared — they read identically in both tools.

The command surface

A session is driven by two kinds of input: plain-language prompts to the agent, and slash-commands to the session itself — meta-controls that switch models, manage the conversation, and show you what is happening. Learn the daily handful. One distinction spans both tools and matters later: a built-in command ships in the box and is fixed; a skill is something your lab authors and invokes behind a slash of its own (C1 makes you write one). Same syntax, different origin.

Claude Code

The daily set in Claude Code, with what each is for:

session controls — the daily handful
/help          this list
/model         model + reasoning effort (match effort to stakes)
/clear         drop the conversation, start clean
/compact       summarize-and-continue: keep the findings, shed the noise
/rewind        checkpoint preview + restore (preview only until you confirm)
/permissions   what may run unasked
/resume        pick up an earlier session
/cost          this session's token ledger

Reach for @path to point the agent at a file in place — @data/raw/yellow_2024-03.parquet reads it without your pasting a byte. The two you will misuse first are /clear and /compact: /clear throws the whole conversation away (right between unrelated tasks, catastrophic mid-investigation), while /compact summarizes it and continues (right at a natural boundary inside one task). /rewind previews the session’s checkpoints and restores nothing until you confirm — the undo lane B2 stages in full.

Codex

The daily set in Codex, with what each is for (typed as slash-commands at the prompt; the command-tour run below shows the live surface):

session controls — the daily handful
model       model + reasoning effort (match effort to stakes)
approvals   switch approval mode (read-only · on-request · full-auto)
new         start a fresh session, drop the conversation
compact     summarize-and-continue: keep the findings, shed the noise
skills      list the skills available to this session
agent       dispatch a subagent (default · worker · explorer)
status      model · sandbox · approvals at a glance

Reach for @file to pull a file into the prompt in place — @data/raw/yellow_2024-03.parquet reads it without your pasting a byte. The two you will misuse first are new and compact: starting a fresh session drops everything (right between unrelated tasks, catastrophic mid-investigation), while compacting summarizes and continues (right at a natural boundary inside one task). The approvals control is where you tighten or loosen the consultation cadence; file-state undo lives in git here, not a built-in rewind preview — B2 carries that honest comparison.

Five prompting patterns

These are shared — the same five habits in either tool, named once here and relied on for the rest of the course. They are the difference between a request the agent can satisfy sloppily and one it cannot.

  1. Point at files, not pastes. @data/raw/yellow_2024-03.parquet, never forty pasted rows. Pasting burns context, loses provenance, and caps the agent at what you happened to copy.
  2. Demand artifacts. “Write it to journal/first-look.md,” not “tell me.” A finding that lives only in the scrollback is a finding you will lose at the next /compact.
  3. Make it show its work. “Show the query and the row count behind every claim.” This is the single most important habit in empirical work — and the seed of C2’s contracts, where showing your work stops being a courtesy and becomes enforced law.
  4. Scope the task. One month, one question. “The five worst problems in this file,” not “audit the dataset.” A scoped task finishes, reports, and leaves you something to check; an unscoped one wanders.
  5. Course-correct early. Interrupt the moment it heads wrong. A two-word correction at exchange three is free; the same correction at exchange forty means unwinding everything built on the detour. Interrupting beats sunk cost.

Pattern 3 is the one that carries the project, so make it concrete. When you demand “the row count behind every claim,” the receipt the agent writes back is real code in whichever language you work — the count for one of the worst-five probes, the same verdict either way:

Python

the receipt behind one claim — Python
import duckdb
n = duckdb.sql(
    "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet' "
    "WHERE passenger_count IS NULL"
).fetchone()[0]
print(f"passenger_count IS NULL: {n:,}") # passenger_count IS NULL: 426,190

This block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).

R

the receipt behind one claim — R
library(duckdb)
con <- dbConnect(duckdb())
n <- dbGetQuery(con, paste(
  "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet'",
  "WHERE passenger_count IS NULL"))[[1]]
cat(sprintf("passenger_count IS NULL: %s\n", format(n, big.mark = ",")))
# passenger_count IS NULL: 426,190

The query is the claim; the number is its receipt. The agent obliges in either language — A1’s policy, now load-bearing.
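
Patterns 2 and 3 can be combined mechanically: bundle each claim with its query and count, then append it to the journal. The following is a minimal sketch, not the course's own tooling. It assumes a hypothetical helper named receipt and a hypothetical journal entry layout, and it uses Python's built-in sqlite3 with a four-row toy table (trips_demo) as a stand-in for DuckDB over the real parquet file, so the counts it prints describe the toy rows only.

```python
import sqlite3
import tempfile
from pathlib import Path


def receipt(con, label, predicate, table="trips_demo"):
    """Run one probe and return the claim together with its evidence.

    Hypothetical helper: the name, arguments, and return shape are
    illustrative assumptions, not part of the course repo.
    """
    sql = f"SELECT count(*) FROM {table} WHERE {predicate}"
    hits = con.execute(sql).fetchone()[0]
    total = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    return {"label": label, "sql": sql, "hits": hits, "total": total}


def to_markdown(r):
    """Format one receipt as a journal entry (layout is an assumption)."""
    return (
        f"- {r['label']}: {r['hits']:,} of {r['total']:,} rows\n"
        f"  query: {r['sql']}\n"
    )


# Toy stand-in data: four synthetic rows, two with a missing passenger count.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips_demo (passenger_count INTEGER, fare_amount REAL)")
con.executemany(
    "INSERT INTO trips_demo VALUES (?, ?)",
    [(1, 12.5), (None, 8.0), (0, 5.0), (None, -3.0)],
)

r = receipt(con, "null_passengers", "passenger_count IS NULL")
entry = to_markdown(r)

# Append to a journal file; a temp directory stands in for journal/.
journal = Path(tempfile.mkdtemp()) / "first-look.md"
with journal.open("a", encoding="utf-8") as fh:
    fh.write(entry)

print(entry)
```

The design choice worth copying is that the count never travels without its SQL and its denominator, so any line in the journal can be re-run as written.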

Context hygiene

The conversation is a finite working memory — the context window — and everything in it is weighed on every turn. Two consequences. First, a long unstructured session degrades: the tangents, the abandoned plots, the three competing definitions of “valid trip” all dilute the agent’s attention, which is the silent drift from the Pain vignette. Second, you control the working memory with exactly two moves, and choosing between them is the whole skill:

  • Clear the session (/clear, or /new) between unrelated tasks — zero carry-over, a deliberately blank slate. The right call when you switch from cleaning to plotting; the wrong call mid-investigation, because it throws away the table you still need.
  • Compact the session (/compact) at a natural boundary inside one task — it summarizes the conversation, pins the findings, and sheds the raw scan transcripts, then continues. The right call when one long task has bloated the context but its conclusions still matter.

Long EDA sessions are where this bites hardest, and the proper fix — dispatching the noisy survey work to a subagent in its own context — waits for D1. Until then, compact at the seams and clear between tasks.
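
The clear-versus-compact rule above can be written down as a tiny decision function. This is only a sketch of the rule as stated in this lesson: the function name, the two boolean inputs, and the "continue" outcome are illustrative assumptions, and deciding whether the context is "bloated" remains your judgment call, not something either tool reports.

```python
def hygiene_move(switching_task: bool, context_bloated: bool) -> str:
    """Pick a context-hygiene move for the next turn.

    Hypothetical helper encoding the lesson's rule:
    clear between unrelated tasks, compact inside one task.
    """
    if switching_task:
        # Unrelated work next: start from a blank slate.
        return "clear"
    if context_bloated:
        # Same task, noisy history: keep conclusions, shed transcripts.
        return "compact"
    # Same task, healthy context: just keep working.
    return "continue"


# Moving from cleaning to plotting: nothing should carry over.
print(hygiene_move(switching_task=True, context_bloated=True))
# Mid-investigation after a long scan: keep the findings.
print(hygiene_move(switching_task=False, context_bloated=True))
```

Note the order of the checks: a task switch wins even when the context is bloated, because a summary of the old task is still noise for the new one.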

Image input

Data work is not all text. A plot the agent drew, a screenshot of a table from a paper you are replicating, a figure from a referee report — all of it is prompt material, and both tools accept it. The surface is dialect:

Claude Code

Paste an image straight into the prompt, or point at one by path the same way you point at data: “Compare my plot.png to @figures/published_demand.png. Do the peak hours line up?” The agent reads both images and answers about what it sees, which is the fastest way to sanity-check a replication against a published figure.

Codex

Attach an image to the prompt as input, or reference a saved one by path: “Compare my plot.png to @figures/published_demand.png. Do the peak hours line up?” Codex also reads UI and web state directly through its in-app browser and Appshots, but for static figures the image-input path is the everyday tool, and the fastest way to check a replication against a published plot.

Guided Run — Directing the Work

Field Terminal — session: a2-command-tour Claude Code
claude

Field Assignment

Artifact journal/first-look.md exists with five problems, five queries, five row counts

Run a real first look at one month of the data, and produce the first entry in the discipline that runs to the end of the course. You will use all five patterns and at least one hygiene move without being told which.

Point your tool at a single month — data/raw/yellow_2024-03.parquet, 3,582,628 rows — and direct it through a first-look session:

Claude Code

  1. Launch in the starter repo. Check /model and leave it at the default for this scope.
  2. Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
  3. When the scan has bloated the context, /compact — keep the worst-five table, shed the raw transcripts.
  4. Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.

Codex

  1. Launch in the starter repo. Check the model control and leave it at the default for this scope.
  2. Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
  3. When the scan has bloated the context, compact the session — keep the worst-five table, shed the raw transcripts.
  4. Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.

A correct first look counts the problems in the tens and hundreds of thousands, not the dozens, and it is missing passenger counts that dominate the month, not the dramatic negative fares:

The five worst things in one month of taxi data
Top data-quality problems in yellow_tripdata_2024-03, counted by direct SQL: missing passenger counts (426,190), zero-distance paid trips (70,452), negative fares (58,464), zero-passenger trips (40,372), undocumented rate code 99 (37,930).
the numbers behind this figure

data window 2024-02, 2024-03, 2024-06 (yellow taxi; local time America/New_York)

generated by figures-pipeline/src/figures.py · a2-worst-five

null_passengers count = 426,190

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND passenger_count IS NULL

zero_distance_paid count = 70,452

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND trip_distance = 0 AND fare_amount > 5

negative_fare count = 58,464

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND fare_amount < 0

zero_passengers count = 40,372

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND passenger_count = 0

ratecode_99 count = 37,930

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND RatecodeID = 99

speed_over_65 count = 1,340

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_dropoff_datetime > tpep_pickup_datetime AND trip_distance / (epoch(tpep_dropoff_datetime - tpep_pickup_datetime)/3600.0) > 65

zero_or_negative_duration count = 1,128

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_dropoff_datetime <= tpep_pickup_datetime

misdated_pickup count = 23

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND strftime(tpep_pickup_datetime,'%Y-%m') <> '2024-03'

phantom_dst_hour count = 0

SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_pickup_datetime >= TIMESTAMP '2024-03-10 02:00' AND tpep_pickup_datetime <  TIMESTAMP '2024-03-10 03:00'

honesty note All nine probes shipped (including the two that found little: 23 misdated rows, 0 phantom-DST trips); the figure shows the top five by count.
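
The selection step described in the honesty note (run all nine probes, report the top five by count) can be sketched as below. The counts and the 3,582,628-row total are copied from this lesson; the dictionary layout and the function name top_problems are illustrative assumptions, not the figures pipeline's actual code.

```python
# Probe counts as listed above for file_month = '2024-03'.
PROBES = {
    "null_passengers": 426_190,
    "zero_distance_paid": 70_452,
    "negative_fare": 58_464,
    "zero_passengers": 40_372,
    "ratecode_99": 37_930,
    "speed_over_65": 1_340,
    "zero_or_negative_duration": 1_128,
    "misdated_pickup": 23,
    "phantom_dst_hour": 0,
}
TOTAL_ROWS = 3_582_628  # rows in the month, as stated in the assignment


def top_problems(counts, k=5):
    """Return the k largest probes as (name, count) pairs, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Print each top problem with its count and its share of the month.
for name, n in top_problems(PROBES):
    print(f"{name:<28}{n:>9,}  {n / TOTAL_ROWS:6.2%}")
```

Keeping all nine probes in the dictionary, including the two that found little, is what lets a reader verify that the top five were selected by count and not by drama.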

The artifact is journal/first-look.md — five problems, five queries, five row counts. This opens the journal discipline: from here on, every lesson ends by logging incidents, decisions, and costs to journal/, and F1 totals what your lab actually did. The first look feeds B1, where the worst of these problems become rules in the lab manual.

Milestone gate · make check-a2 · advances A2
  1. Missing passenger counts dominate the month (426,190), not the dramatic negative fares.

  2. This is the seed of C2's contracts; 'show your work' starts here as a refused shortcut.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    Accepting a claim about data quality without the query and row count behind it. “About forty thousand bad rows” is a vibe; “40,372, predicate passenger_count = 0, over 3,582,628 rows” is a fact you can re-run. Demand the receipt every time — the habit that becomes C2’s law starts as a sentence you refuse to skip.

  • [both]

    Never resetting the context between unrelated explorations. The answers degrade silently — no error, just a count that no longer matches the one from twenty minutes ago — and you will trust the degraded one because nothing flagged it. /clear between tasks; /compact at boundaries.

  • [both]

    Confusing clear with compact. /clear (or /new) throws the conversation away; /compact keeps its conclusions. Reach for the wrong one mid-investigation and you either drag forty exchanges of noise forward or lose the table you needed. The rule: clear between tasks, compact inside one.

  • [both]

    Letting a wrong turn run on sunk-cost momentum. The agent will build confidently on a bad early assumption; a two-word interrupt at exchange three is free, and the same correction at exchange forty costs you the detour and everything stacked on it. Course-correct early.

Check Your Bearings

A2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

Field journal

as of June 2026

Parity note

Session controls are near-parity with honest naming differences. Both tools expose model and reasoning-effort selection, a compact-and-continue move, a fresh-start move, an approval surface, and @-file references — but the vocabulary diverges (/clear vs /new, a built-in /rewind preview on one side against git-based file undo on the other) and the conversation-management seams sit in slightly different places. The five prompting patterns and the clear-versus-compact hygiene rule are fully shared: they are habits, not syntax, and they read the same in either tool. Image input is parity for static figures; Codex additionally reads live UI through its in-app browser, which Claude Code reaches via the Playwright MCP rather than natively.

Ledger — A2

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance, est. 2 weeks/year of load-checking and release-note reading
    agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do
    agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source (credentials, retries, schema spelunking, timezone forensics), re-debugged every time a source changes
    agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will
    agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch, and the course runs several batches
    agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will
    agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for
    agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns (check the CDN, pull, contract, append, re-estimate), reliably skipped
    agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks: dull, exacting, and the first thing dropped at submission
    agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall: the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file: defensive parsing rewritten from scratch each project, rules forgotten by the time they work
    agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster, but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session: drifted answers on long investigations, re-runs to find where it went wrong
    agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes: the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time, plus the afternoons lost to landmines no one wrote down
    agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots
    agent: an afternoon, most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits, and the redo when you reconstruct it wrong the first time
    agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over
    agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated: the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain (re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract), correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits
    agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall: the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file: defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster, but only if you direct the work | the slow tax of an undriven session: drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes: the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time, plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon, most of it download wall-clock, not attention
B3 | the data manager who guards the raw files: the person who says no near the master copies | permanent vigilance you cannot staff: one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist: the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance, est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source (credentials, retries, schema spelunking, timezone forensics), re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool, and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits, and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch, and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks: dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns (check the CDN, pull, contract, append, re-estimate), reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated: the PI who designs the system instead of doing the work | each revision is a serialized chain (re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract), correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.