The Pain
The first hour was electric. You asked a question, the assistant ran a query, you asked another, and for a while it felt like the work was doing itself. Then, somewhere around the fortieth exchange, the answers started to drift. You asked for the count of zero-fare trips and got a number that did not match the one from twenty minutes earlier — same file, same month, different answer, and no error anywhere to explain it. You scrolled up and could not find where the session had gone wrong, because nothing had gone wrong, exactly; it had quietly accumulated. Half-finished tangents about timezones, a plot you abandoned, three competing definitions of “valid trip” that you had floated and never resolved, all of it still sitting in the conversation, all of it still being weighed.
This is the failure that has no traceback. An assistant that can do anything will, if you let it, do a little of everything, and a long unstructured session degrades the way a tired person does — slowly, plausibly, and without announcing it. The thing nobody tells you on day one is that the agent is only as good as the way you drive it. The work that came out of that fortieth exchange was not wrong because the model was weak. It was wrong because you had given it forty things to remember and no discipline about which of them still mattered, and then you trusted a number it produced without ever asking to see the query underneath. You were a passenger in your own analysis.
Why / When
Before you configure anything, you learn to drive. Three skills, none of them about the data and all of them about the session. The command surface: the handful of session controls you reach for daily — switching models, clearing or compacting the conversation, previewing an undo, seeing what the agent may do unasked. The prompting patterns: five habits that turn a vague request into a defensible one, named once here and used for the rest of the course. Context hygiene: understanding that the conversation is a finite working memory, and that a long session left untended degrades — so you reset it deliberately at the right seams.
These accelerate every stage that follows, because every stage is driven from a session. The lab role they absorb is none in particular — it is you, working an order of magnitude faster, but only if you direct the work instead of narrating at it. The skill is cheap to learn and expensive to skip: the researcher who never learns to compact a session pays for it in silently degraded answers on exactly the long investigations that matter most.
Contrary winds
Not for: a single self-contained question you can ask, answer, and close in one breath — driving discipline is for sessions that live long enough to drift, not for a one-line lookup.
Mechanics
Three things to learn, in order: the controls, the patterns, the hygiene. The controls and the image-input surface are dialect, so they live in tabs; the five patterns and the hygiene rule are shared — they read identically in both tools.
The command surface
A session is driven by two kinds of input: plain-language prompts to the agent, and slash-commands to the session itself — meta-controls that switch models, manage the conversation, and show you what is happening. Learn the daily handful. One distinction spans both tools and matters later: a built-in command ships in the box and is fixed; a skill is something your lab authors and invokes behind a slash of its own (C1 makes you write one). Same syntax, different origin.
Claude Code
The daily set in Claude Code, with what each is for:
- /help: this list
- /model: model + reasoning effort (match effort to stakes)
- /clear: drop the conversation, start clean
- /compact: summarize-and-continue; keep the findings, shed the noise
- /rewind: checkpoint preview + restore (preview only until you confirm)
- /permissions: what may run unasked
- /resume: pick up an earlier session
- /cost: this session's token ledger

Reach for @path to point the agent at a file in place: @data/raw/yellow_2024-03.parquet
reads it without your pasting a byte. The two you will misuse first are
/clear and /compact: /clear throws the whole conversation away (right
between unrelated tasks, catastrophic mid-investigation), while /compact
summarizes it and continues (right at a natural boundary inside one task).
/rewind previews the session’s checkpoints and restores nothing until you
confirm — the undo lane B2 stages in full.
Codex
The daily set in Codex, with what each is for (typed as slash-commands at the prompt; the command-tour run below shows the live surface):
- model: model + reasoning effort (match effort to stakes)
- approvals: switch approval mode (read-only · on-request · full-auto)
- new: start a fresh session, dropping the conversation
- compact: summarize-and-continue; keep the findings, shed the noise
- skills: list the skills available to this session
- agent: dispatch a subagent (default · worker · explorer)
- status: model · sandbox · approvals at a glance

Reach for @file to pull a file into the prompt in place:
@data/raw/yellow_2024-03.parquet reads it without your pasting a byte. The
two you will misuse first are new and compact: starting a fresh
session drops everything (right between unrelated tasks, catastrophic
mid-investigation), while compacting summarizes and continues (right at a
natural boundary inside one task). The approvals control is where you
tighten or loosen the consultation cadence; file-state undo lives in git here,
not a built-in rewind preview — B2 carries that honest comparison.
Five prompting patterns
These are shared — the same five habits in either tool, named once here and relied on for the rest of the course. They are the difference between a request the agent can satisfy sloppily and one it cannot.
- Point at files, not pastes. @data/raw/yellow_2024-03.parquet, never forty pasted rows. Pasting burns context, loses provenance, and caps the agent at what you happened to copy.
- Demand artifacts. “Write it to journal/first-look.md,” not “tell me.” A finding that lives only in the scrollback is a finding you will lose at the next /compact.
- Make it show its work. “Show the query and the row count behind every claim.” This is the single most important habit in empirical work, and the seed of C2’s contracts, where showing your work stops being a courtesy and becomes enforced law.
- Scope the task. One month, one question. “The five worst problems in this file,” not “audit the dataset.” A scoped task finishes, reports, and leaves you something to check; an unscoped one wanders.
- Course-correct early. Interrupt the moment it heads wrong. A two-word correction at exchange three is free; the same correction at exchange forty means unwinding everything built on the detour. Interrupting beats sunk cost.
Pattern 3 is the one that carries the project, so make it concrete. When you demand “the row count behind every claim,” the receipt the agent writes back is real code in whichever language you work — the count for one of the worst-five probes, the same verdict either way:
Python
    import duckdb

    # Count the rows behind the claim directly from the parquet file.
    n = duckdb.sql(
        "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet' "
        "WHERE passenger_count IS NULL"
    ).fetchone()[0]
    print(f"passenger_count IS NULL: {n:,}")
    # passenger_count IS NULL: 426,190

This block is orchestration, not statistics; it’s the same in R. Ask the agent to translate (Lesson A1).
R
    library(duckdb)

    # Count the rows behind the claim directly from the parquet file.
    con <- dbConnect(duckdb())
    n <- dbGetQuery(con, paste(
      "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet'",
      "WHERE passenger_count IS NULL"
    ))[[1]]
    cat(sprintf("passenger_count IS NULL: %s\n", format(n, big.mark = ",")))
    # passenger_count IS NULL: 426,190

The query is the claim; the number is its receipt. The agent obliges in either language: A1’s policy, now load-bearing.
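The receipt habit can also be made mechanical. What follows is a minimal sketch, not the course's tooling: it swaps in Python's built-in sqlite3 and a tiny made-up table so it runs anywhere, and the names `Receipt` and `receipt` are hypothetical. The idea is only that a claim never travels without its query, its count, and the row total it was counted over.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    """A claim bundled with the query and counts that back it (hypothetical shape)."""
    label: str
    query: str
    count: int
    total: int

    def line(self) -> str:
        return f"{self.label}: {self.count:,} of {self.total:,} rows | {self.query}"


def receipt(con: sqlite3.Connection, table: str, label: str, predicate: str) -> Receipt:
    """Run the count for one predicate and keep the exact query text with the result."""
    query = f"SELECT count(*) FROM {table} WHERE {predicate}"
    count = con.execute(query).fetchone()[0]
    total = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    return Receipt(label, query, count, total)


# Toy stand-in data, invented for illustration (not the taxi file).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (passenger_count INTEGER, fare_amount REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, 12.5), (None, 8.0), (0, 5.5), (2, -3.0), (None, 20.0)],
)

r = receipt(con, "trips", "null_passengers", "passenger_count IS NULL")
print(r.line())
```

Against the real data the same shape would wrap the DuckDB query shown in the lesson; the point of the sketch is that the query string is stored next to the number rather than reconstructed from memory later.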
Context hygiene
The conversation is a finite working memory — the context window — and everything in it is weighed on every turn. Two consequences. First, a long unstructured session degrades: the tangents, the abandoned plots, the three competing definitions of “valid trip” all dilute the agent’s attention, which is the silent drift from the Pain vignette. Second, you control the working memory with exactly two moves, and choosing between them is the whole skill:
- Clear the session (/clear, or /new) between unrelated tasks: zero carry-over, a deliberately blank slate. The right call when you switch from cleaning to plotting; the wrong call mid-investigation, because it throws away the table you still need.
- Compact the session (/compact) at a natural boundary inside one task: it summarizes the conversation, pins the findings, and sheds the raw scan transcripts, then continues. The right call when one long task has bloated the context but its conclusions still matter.
Long EDA sessions are where this bites hardest, and the proper fix — dispatching the noisy survey work to a subagent in its own context — waits for D1. Until then, compact at the seams and clear between tasks.
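The difference between the two moves can be pictured with a toy model. This is a sketch under loud assumptions: real compaction is a model-written summary, not a filter, and the `Turn` class, the `finding` flag, and the token figures are all invented for illustration. It shows only the bookkeeping: clearing leaves nothing, compacting leaves the conclusions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    text: str
    tokens: int            # rough size estimate (made-up numbers)
    finding: bool = False  # True if this turn holds a conclusion worth keeping


def clear(history: list[Turn]) -> list[Turn]:
    """Between unrelated tasks: zero carry-over."""
    return []


def compact(history: list[Turn]) -> list[Turn]:
    """Inside one task: keep the findings, shed everything else (toy version)."""
    return [turn for turn in history if turn.finding]


def used(history: list[Turn]) -> int:
    """Total tokens currently occupying the working memory."""
    return sum(turn.tokens for turn in history)


history = [
    Turn("raw scan transcript", 9000),
    Turn("worst-five table", 400, finding=True),
    Turn("abandoned plot tangent", 2500),
    Turn("timezone tangent", 1800),
]

print(used(history), used(compact(history)), used(clear(history)))
```

In this toy the worst-five table survives a compact and does not survive a clear, which is exactly why the wrong move mid-investigation is costly.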
Image input
Data work is not all text. A plot the agent drew, a screenshot of a table from a paper you are replicating, a figure from a referee report — all of it is prompt material, and both tools accept it. The surface is dialect:
Claude Code
Paste an image straight into the prompt, or point at one by path the same way
you point at data: Compare my plot.png to @figures/published_demand.png — do the peak hours line up? The agent reads both images and answers about what it
sees, which is the fastest way to sanity-check a replication against a
published figure.
Codex
Attach an image to the prompt as input, or reference a saved one by path:
Compare my plot.png to @figures/published_demand.png — do the peak hours line up? Codex also reads UI and web state directly through its in-app browser and
Appshots, but for static figures the image-input path is the everyday tool —
the fastest way to check a replication against a published plot.
Guided Run — Directing the Work
Field Assignment
Artifact: journal/first-look.md exists with five problems, five queries, five row counts
Run a real first look at one month of the data, and produce the first entry in the discipline that runs to the end of the course. You will use all five patterns and at least one hygiene move without being told which.
Point your tool at a single month — data/raw/yellow_2024-03.parquet,
3,582,628 rows — and direct it through a first-look session:
Claude Code
- Launch in the starter repo. Check /model and leave it at the default for this scope.
- Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
- When the scan has bloated the context, /compact: keep the worst-five table, shed the raw transcripts.
- Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.
Codex
- Launch in the starter repo. Check the model control and leave it at the default for this scope.
- Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
- When the scan has bloated the context, compact the session: keep the worst-five table, shed the raw transcripts.
- Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.
A correct first look finds problems in the tens and hundreds of thousands of rows, not the dozens. Missing passenger counts dominate the month, not the dramatic negative fares:
the numbers behind this figure
- null_passengers: count = 426,190
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND passenger_count IS NULL
- zero_distance_paid: count = 70,452
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND trip_distance = 0 AND fare_amount > 5
- negative_fare: count = 58,464
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND fare_amount < 0
- zero_passengers: count = 40,372
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND passenger_count = 0
- ratecode_99: count = 37,930
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND RatecodeID = 99
- speed_over_65: count = 1,340
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_dropoff_datetime > tpep_pickup_datetime AND trip_distance / (epoch(tpep_dropoff_datetime - tpep_pickup_datetime)/3600.0) > 65
- zero_or_negative_duration: count = 1,128
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_dropoff_datetime <= tpep_pickup_datetime
- misdated_pickup: count = 23
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND strftime(tpep_pickup_datetime,'%Y-%m') <> '2024-03'
- phantom_dst_hour: count = 0
  SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND tpep_pickup_datetime >= TIMESTAMP '2024-03-10 02:00' AND tpep_pickup_datetime < TIMESTAMP '2024-03-10 03:00'

honesty note: All nine probes shipped (including the two that found little: 23 misdated rows, 0 phantom-DST trips); the figure shows the top five by count.
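How the top five fall out of the nine probes can be sketched in a few lines. The counts and the row total are copied from the lesson; the dictionary layout and variable names are assumptions made for illustration, not the course's own code.

```python
TOTAL_ROWS = 3_582_628  # rows in the March 2024 file, per the lesson

# Probe name -> row count, copied from the probe list in this lesson.
probes = {
    "null_passengers": 426_190,
    "zero_distance_paid": 70_452,
    "negative_fare": 58_464,
    "zero_passengers": 40_372,
    "ratecode_99": 37_930,
    "speed_over_65": 1_340,
    "zero_or_negative_duration": 1_128,
    "misdated_pickup": 23,
    "phantom_dst_hour": 0,
}

# Rank every probe, including the ones that found little, then take the top five.
ranked = sorted(probes.items(), key=lambda item: item[1], reverse=True)
top_five = ranked[:5]

for name, count in top_five:
    share = count / TOTAL_ROWS
    print(f"{name:<20} {count:>9,}  {share:.2%} of rows")
```

Keeping all nine probes in the ranking, rather than deleting the quiet ones, is what lets the honesty note report the probes that found 23 rows and 0 rows alongside the headline five.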
The artifact is journal/first-look.md — five problems, five queries, five
row counts. This opens the journal discipline: from here on, every lesson
ends by logging incidents, decisions, and costs to journal/, and F1 totals
what your lab actually did. The first look feeds B1, where the worst of these
problems become rules in the lab manual.
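One way to picture the artifact, and a self-check in the spirit of "five problems, five queries, five row counts", is sketched below. The journal layout (a `##` heading, a `query:` line, a `rows:` line) and the function names `render` and `complete` are hypothetical; the lesson does not prescribe a file format, and this is not what `make check-a2` actually runs. The predicates and counts are taken from the probe list above.

```python
import re

# (name, predicate, row count) for the top five probes, copied from the lesson.
FINDINGS = [
    ("null_passengers", "passenger_count IS NULL", 426_190),
    ("zero_distance_paid", "trip_distance = 0 AND fare_amount > 5", 70_452),
    ("negative_fare", "fare_amount < 0", 58_464),
    ("zero_passengers", "passenger_count = 0", 40_372),
    ("ratecode_99", "RatecodeID = 99", 37_930),
]

BASE = "SELECT count(*) FROM trips_raw WHERE file_month='2024-03' AND "


def render(findings) -> str:
    """Render findings as a journal entry in a hypothetical markdown layout."""
    lines = ["# First look: yellow_2024-03", ""]
    for name, predicate, count in findings:
        lines += [f"## {name}", f"query: {BASE}{predicate}", f"rows: {count:,}", ""]
    return "\n".join(lines)


def complete(markdown: str, expected: int = 5) -> bool:
    """True only if problems, queries, and row counts all number `expected`."""
    problems = len(re.findall(r"^## ", markdown, flags=re.M))
    queries = len(re.findall(r"^query: SELECT ", markdown, flags=re.M))
    counts = len(re.findall(r"^rows: [\d,]+$", markdown, flags=re.M))
    return problems == queries == counts == expected


entry = render(FINDINGS)
print(complete(entry))
```

In practice the agent writes the file and you read it; a check like this is only a cheap guard against an entry that states a problem without its query or its count.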
make check-a2 advances A2
Missing passenger counts dominate the month (426,190), not the dramatic negative fares.
This is the seed of C2's contracts; 'show your work' starts here as a refused shortcut.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both] Accepting a claim about data quality without the query and row count behind it. “About forty thousand bad rows” is a vibe; “40,372, predicate passenger_count = 0, over 3,582,628 rows” is a fact you can re-run. Demand the receipt every time: the habit that becomes C2’s law starts as a sentence you refuse to skip.
- [both] Never resetting the context between unrelated explorations. The answers degrade silently (no error, just a count that no longer matches the one from twenty minutes ago), and you will trust the degraded one because nothing flagged it. /clear between tasks; /compact at boundaries.
- [both] Confusing clear with compact. /clear (or /new) throws the conversation away; /compact keeps its conclusions. Reach for the wrong one mid-investigation and you either drag forty exchanges of noise forward or lose the table you needed. The rule: clear between tasks, compact inside one.
- [both] Letting a wrong turn run on sunk-cost momentum. The agent will build confidently on a bad early assumption; a two-word interrupt at exchange three is free, and the same correction at exchange forty costs you the detour and everything stacked on it. Course-correct early.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
Parity note
Session controls are near-parity with honest naming differences. Both tools
expose model and reasoning-effort selection, a compact-and-continue move, a
fresh-start move, an approval surface, and @-file references — but the
vocabulary diverges (/clear vs /new, a built-in /rewind preview on one
side against git-based file undo on the other) and the conversation-management
seams sit in slightly different places. The five prompting patterns and the
clear-versus-compact hygiene rule are fully shared: they are habits, not
syntax, and they read the same in either tool. Image input is parity for
static figures; Codex additionally reads live UI through its in-app browser,
which Claude Code reaches via the Playwright MCP rather than natively.