A2 · beginner · ~30 min

Directing the Work

Advances A2

The Pain

The first hour was electric. You asked a question, the assistant ran a query, you asked another, and for a while it felt like the work was doing itself. Then, somewhere around the fortieth exchange, the answers started to drift. You asked for the count of zero-fare trips and got a number that did not match the one from twenty minutes earlier — same file, same month, different answer, and no error anywhere to explain it. You scrolled up and could not find where the session had gone wrong, because nothing had gone wrong, exactly; it had quietly accumulated. Half-finished tangents about timezones, a plot you abandoned, three competing definitions of “valid trip” that you had floated and never resolved, all of it still sitting in the conversation, all of it still being weighed.

This is the failure that has no traceback. An assistant that can do anything will, if you let it, do a little of everything, and a long unstructured session degrades the way a tired person does — slowly, plausibly, and without announcing it. The thing nobody tells you on day one is that the agent is only as good as the way you drive it. The work that came out of that fortieth exchange was not wrong because the model was weak. It was wrong because you had given it forty things to remember and no discipline about which of them still mattered, and then you trusted a number it produced without ever asking to see the query underneath. You were a passenger in your own analysis.

Why / When

Before you configure anything, you learn to drive. Three skills, none of them about the data and all of them about the session. The command surface: the handful of session controls you reach for daily — switching models, clearing or compacting the conversation, previewing an undo, seeing what the agent may do unasked. The prompting patterns: five habits that turn a vague request into a defensible one, named once here and used for the rest of the course. Context hygiene: understanding that the conversation is a finite working memory, and that a long session left untended degrades — so you reset it deliberately at the right seams.

These accelerate every stage that follows, because every stage is driven from a session. The lab role they absorb is none in particular — it is you, working an order of magnitude faster, but only if you direct the work instead of narrating at it. The skill is cheap to learn and expensive to skip: the researcher who never learns to compact a session pays for it in silently degraded answers on exactly the long investigations that matter most.

Contrary winds

Not for: a single self-contained question you can ask, answer, and close in one breath — driving discipline is for sessions that live long enough to drift, not for a one-line lookup.

Mechanics

Three things to learn, in order: the controls, the patterns, the hygiene. The controls and the image-input surface are dialect, so they live in tabs; the five patterns and the hygiene rule are shared — they read identically in both tools.

The command surface

A session is driven by two kinds of input: plain-language prompts to the agent, and slash-commands to the session itself — meta-controls that switch models, manage the conversation, and show you what is happening. Learn the daily handful. One distinction spans both tools and matters later: a built-in command ships in the box and is fixed; a skill is something your lab authors and invokes behind a slash of its own (C1 makes you write one). Same syntax, different origin.

Claude Code

The daily set in Claude Code, with what each is for:

/help          this list
/model         model + reasoning effort (match effort to stakes)
/clear         drop the conversation — start clean
/compact       summarize-and-continue — keep the findings, shed the noise
/rewind        checkpoint preview + restore (preview only until you confirm)
/permissions   what may run unasked
/resume        pick up an earlier session
/cost          this session's token ledger

Reach for @path to point the agent at a file in place — @data/raw/yellow_2024-03.parquet reads it without your pasting a byte. The two you will misuse first are /clear and /compact: /clear throws the whole conversation away (right between unrelated tasks, catastrophic mid-investigation), while /compact summarizes it and continues (right at a natural boundary inside one task). /rewind previews the session’s checkpoints and restores nothing until you confirm — the undo lane B2 stages in full.

Codex

The daily set in Codex, with what each is for (typed as slash-commands at the prompt; the command-tour run below shows the live surface):

model         model + reasoning effort (match effort to stakes)
approvals     switch approval mode (read-only · on-request · full-auto)
new           start a fresh session — drop the conversation
compact       summarize-and-continue — keep the findings, shed the noise
skills        list the skills available to this session
agent         dispatch a subagent (default · worker · explorer)
status        model · sandbox · approvals at a glance

Reach for @file to pull a file into the prompt in place — @data/raw/yellow_2024-03.parquet reads it without your pasting a byte. The two you will misuse first are new and compact: starting a fresh session drops everything (right between unrelated tasks, catastrophic mid-investigation), while compacting summarizes and continues (right at a natural boundary inside one task). The approvals control is where you tighten or loosen the consultation cadence; file-state undo lives in git here, not a built-in rewind preview — B2 carries that honest comparison.

Five prompting patterns

These are shared — the same five habits in either tool, named once here and relied on for the rest of the course. They are the difference between a request the agent can satisfy sloppily and one it cannot.

Point at files, not pastes. @data/raw/yellow_2024-03.parquet, never forty pasted rows. Pasting burns context, loses provenance, and caps the agent at what you happened to copy.
Demand artifacts. “Write it to journal/first-look.md,” not “tell me.” A finding that lives only in the scrollback is a finding you will lose at the next /compact.
Make it show its work. “Show the query and the row count behind every claim.” This is the single most important habit in empirical work — and the seed of C2’s contracts, where showing your work stops being a courtesy and becomes enforced law.
Scope the task. One month, one question. “The five worst problems in this file,” not “audit the dataset.” A scoped task finishes, reports, and leaves you something to check; an unscoped one wanders.
Course-correct early. Interrupt the moment it heads wrong. A two-word correction at exchange three is free; the same correction at exchange forty means unwinding everything built on the detour. Interrupting beats sunk cost.
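Patterns 1 through 4 can be combined into a single request. The sketch below is illustrative only: build_prompt and its parameters are hypothetical names, not part of either tool, and the wording is just one plausible phrasing of such a request.

```python
def build_prompt(data_path: str, question: str, artifact_path: str) -> str:
    """Compose one scoped request that applies patterns 1-4.

    Hypothetical helper for illustration; neither tool ships this function.
    """
    return "\n".join([
        # Pattern 4: one file, one question.
        question,
        # Pattern 1: point at the file with @ instead of pasting rows.
        f"Use @{data_path} as the only input.",
        # Pattern 3: every claim carries its query and row count.
        "Show the query and the row count behind every claim.",
        # Pattern 2: the result lands in a file, not only in the scrollback.
        f"Write the findings to {artifact_path}.",
    ])


prompt = build_prompt(
    data_path="data/raw/yellow_2024-03.parquet",
    question="Find the five worst data-quality problems in this month.",
    artifact_path="journal/first-look.md",
)
print(prompt)
```

Pattern 5 has no line in the prompt because it is not something you write up front; it is the interrupt you send once the agent starts heading the wrong way.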

Pattern 3 is the one that carries the project, so make it concrete. When you demand “the row count behind every claim,” the receipt the agent writes back is real code in whichever language you work — the count for one of the worst-five probes, the same verdict either way:

Python

import duckdb
n = duckdb.sql(
    "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet' "
    "WHERE passenger_count IS NULL"
).fetchone()[0]
print(f"passenger_count IS NULL: {n:,}")   # passenger_count IS NULL: 426,190

This block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).

R

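library(duckdb)  # also attaches DBI, which provides dbConnect/dbGetQuery
con <- dbConnect(duckdb())  # in-memory DuckDB; the parquet file is read in place
n <- dbGetQuery(con, paste(
  "SELECT count(*) FROM 'data/raw/yellow_2024-03.parquet'",
  "WHERE passenger_count IS NULL"))[[1]]
cat(sprintf("passenger_count IS NULL: %s\n", format(n, big.mark = ",")))
# passenger_count IS NULL: 426,190
dbDisconnect(con, shutdown = TRUE)  # release the connection when done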

The query is the claim; the number is its receipt. The agent obliges in either language — A1’s policy, now load-bearing.

Context hygiene

The conversation is a finite working memory — the context window — and everything in it is weighed on every turn. Two consequences. First, a long unstructured session degrades: the tangents, the abandoned plots, the three competing definitions of “valid trip” all dilute the agent’s attention, which is the silent drift from the Pain vignette. Second, you control the working memory with exactly two moves, and choosing between them is the whole skill:

Clear the session (/clear, or /new) between unrelated tasks — zero carry-over, a deliberately blank slate. The right call when you switch from cleaning to plotting; the wrong call mid-investigation, because it throws away the table you still need.
Compact the session (/compact) at a natural boundary inside one task — it summarizes the conversation, pins the findings, and sheds the raw scan transcripts, then continues. The right call when one long task has bloated the context but its conclusions still matter.

Long EDA sessions are where this bites hardest, and the proper fix — dispatching the noisy survey work to a subagent in its own context — waits for D1. Until then, compact at the seams and clear between tasks.
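The difference between the two moves can be pictured with a toy model of the conversation. This is a mental model only, assuming the session is a plain list of turns with a hand-set is_finding flag; Turn, clear, and compact are hypothetical names, and neither tool implements its commands this way.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    text: str
    is_finding: bool = False  # True for conclusions worth carrying forward


def clear(history: list[Turn]) -> list[Turn]:
    """Between unrelated tasks: zero carry-over."""
    return []


def compact(history: list[Turn]) -> list[Turn]:
    """At a boundary inside one task: keep findings, shed the rest.

    A real compaction writes a summary; this toy version keeps only the
    turns flagged as findings and folds them into one summary turn.
    """
    findings = [t.text for t in history if t.is_finding]
    if not findings:
        return []
    return [Turn("Summary: " + "; ".join(findings), is_finding=True)]


history = [
    Turn("tangent about timezones"),
    Turn("abandoned plot"),
    Turn("passenger_count IS NULL: 426,190", is_finding=True),
    Turn("raw scan transcript"),
    Turn("negative fares: 58,464", is_finding=True),
]

print(len(clear(history)))       # nothing survives a clear
print(compact(history)[0].text)  # only the findings survive a compact
```

In the toy model you decide what counts as a finding; in a real session the summary is written for you, which is one more reason to have findings in a journal file before you compact.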

Image input

Data work is not all text. A plot the agent drew, a screenshot of a table from a paper you are replicating, a figure from a referee report — all of it is prompt material, and both tools accept it. The surface is dialect:

Claude Code

Paste an image straight into the prompt, or point at one by path the same way you point at data: Compare my plot.png to @figures/published_demand.png — do the peak hours line up? The agent reads both images and answers about what it sees, which is the fastest way to sanity-check a replication against a published figure.

Codex

Attach an image to the prompt as input, or reference a saved one by path: Compare my plot.png to @figures/published_demand.png — do the peak hours line up? Codex also reads UI and web state directly through its in-app browser and Appshots, but for static figures the image-input path is the everyday tool — the fastest way to check a replication against a published plot.

Guided Run — Directing the Work

Field Terminal — session: a2-command-tour Claude Code

claude


Field Assignment

Artifact: journal/first-look.md exists with five problems, five queries, five row counts

Run a real first look at one month of the data, and produce the first entry in the discipline that runs to the end of the course. You will use all five patterns and at least one hygiene move without being told which.

Point your tool at a single month — data/raw/yellow_2024-03.parquet, 3,582,628 rows — and direct it through a first-look session:

Claude Code

Launch in the starter repo. Check /model and leave it at the default for this scope.
Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
When the scan has bloated the context, /compact — keep the worst-five table, shed the raw transcripts.
Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.

Codex

Launch in the starter repo. Check the model control and leave it at the default for this scope.
Find the five worst data-quality problems in the month, pointing at the file with @ (pattern 1) and demanding the query and row count behind every one (pattern 3).
When the scan has bloated the context, compact the session — keep the worst-five table, shed the raw transcripts.
Demand the artifact (pattern 2): write the findings to journal/first-look.md, every claim carrying its query and count.

A correct first look finds them in the hundreds of thousands, not the dozens: missing passenger counts dominate the month, not the dramatic negative fares.

The five worst things in one month of taxi data — Top data-quality problems in yellow_tripdata_2024-03, counted by direct SQL: missing passenger counts (426,190), zero-distance paid trips (70,452), negative fares (58,464), zero-passenger trips (40,372), undocumented rate code 99 (37,930).
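To put those counts in proportion, each can be divided by the month's 3,582,628 rows. The sketch below uses only the figures reported in this lesson; the share helper and the dictionary layout are illustrative choices, not part of the starter repo.

```python
TOTAL_ROWS = 3_582_628  # rows in the month, as stated in the lesson

# Counts as reported in the lesson's worst-five summary.
worst_five = {
    "missing passenger counts": 426_190,
    "zero-distance paid trips": 70_452,
    "negative fares": 58_464,
    "zero-passenger trips": 40_372,
    "undocumented rate code 99": 37_930,
}


def share(count: int, total: int = TOTAL_ROWS) -> float:
    """Fraction of the month's rows affected by one problem."""
    return count / total


# Largest problem first, with its count and its share of all rows.
for problem, count in sorted(worst_five.items(), key=lambda kv: -kv[1]):
    print(f"{problem:<28} {count:>9,}  {share(count):6.2%}")
```

The five counts should not be added together to get a number of bad rows: the lesson does not say whether the problems overlap, and a single trip can fail more than one probe.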

The artifact is journal/first-look.md — five problems, five queries, five row counts. This opens the journal discipline: from here on, every lesson ends by logging incidents, decisions, and costs to journal/, and F1 totals what your lab actually did. The first look feeds B1, where the worst of these problems become rules in the lab manual.

Milestone gate · make check-a2 · advances A2

A first-look session run on one month (yellow_2024-03.parquet, 3,582,628 rows)
The five worst data-quality problems found by pointing at the file with @ — no rows pasted
Missing passenger counts dominate the month (426,190), not the dramatic negative fares.
Each problem carries its query and row count — the claim and its receipt
This is the seed of C2's contracts; 'show your work' starts here as a refused shortcut.
A hygiene move used deliberately — /compact at the boundary, not /clear mid-investigation
journal/first-look.md written: five problems, five queries, five counts — the journal discipline opens

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both]

Accepting a claim about data quality without the query and row count behind it. “About forty thousand bad rows” is a vibe; “40,372, predicate passenger_count = 0, over 3,582,628 rows” is a fact you can re-run. Demand the receipt every time — the habit that becomes C2’s law starts as a sentence you refuse to skip.
[both]

Never resetting the context between unrelated explorations. The answers degrade silently — no error, just a count that no longer matches the one from twenty minutes ago — and you will trust the degraded one because nothing flagged it. /clear between tasks; /compact at boundaries.
[both]

Confusing clear with compact. /clear (or /new) throws the conversation away; /compact keeps its conclusions. Reach for the wrong one mid-investigation and you either drag forty exchanges of noise forward or lose the table you needed. The rule: clear between tasks, compact inside one.
[both]

Letting a wrong turn run on sunk-cost momentum. The agent will build confidently on a bad early assumption; a two-word interrupt at exchange three is free, and the same correction at exchange forty costs you the detour and everything stacked on it. Course-correct early.
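The receipt from the first pitfall (a count, its predicate, and the table size) can be produced by one small function. The sketch below is a stand-in under stated assumptions: it uses Python's built-in sqlite3 and a tiny made-up table so it runs anywhere, whereas the lesson's real queries go to DuckDB against the parquet file. The receipt function and the trips table are hypothetical names.

```python
import sqlite3


def receipt(con: sqlite3.Connection, table: str, predicate: str) -> dict:
    """Return the query, matching row count, and table size for one claim.

    `table` and `predicate` are interpolated into SQL as-is, so this is
    for your own trusted strings only, never for untrusted input.
    """
    query = f"SELECT count(*) FROM {table} WHERE {predicate}"
    matched = con.execute(query).fetchone()[0]
    total = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    return {"query": query, "matched": matched, "total": total}


# Tiny made-up table standing in for the real month of trips.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (passenger_count INTEGER)")
con.executemany(
    "INSERT INTO trips VALUES (?)",
    [(1,), (0,), (None,), (2,), (0,), (None,), (None,)],
)

r = receipt(con, "trips", "passenger_count = 0")
print(f"{r['matched']:,}, predicate passenger_count = 0, over {r['total']:,} rows")
print(r["query"])
```

Returning the query text alongside the number is the point: anyone holding the receipt can re-run it and check the count for themselves.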

Check Your Bearings

A2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

Record the first look: the five worst problems you found in the month, each with its query and row count, and which hygiene move (clear or compact) you used and why.

as of June 2026

Session controls are near-parity with honest naming differences. Both tools expose model and reasoning-effort selection, a compact-and-continue move, a fresh-start move, an approval surface, and @-file references — but the vocabulary diverges (/clear vs /new, a built-in /rewind preview on one side against git-based file undo on the other) and the conversation-management seams sit in slightly different places. The five prompting patterns and the clear-versus-compact hygiene rule are fully shared: they are habits, not syntax, and they read the same in either tool. Image input is parity for static figures; Codex additionally reads live UI through its in-app browser, which Claude Code reaches via the Playwright MCP rather than natively.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the wall — the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
you, working an order of magnitude faster — but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the lab manual nobody writes — the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the whole lab, orchestrated — the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.


Figure data (problem counts in yellow_tripdata_2024-03): null_passengers 426,190; zero_distance_paid 70,452; negative_fare 58,464; zero_passengers 40,372; ratecode_99 37,930; speed_over_65 1,340; zero_or_negative_duration 1,128; misdated_pickup 23; phantom_dst_hour 0.

