The Pain
The file came down at midnight, the way these things do, and by one in the morning you had it open and you understood the shape of the next three days. Three and a half million rows of yellow-cab trips, and somewhere in the column called fare there was a charge of negative eight hundred dollars. A trip that covered a hundred and seventy-six thousand miles in thirteen minutes. A pickup timestamped to the last day of 2002, nineteen years before the meter that recorded it existed. Whole rows where the passenger count was blank — not zero, blank — and you had no way yet to know whether that was four hundred thousand rows or forty.
You are the whole lab. You are the methodologist who will eventually estimate how weather moves the demand for rides, and you are also the person who has to sit here at one in the morning deciding, by hand, what counts as a real trip. Nobody hands a graduate student a clean dataset. The cleaning is the dissertation’s foundation, and it is unglamorous, and it is yours, and it is the wall every empirical project hits first: the days that vanish before the first honest plot, the work no methods section ever describes because it is assumed rather than shown. You make coffee. You start writing the same defensive parsing code you have written for three previous projects, knowing that by the time it works you will have forgotten why each rule is there. The wall is not the analysis. The wall is everything between you and the analysis, and tonight, as usual, you are facing it alone.
Why / When
An agentic command-line tool is a language model given four things a chat window withholds: a filesystem it can read and write, a shell it can run commands in, persistence so it remembers the project across a session, and a set of tools it chooses among on its own. That combination is a difference in kind, not degree. A chat window can advise you about data it has never touched. An agent reads row 4,217, runs the cleaning script, sees the traceback, and fixes it — the loop this whole course is built on: prompt → act → observe → fix.
Two tools teach this course, and they share that loop while differing in temperament. One is built around a local session and small composable pieces you assemble yourself; the other leans toward delegating whole tasks, with first-class cloud runs. The concepts transfer; only the dialect changes — which is exactly why you learn both. And you learn the honest limits up front, because they are real: agents misread schemas, invent joins that look right, and will happily optimize a metric into meaninglessness if you let them. Unit C answers the first with enforcement; Unit D answers the last with adversarial review. Today is none of that. Today you watch the bare loop work, so you know what you are later making trustworthy.
In the research pipeline this is the very first stage — data cleaning and first contact — and the lab role it absorbs is no single person. It is the wall itself: the unstaffed midnight hours between a raw file and a defensible first plot.
Contrary winds
Not for: a number you can get from one line of SQL you already know — opening an agent to compute a single mean is ceremony, not leverage.
Mechanics
Today is deliberately configuration-free. You install one of the two tools, authenticate, and point it at a mangled file. No instruction files, no settings, no skills — those arrive across Units B and C. The bare loop first.
What these tools are
Both tools are the same animal: a model with hands. You type a request in plain language; the model plans, calls a tool (read a file, run a command, write a patch), reads what came back, and decides what to do next — looping until the work is done or it needs you. Two controls matter on day one and are common to both tools, under different names:
- An approval prompt / approval mode: the agent pauses before a consequential action (writing a file, installing a package, running a destructive command) and waits for your y/n. This is your hand on the tiller. Deny anything you do not understand.
- A model and reasoning-effort setting: which model drives, and how hard it thinks. The default is fine today; A2 makes this a daily habit.
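The loop and its approval pause can be sketched as a toy in Python. This is only an illustration of the control flow described above, not the internals of either tool: `Action`, `run_loop`, `scripted_plan`, and `fake_execute` are hypothetical names, and the "model" here is a fixed script rather than a language model.

```python
from dataclasses import dataclass


@dataclass
class Action:
    tool: str            # hypothetical tool name, e.g. "read_file"
    argument: str        # what the tool is applied to
    consequential: bool  # True for writes, installs, destructive commands


def run_loop(plan, execute, approve, max_turns=10):
    """Plan -> (maybe ask) -> act -> observe, repeated until the plan is done."""
    history = []  # observations the planner sees on the next turn
    for _ in range(max_turns):
        action = plan(history)            # decide the next step from what came back
        if action is None:                # nothing left to do
            break
        if action.consequential and not approve(action):
            history.append((action.tool, "denied by user"))
            continue                      # a denial is also an observation
        history.append((action.tool, execute(action)))
    return history


def scripted_plan(history):
    """Stand-in for the model: a fixed three-step plan (an assumption)."""
    script = [
        Action("read_file", "data/messy.csv", False),
        Action("write_file", "clean_taxi.py", True),
        Action("run", "python clean_taxi.py", True),
    ]
    return script[len(history)] if len(history) < len(script) else None


def fake_execute(action):
    """Stand-in for real tool execution; nothing touches the disk."""
    return f"ok: {action.tool} {action.argument}"


# Approve everything, then approve the write but deny the run.
approved = run_loop(scripted_plan, fake_execute, approve=lambda a: True)
denied = run_loop(scripted_plan, fake_execute, approve=lambda a: a.tool != "run")

for step in denied:
    print(step)
```

The point of the sketch is where `approve` sits: only consequential actions reach it, and a denial goes back into the history like any other observation, so the next decision can take it into account.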
Before you run anything, watch one full turn of the loop in slow motion. This is the “what just happened” view of everything you are about to see scroll past — the sentence becoming a tool call becoming an observation becoming the next decision.
Step 1 of 6.
It starts as a sentence, not a script. You describe the artifact you want — clean the February file and plot the fare distribution — and press enter. That sentence is the whole program.
That single turn, repeated until the work is done, is the entire mechanism. Everything else in this course makes that turn safer, cheaper, or more trustworthy.
Install and authenticate
Pick the tool you will follow the course in — you can install the other later; the concepts are identical. Install, authenticate, and read the opening banner, because the banner tells you the two things that matter: that you are running unconfigured, and what the agent may do without asking.
Claude Code
    npm install -g @anthropic-ai/claude-code
    cd scratch/day-one   # a throwaway folder holding only data/messy.csv
    claude               # opens a session; first run walks you through login

The first launch sends you to the browser to authenticate, then drops you at a
prompt inside the current directory. The banner notes there is no CLAUDE.md
here (no project instructions), so the agent is running on defaults. That is
the point of day one. When the agent later wants to write a file, you will see
an approval prompt like Apply edit to clean_taxi.py? (y/n); that pause is
where you stay in control. claude --resume brings a past session back; you
will not need it today.
Codex
    npm install -g @openai/codex
    cd scratch/day-one   # a throwaway folder holding only data/messy.csv
    codex                # opens a session; first run walks you through sign-in

The first launch authenticates you (browser sign-in or an API key), then opens
a session in the current directory. The banner reports its sandbox and
approval mode, typically sandbox: workspace-write · approvals: on-request.
Read that pair literally: the sandbox bounds where the agent may
act (this directory tree, not your whole machine), and the approval mode sets
when it consults you (on consequential actions, like writing
clean_taxi.py). Two independent dials; B3 turns both into real safety
profiles. codex exec resume continues a run headlessly; not needed today.
The dialect differs — login flow, banner wording, the resume command — but the session you are now sitting in front of is the same loop in both.
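The "two independent dials" idea can be made concrete with a small Python toy. It is a sketch under stated assumptions, not how Codex implements either control: the function names, the set of "consequential" actions, and the mode names other than on-request are invented for illustration.

```python
from pathlib import Path

# Assumption: which actions count as consequential is made up for this toy.
CONSEQUENTIAL = {"write", "install", "delete"}


def inside_sandbox(target: str, workspace: str) -> bool:
    """Sandbox dial: WHERE may the agent act? Only inside the workspace tree."""
    root = Path(workspace).resolve()
    path = (root / target).resolve()   # collapses ".." and absolute targets
    return path == root or root in path.parents


def needs_approval(action: str, mode: str) -> bool:
    """Approval dial: WHEN does the agent consult you?"""
    if mode == "always":      # hypothetical mode name
        return True
    if mode == "never":       # hypothetical mode name
        return False
    return action in CONSEQUENTIAL   # "on-request", as read in this lesson


def decide(action: str, target: str, workspace: str, mode: str) -> str:
    """Combine the dials: the sandbox is checked first, then the approval mode."""
    if not inside_sandbox(target, workspace):
        return "blocked"
    return "ask" if needs_approval(action, mode) else "allow"


workspace = "/tmp/day-one"
print(decide("read", "data/messy.csv", workspace, "on-request"))    # allow
print(decide("write", "clean_taxi.py", workspace, "on-request"))    # ask
print(decide("write", "../../etc/hosts", workspace, "on-request"))  # blocked
```

Keeping the two checks in separate functions mirrors the lesson's point: tightening one dial does not change the other.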
The quick win
Here is the file the agent is about to meet. These are not invented horrors; every row is verbatim from the raw 2024 yellow-cab files — the wall from the Pain vignette, made concrete.
The numbers behind this figure

ok (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-02'
      AND tpep_pickup_datetime BETWEEN TIMESTAMP '2024-02-14 09:00' AND TIMESTAMP '2024-02-14 09:05'
      AND passenger_count = 1 AND fare_amount BETWEEN 5 AND 30 AND trip_distance BETWEEN 0.5 AND 5
    ORDER BY tpep_pickup_datetime LIMIT 1

misdated (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03' AND tpep_pickup_datetime < TIMESTAMP '2024-01-01'
    ORDER BY tpep_pickup_datetime LIMIT 1

neg_fare (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03' AND fare_amount = -800 AND trip_distance = 0
    LIMIT 1

zero_sec (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03' AND tpep_dropoff_datetime = tpep_pickup_datetime AND trip_distance > 1
    ORDER BY tpep_pickup_datetime LIMIT 1

speed (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03'
    ORDER BY trip_distance DESC LIMIT 1

null_pass (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03' AND passenger_count IS NULL AND fare_amount > 0
    ORDER BY tpep_pickup_datetime LIMIT 1

dst (1 row)
    SELECT tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, fare_amount, total_amount
    FROM trips_raw
    WHERE file_month='2024-03'
      AND tpep_pickup_datetime BETWEEN TIMESTAMP '2024-03-10 01:00' AND TIMESTAMP '2024-03-10 01:59:59'
      AND tpep_dropoff_datetime >= TIMESTAMP '2024-03-10 03:00'
    ORDER BY tpep_pickup_datetime LIMIT 1

Honesty note: All rows verbatim from the raw files; nothing synthesized.
Now point the agent at it. The discipline, which A2 will name formally, is already visible in the prompt: you point at the file, you do not paste its rows; and you demand artifacts — a written summary and a plot on disk, not a verdict in the scrollback.
Claude Code
> Clean data/messy.csv. Count every problem you find, drop rows by documented rules, and write a cleaned summary to cleaned_summary.md plus one plot of trips by hour to plot.png.

The agent reads the head of the file first (a few hundred tokens reveal the delimiters, the dtypes, and the first specimens of trouble), proposes a small cleaning script, and pauses for your approval before writing it. You approve; it runs; it reports counts per rule. The interactive run below is that turn by turn; drive it yourself.
Codex
> Clean data/messy.csv. Count every problem you find, drop rows by documented rules, and write a cleaned summary to cleaned_summary.md plus one plot of trips by hour to plot.png.

The agent reads the head of the file first (a few hundred tokens reveal the delimiters, the dtypes, and the first specimens of trouble), proposes a small cleaning script, and pauses on-request before writing it. You approve; it runs; it reports counts per rule. The prompt is identical to the Claude Code tab: same destination, same artifacts; only the approval surface around the write differs.
What you get back is the same file, cleaned and counted — every removal named, the worst offenders quoted, the survivors plotted:
The numbers behind this figure

march_cascade
    SELECT count(*) AS raw,
           count(*) FILTER (WHERE ok_month) AS s1,
           count(*) FILTER (WHERE ok_month AND ok_fare) AS s2,
           count(*) FILTER (WHERE ok_month AND ok_fare AND ok_duration) AS s3,
           count(*) FILTER (WHERE ok_month AND ok_fare AND ok_duration
                            AND implied_mph <= 65) AS s4
    FROM trips_flagged WHERE file_month='2024-03'

| stage | rows |
|---|---|
| raw | 3,582,628 |
| s1 | 3,582,605 |
| s2 | 3,524,141 |
| s3 | 3,523,019 |
| s4 | 3,521,703 |

manhattan_hourly (24 rows · 3,148,474 trips total)
    SELECT hour(tpep_pickup_datetime) AS hh, count(*) AS trips
    FROM trips_clean t JOIN zones z ON z.location_id = t.PULocationID
    WHERE t.file_month = '2024-03' AND z.borough = 'Manhattan'
    GROUP BY 1 ORDER BY 1

The receipts matter more than the plot. The agent did not silently delete the bad rows: it named every removal, quoted the worst offenders (a fare of −800.00 on a 0.00-mile trip), and left a script you can read and defend. That is the difference between an edit and a finding.
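The cascade of counts can be sketched in plain Python. This is a minimal illustration, not the script the agent writes: the flag names `ok_month`, `ok_fare`, `ok_duration` and the 65 mph ceiling come from the query, but their exact definitions are not given in the lesson, so the predicates below are assumptions. The five trips are invented to echo the specimen types; they are not rows from the raw files.

```python
from collections import Counter
from datetime import datetime


def duration_hours(trip):
    return (trip["dropoff"] - trip["pickup"]).total_seconds() / 3600


# Assumed rule definitions; applied in order, so the speed rule only ever
# sees trips whose duration is already known to be positive.
RULES = [
    ("ok_month", lambda t: t["pickup"].strftime("%Y-%m") == t["file_month"]),
    ("ok_fare", lambda t: t["fare"] > 0),
    ("ok_duration", lambda t: duration_hours(t) > 0),
    ("implied_mph<=65", lambda t: t["distance"] / duration_hours(t) <= 65),
]


def cascade(trips, rules):
    """Apply rules in order; record survivors and removals at every stage."""
    survivors = list(trips)
    receipt = [("raw", len(survivors), 0)]
    for name, keep in rules:
        kept = [t for t in survivors if keep(t)]
        receipt.append((name, len(kept), len(survivors) - len(kept)))
        survivors = kept
    return survivors, receipt


def trip(pickup, dropoff, distance, fare):
    fmt = "%Y-%m-%d %H:%M"
    return {"file_month": "2024-03",
            "pickup": datetime.strptime(pickup, fmt),
            "dropoff": datetime.strptime(dropoff, fmt),
            "distance": distance, "fare": fare}


# Invented specimens: one plausible trip and one of each kind of trouble.
trips = [
    trip("2024-03-14 09:00", "2024-03-14 09:12", 2.1, 12.8),       # plausible
    trip("2002-12-31 23:05", "2002-12-31 23:20", 1.5, 9.0),        # misdated
    trip("2024-03-05 10:00", "2024-03-05 10:01", 0.0, -800.0),     # negative fare
    trip("2024-03-06 08:00", "2024-03-06 08:00", 3.2, 15.0),       # zero seconds
    trip("2024-03-07 12:00", "2024-03-07 12:13", 176000.0, 20.0),  # absurd speed
]

survivors, receipt = cascade(trips, RULES)
hourly = Counter(t["pickup"].hour for t in survivors)  # trips by pickup hour

for stage, kept, removed in receipt:
    print(f"{stage:16s} kept={kept} removed={removed}")
```

Each line of the receipt is a rule you can name in a methods section, which is the property the cascade table above is demonstrating.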
Ask in Python, then in R
The lab is bilingual, and the agent does not care which language it works in.
Re-ask for the same task in the other language and watch the verdicts come
back identical — the same counts, rule for rule, with dplyr filtering where
pandas masked.
Python
> Now do the same cleaning task again, this time in Python: same drop rules, same counted summary.

This block is orchestration, not statistics; it’s the same in R. Ask the agent to translate (Lesson A1).
R
> Now do the same cleaning task again, this time in R: same drop rules, same counted summary.

This is the language policy, stated once for the whole course: your statistics live in Python or R, and the agentic skills you are learning transfer untouched between them. The R toggle in this site’s header works the same way: flip it and the statistical code rewrites; the lesson does not. This course teaches the tools, not the languages.
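Checking that the verdicts match "rule for rule" is easy to do mechanically once both runs have reported their counts. The sketch below assumes you have copied each run's per-rule counts into a dictionary by hand; the function name, the rule keys, and the mismatching example are all hypothetical, and no particular format of cleaned_summary.md is assumed.

```python
def compare_verdicts(python_counts, r_counts):
    """Return {rule: (python_count, r_count)} for every rule that disagrees.

    A rule missing from one side shows up with None on that side.
    """
    rules = sorted(set(python_counts) | set(r_counts))
    return {
        rule: (python_counts.get(rule), r_counts.get(rule))
        for rule in rules
        if python_counts.get(rule) != r_counts.get(rule)
    }


# Illustrative counts only; the rule names are made up for this example.
python_run = {"negative_fare": 82, "zero_distance_paid": 98}
r_run_same = {"negative_fare": 82, "zero_distance_paid": 98}
r_run_off = {"negative_fare": 82, "zero_distance_paid": 97, "null_passenger": 4}

print(compare_verdicts(python_run, r_run_same))  # empty dict: verdicts match
print(compare_verdicts(python_run, r_run_off))   # names the rules that differ
```

An empty result means the two languages agree on every rule; anything else tells you which rule to ask the agent about.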
The research project
Everything from here builds one project: Weather and the Demand for Urban Mobility. The question is plain — when the weather turns, who still rides? — and the answer is a report. The data is twenty-four months of New York yellow and green taxi trips joined to weather; the deliverable is a reproducible estimate of how rain, snow, and heat move demand across the city’s zones. The messy file you just cleaned is one sample month of it.
Clone the starter kit and you have the project’s skeleton — the directory
contract every later lesson assumes, an empty journal/ for the receipts you
are about to start keeping, and a Makefile whose make check-a1 … check-f1
targets are the milestones you will tick off one unit at a time. The one
command python3 get_data.py then fetches the fixed course slice into
./data/: the 2024 yellow-taxi parquet months, the zone lookup, and the NYC
hourly weather every later lesson is built on.
    git clone https://github.com/junwei-lu/agentic-datascience-course-kit.git
    cd agentic-datascience-course-kit && python3 get_data.py

No clone needed? The same slice is available zero-install over DuckDB-over-HTTP, or by curling just the script; see Get the data for every path.
Guided Run — The Ten-Minute Quick Win
Field Assignment
Artifact: quick-win transcripts saved to journal/; starter repo cloned
Get hired. By the end you have both the muscle memory of one full loop and the project that the rest of the course advances.
Claude Code
- Install Claude Code, authenticate, and launch claude in a scratch folder holding only data/messy.csv. Confirm the banner reports no CLAUDE.md; you are running the bare loop on purpose.
- Run the quick win: clean data/messy.csv into cleaned_summary.md and plot.png, approving the cleaning script when prompted. Read the counts per rule before you accept them.
- Re-ask for the same task in the other language (Python ↔ R) and confirm the verdicts match.
- Ask the agent to save this run’s summary to journal/quick-win.md.
- Clone the starter repo for Weather and the Demand for Urban Mobility.
Codex
- Install Codex, authenticate, and launch codex in a scratch folder holding only data/messy.csv. Read the banner’s sandbox + approval line; you are running the bare loop on purpose.
- Run the quick win: clean data/messy.csv into cleaned_summary.md and plot.png, approving the cleaning patch on-request. Read the counts per rule before you accept them.
- Re-ask for the same task in the other language (Python ↔ R) and confirm the verdicts match.
- Ask the agent to save this run’s summary to journal/quick-win.md.
- Clone the starter repo for Weather and the Demand for Urban Mobility.
The artifact is the saved transcript and the cloned repo. It feeds A2, where
you stop watching the loop and start directing it — and where the journal/
you just opened becomes a standing discipline.
make check-a1 advances A1.
Day one is the zero-configuration 'before' picture on purpose.
Read the receipts — negative fares, zero-distance paid trips, NULL passenger counts — before accepting.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
Accepting the cleaned file without reading the counts. The whole value of the quick win is the receipt — “82 negative fares, 98 zero-distance paid trips” — not the tidy output. A cleaning you cannot describe rule-by-rule is a cleaning you cannot defend in a methods section, and the agent will produce a confident, plausible, undocumented one if you let it.
- [both]
Pasting rows of the CSV into the prompt instead of pointing at the file. It burns context, it loses provenance, and it caps the agent at whatever you happened to copy. Point at the path; let the agent read row 4,217 itself.
- [both]
Treating day one as proof the agent is trustworthy. It is not — it is proof the loop works. Agents misread schemas and invent joins; you watched a clean run, not a guaranteed one. The trust is built across the next five units, not asserted here.
- [both]
Clicking through approval prompts without reading them. The pause before a write is the only place you stay in control on day one. An approval you grant reflexively is a file you did not actually authorize, and on a real project that is how data/raw/ gets edited “just this once.”
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
Parity note
Day one is genuine parity. Both tools install with one command, authenticate through the browser, open a session inside your working directory, and run the same prompt → act → observe → fix loop against the same mangled file to the same cleaned, counted result. The differences are dialect: the login flow, the banner wording, the resume command, and how the pre-write pause is framed — an approval prompt on one side, an approval mode layered over an OS sandbox on the other. Those surfaces diverge more as the course goes on; the loop underneath does not.