A1 beginner ~20 min

Day One: Your First Agent

Advances A1

The Pain

The file came down at midnight, the way these things do, and by one in the morning you had it open and you understood the shape of the next three days. Three and a half million rows of yellow-cab trips, and somewhere in the column called fare there was a charge of negative eight hundred dollars. A trip that covered a hundred and seventy-six thousand miles in thirteen minutes. A pickup timestamped to the last day of 2002, nineteen years before the meter that recorded it existed. Whole rows where the passenger count was blank — not zero, blank — and you had no way yet to know whether that was four hundred thousand rows or forty.

You are the whole lab. You are the methodologist who will eventually estimate how weather moves the demand for rides, and you are also the person who has to sit here at one in the morning deciding, by hand, what counts as a real trip. Nobody hands a graduate student a clean dataset. The cleaning is the dissertation’s foundation, and it is unglamorous, and it is yours, and it is the wall every empirical project hits first — the days that vanish before the first honest plot, the work no methods section ever describes because it is assumed and never done. You make coffee. You start writing the same defensive parsing code you have written for three previous projects, knowing that by the time it works you will have forgotten why each rule is there. The wall is not the analysis. The wall is everything between you and the analysis, and tonight, as usual, you are facing it alone.

Why / When

An agentic command-line tool is a language model given four things a chat window withholds: a filesystem it can read and write, a shell it can run commands in, persistence so it remembers the project across a session, and a set of tools it chooses among on its own. That combination is a difference in kind, not degree. A chat window can advise you about data it has never touched. An agent reads row 4,217, runs the cleaning script, sees the traceback, and fixes it — the loop this whole course is built on: prompt → act → observe → fix.

Two tools teach this course, and they share that loop while differing in temperament. One is built around a local session and small composable pieces you assemble yourself; the other leans toward delegating whole tasks, with first-class cloud runs. The concepts transfer; only the dialect changes — which is exactly why you learn both. And you learn the honest limits up front, because they are real: agents misread schemas, invent joins that look right, and will happily optimize a metric into meaninglessness if you let them. Unit C answers the first with enforcement; Unit D answers the last with adversarial review. Today is none of that. Today you watch the bare loop work, so you know what you are later making trustworthy.

In the research pipeline this is the very first stage — data cleaning and first contact — and the lab role it absorbs is no single person. It is the wall itself: the unstaffed midnight hours between a raw file and a defensible first plot.

Contrary winds

Not for: a number you can get from one line of SQL you already know — opening an agent to compute a single mean is ceremony, not leverage.

Mechanics

Today is deliberately configuration-free. You install one of the two tools, authenticate, and point it at a mangled file. No instruction files, no settings, no skills — those arrive across Units B and C. The bare loop first.

What these tools are

Both tools are the same animal: a model with hands. You type a request in plain language; the model plans, calls a tool (read a file, run a command, write a patch), reads what came back, and decides what to do next — looping until the work is done or it needs you. Two controls matter on day one and are common to both tools, under different names:

An approval prompt / approval mode — the agent pauses before a consequential action (writing a file, installing a package, running a destructive command) and waits for your y/n. This is your hand on the tiller. Deny anything you do not understand.
A model and reasoning-effort setting — which model drives, and how hard it thinks. The default is fine today; A2 makes this a daily habit.

Before you run anything, watch one full turn of the loop in slow motion. This is the “what just happened” view of everything you are about to see scroll past — the sentence becoming a tool call becoming an observation becoming the next decision.

System Player film — One Agent Turn

It starts as a sentence, not a script. You describe the artifact you want — clean the February file and plot the fare distribution — and press enter. That sentence is the whole program.

That single turn, repeated until the work is done, is the entire mechanism. Everything else in this course makes that turn safer, cheaper, or more trustworthy.
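The turn described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how either tool is implemented: `toy_model` is a hard-coded stand-in for the language model, the three tools are fakes that return strings, and every name here (`ToolCall`, `make_tools`, `run_loop`) is hypothetical. What it shows is the shape of the loop: decide, act, observe, decide again, with a pause before anything consequential.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str                     # which tool the model picked
    arg: str                      # its argument: a path or a command
    consequential: bool = False   # writes, installs, destructive commands


def toy_model(observation):
    """Stand-in for the language model: picks the next call from the last observation."""
    if observation is None:
        return ToolCall("read", "data/messy.csv")
    if observation.startswith("head:"):
        return ToolCall("write", "clean_taxi.py", consequential=True)
    if observation.startswith("wrote"):
        return ToolCall("run", "python clean_taxi.py")
    if "Traceback" in observation:
        return ToolCall("write", "clean_taxi.py", consequential=True)  # the fix
    return None  # nothing left to do


def make_tools():
    """Fake tools; the first run fails so the loop has something to fix."""
    runs = {"n": 0}

    def run(cmd):
        runs["n"] += 1
        return "Traceback: ValueError" if runs["n"] == 1 else "ok: counts written"

    return {
        "read": lambda path: f"head: first rows of {path}",
        "write": lambda path: f"wrote {path}",
        "run": run,
    }


def run_loop(model, tools, approve, max_turns=10):
    """Act, observe, decide again, until the model stops or the turn cap is hit."""
    transcript, observation = [], None
    for _ in range(max_turns):
        call = model(observation)
        if call is None:
            break
        if call.consequential and not approve(call):
            observation = f"denied: {call.name} {call.arg}"
        else:
            observation = tools[call.name](call.arg)
        transcript.append(observation)
    return transcript


approved = run_loop(toy_model, make_tools(), approve=lambda call: True)
denied = run_loop(toy_model, make_tools(), approve=lambda call: False)
```

With every write approved, the transcript runs read, write, failed run, rewrite, clean run. With every write denied, the loop stops after the first refusal, which is the approval prompt doing its job.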

Install and authenticate

Pick the tool you will follow the course in — you can install the other later; the concepts are identical. Install, authenticate, and read the opening banner, because the banner tells you the two things that matter: that you are running unconfigured, and what the agent may do without asking.

Claude Code

npm install -g @anthropic-ai/claude-code
cd scratch/day-one          # a throwaway folder holding only data/messy.csv
claude                      # opens a session; first run walks you through login

The first launch sends you to the browser to authenticate, then drops you at a prompt inside the current directory. The banner notes there is no CLAUDE.md here — no project instructions — so the agent is running on defaults. That is the point of day one. When the agent later wants to write a file, you will see an approval prompt like Apply edit to clean_taxi.py? (y/n); that pause is where you stay in control. claude --resume brings a past session back; you will not need it today.

Codex

npm install -g @openai/codex
cd scratch/day-one          # a throwaway folder holding only data/messy.csv
codex                       # opens a session; first run walks you through sign-in

The first launch authenticates you (browser sign-in or an API key), then opens a session in the current directory. The banner reports its sandbox and approval mode — typically sandbox: workspace-write · approvals: on-request. Read that pair literally: the sandbox bounds where the agent may act (this directory tree, not your whole machine), and the approval mode sets when it consults you (on consequential actions, like writing clean_taxi.py). Two independent dials; B3 turns both into real safety profiles. codex exec resume continues a run headlessly; not needed today.
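The two dials are easier to keep apart once you see them as two separate checks. The sketch below is an illustration only, not Codex's implementation: the function names, the action labels, and the three-way verdict are all assumptions made for the example. The sandbox check asks where the action lands; the approval check asks whether to pause.

```python
from pathlib import Path

# Hypothetical action labels that count as consequential in this sketch.
CONSEQUENTIAL = {"write", "install", "delete"}


def inside_workspace(target: str, workspace: str) -> bool:
    """Sandbox dial: does the target resolve to a path inside the workspace tree?"""
    root = Path(workspace).resolve()
    path = (root / target).resolve()   # collapses '..' before comparing
    return path == root or root in path.parents


def decide(action: str, target: str, workspace: str) -> str:
    """Combine the two independent checks into one verdict."""
    if not inside_workspace(target, workspace):
        return "blocked"   # sandbox: outside the tree, no prompt can allow it
    if action in CONSEQUENTIAL:
        return "ask"       # approval: pause and wait for the human
    return "allow"
```

For example, `decide("write", "clean_taxi.py", "/tmp/day-one")` returns `"ask"`, while `decide("write", "../elsewhere/notes.txt", "/tmp/day-one")` returns `"blocked"` without ever reaching the approval question.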

The dialect differs — login flow, banner wording, the resume command — but the session you are now sitting in front of is the same loop in both.

The quick win

Here is the file the agent is about to meet. These are not invented horrors; every row is verbatim from the raw 2024 yellow-cab files — the wall from the Pain vignette, made concrete.

What the meter actually wrote down — Seven verbatim rows from the raw 2024-02/2024-03 yellow files: one clean trip for reference, then a 2002 timestamp, a −$800 fare, a 0-second trip, a 176,836-mile odometer reading, a NULL passenger count, and a trip spanning the nonexistent 02:00 DST hour.

Now point the agent at it. The discipline, which A2 will name formally, is already visible in the prompt: you point at the file, you do not paste its rows; and you demand artifacts — a written summary and a plot on disk, not a verdict in the scrollback.

Claude Code

> Clean data/messy.csv. Count every problem you find, drop rows by
  documented rules, and write a cleaned summary to cleaned_summary.md
  plus one plot of trips by hour to plot.png.

The agent reads the head of the file first (a few hundred tokens reveal the delimiters, the dtypes, and the first specimens of trouble), proposes a small cleaning script, and pauses for your approval before writing it. You approve; it runs; it reports counts per rule. The interactive run below is that turn by turn — drive it yourself.

Codex

> Clean data/messy.csv. Count every problem you find, drop rows by
  documented rules, and write a cleaned summary to cleaned_summary.md
  plus one plot of trips by hour to plot.png.

The agent reads the head of the file first (a few hundred tokens reveal the delimiters, the dtypes, and the first specimens of trouble), proposes a small cleaning script, and pauses on-request before writing it. You approve; it runs; it reports counts per rule. The prompt is identical to the Claude Code tab — same destination, same artifacts; only the approval surface around the write differs.
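For a sense of what "drop rows by documented rules, with counts" can look like, here is a minimal sketch of such a script. It is not the script either agent writes. The column names follow the public TLC yellow-taxi schema and are an assumption about data/messy.csv; the 500-mile threshold, the choice to drop NULL passenger counts, and the rule names are illustrative; the six rows are toy values modeled on the problems described above, not the verbatim rows.

```python
from datetime import datetime

# Ordered rules: a row is attributed to the FIRST rule it fails.
# Thresholds and names are assumptions for this sketch.
RULES = [
    ("misdated", lambda r: r["pickup"].year != 2024),
    ("negative_fare", lambda r: r["fare_amount"] < 0),
    ("zero_duration", lambda r: r["dropoff"] <= r["pickup"]),
    ("implausible_distance", lambda r: r["trip_distance"] > 500),
    ("null_passenger_count", lambda r: r["passenger_count"] is None),
]


def trip(pickup, dropoff, fare, miles, passengers=1):
    """Build one toy row with parsed timestamps."""
    return {
        "pickup": datetime.fromisoformat(pickup),
        "dropoff": datetime.fromisoformat(dropoff),
        "fare_amount": fare,
        "trip_distance": miles,
        "passenger_count": passengers,
    }


def clean(rows):
    """Return (kept_rows, counts_per_rule) so every removal is on the record."""
    counts = {name: 0 for name, _ in RULES}
    kept = []
    for row in rows:
        for name, is_bad in RULES:
            if is_bad(row):
                counts[name] += 1
                break
        else:
            kept.append(row)   # no rule fired
    return kept, counts


ROWS = [
    trip("2024-03-01 08:00:00", "2024-03-01 08:15:00", 14.5, 2.1),
    trip("2002-12-31 23:00:00", "2002-12-31 23:10:00", 9.0, 1.0),
    trip("2024-03-02 10:00:00", "2024-03-02 10:05:00", -800.0, 0.0),
    trip("2024-03-03 12:00:00", "2024-03-03 12:00:00", 5.0, 0.5),
    trip("2024-03-04 09:00:00", "2024-03-04 09:13:00", 20.0, 176836.0),
    trip("2024-03-05 18:00:00", "2024-03-05 18:20:00", 18.0, 3.0, None),
]

kept, counts = clean(ROWS)
```

Because each row is charged to the first rule it fails, the counts sum to the number of rows removed, which is what makes the summary auditable.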

What you get back is the same file, cleaned and counted — every removal named, the worst offenders quoted, the survivors plotted:

The same month, after the documented cascade — 2024-03 cleaning ledger (3,582,628 → 3,521,703 rows, every removal counted) beside the cleaned Manhattan demand curve by hour of day (24 hourly rows, 3,148,474 trips in total).

Stage	Rows remaining
raw	3,582,628
s1	3,582,605
s2	3,524,141
s3	3,523,019
s4	3,521,703

The receipts matter more than the plot. The agent did not silently delete the −$800 fare; it reported “82 negative fares (worst: −$800.00 on a 0.00-mile trip)” and left a script you can read and defend. That is the difference between an edit and a finding.
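Reading the ledger is simple arithmetic, and it is worth doing once by hand or in a few lines. The sketch below takes the five row counts from the 2024-03 ledger and computes how many rows each stage removed. The stage labels s1 to s4 are copied as given; which rule each stage applies is not stated here, so the code does not guess.

```python
# Rows remaining after each stage, as reported in the 2024-03 ledger.
LEDGER = [
    ("raw", 3_582_628),
    ("s1", 3_582_605),
    ("s2", 3_524_141),
    ("s3", 3_523_019),
    ("s4", 3_521_703),
]


def removals(ledger):
    """Rows dropped at each stage: previous count minus current count."""
    return [
        (name, before - after)
        for (_, before), (name, after) in zip(ledger, ledger[1:])
    ]


dropped = removals(LEDGER)
total_dropped = sum(n for _, n in dropped)
share_dropped = total_dropped / LEDGER[0][1]

for name, n in dropped:
    print(f"{name}: {n:,} rows removed")
print(f"total: {total_dropped:,} of {LEDGER[0][1]:,} ({share_dropped:.2%})")
```

The stage removals are 23, 58,464, 1,122, and 1,316 rows, which sum to 60,925, about 1.70% of the raw file. If the per-stage numbers in a summary do not add up to the difference between the first and last row counts, some removal went uncounted.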

Ask in Python, then in R

The lab is bilingual, and the agent does not care which language it works in. Re-ask for the same task in the other language and watch the verdicts come back identical — the same counts, rule for rule, with dplyr filtering where pandas masked.

Python

> Now do the same cleaning task again, this time in Python — same drop
  rules, same counted summary.

R

> Now do the same cleaning task again, this time in R — same drop rules,
  same counted summary.

This is the language policy, stated once for the whole course: your statistics live in Python or R, and the agentic skills you are learning transfer untouched between them. The R toggle in this site’s header works the same way — flip it and the statistical code rewrites; the lesson does not. This course teaches the tools, not the languages.
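"Verdicts identical" is a claim you can check mechanically instead of by eye. A minimal sketch, assuming you have copied the per-rule counts from each pass into a dictionary: the helper name `diff_counts` and the rule keys are hypothetical, the counts 82 and 98 are the ones quoted in this lesson, and the 97 in the second comparison is an invented mismatch for illustration.

```python
def diff_counts(python_counts, r_counts):
    """Return {rule: (python_value, r_value)} for every rule where the passes disagree."""
    rules = sorted(set(python_counts) | set(r_counts))
    return {
        rule: (python_counts.get(rule), r_counts.get(rule))
        for rule in rules
        if python_counts.get(rule) != r_counts.get(rule)
    }


python_pass = {"negative_fare": 82, "zero_distance_paid": 98}
r_pass_same = {"negative_fare": 82, "zero_distance_paid": 98}
r_pass_off = {"negative_fare": 82, "zero_distance_paid": 97}   # invented mismatch

agree = diff_counts(python_pass, r_pass_same)
disagree = diff_counts(python_pass, r_pass_off)
```

An empty result means the two passes agree rule for rule. A rule that appears in only one pass shows up with `None` on the other side, which is usually a sign the two scripts did not implement the same rule set.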

The research project

Everything from here builds one project: Weather and the Demand for Urban Mobility. The question is plain — when the weather turns, who still rides? — and the answer is a report. The data is twenty-four months of New York yellow and green taxi trips joined to weather; the deliverable is a reproducible estimate of how rain, snow, and heat move demand across the city’s zones. The messy file you just cleaned is one sample month of it.

Clone the starter kit and you have the project’s skeleton — the directory contract every later lesson assumes, an empty journal/ for the receipts you are about to start keeping, and a Makefile whose make check-a1 … check-f1 targets are the milestones you will tick off one unit at a time. The one command python3 get_data.py then fetches the fixed course slice into ./data/: the 2024 yellow-taxi parquet months, the zone lookup, and the NYC hourly weather every later lesson is built on.

git clone https://github.com/junwei-lu/agentic-datascience-course-kit.git
cd agentic-datascience-course-kit && python3 get_data.py

No clone needed? The same slice is available zero-install via DuckDB over HTTP, or by curling just the script; see Get the data for every path.

Guided Run — The Ten-Minute Quick Win

Interactive terminal simulation (Field Terminal, session a1-quick-win, Claude Code). It needs JavaScript; the run it replays is described in the lesson text above.

Field Assignment

Artifact: quick-win transcripts saved to journal/; starter repo cloned

Get hired. By the end you have both the muscle memory of one full loop and the project that the rest of the course advances.

Claude Code

Install Claude Code, authenticate, and launch claude in a scratch folder holding only data/messy.csv. Confirm the banner reports no CLAUDE.md — you are running the bare loop on purpose.
Run the quick win: clean data/messy.csv into cleaned_summary.md and plot.png, approving the cleaning script when prompted. Read the counts per rule before you accept them.
Re-ask for the same task in the other language (Python ↔ R) and confirm the verdicts match.
Ask the agent to save this run’s summary to journal/quick-win.md.
Clone the starter repo for Weather and the Demand for Urban Mobility.

Codex

Install Codex, authenticate, and launch codex in a scratch folder holding only data/messy.csv. Read the banner’s sandbox + approval line — you are running the bare loop on purpose.
Run the quick win: clean data/messy.csv into cleaned_summary.md and plot.png, approving the cleaning patch on-request. Read the counts per rule before you accept them.
Re-ask for the same task in the other language (Python ↔ R) and confirm the verdicts match.
Ask the agent to save this run’s summary to journal/quick-win.md.
Clone the starter repo for Weather and the Demand for Urban Mobility.

The artifact is the saved transcript and the cloned repo. It feeds A2, where you stop watching the loop and start directing it — and where the journal/ you just opened becomes a standing discipline.

Milestone gate · make check-a1 · advances A1

One tool installed and authenticated; the bare session launches with no CLAUDE.md / no project config
Day one is the zero-configuration 'before' picture on purpose.
The quick win ran: data/messy.csv → cleaned_summary.md + plot.png, with counts reported per drop rule
Read the receipts — negative fares, zero-distance paid trips, NULL passenger counts — before accepting.
The same task re-run in the other language (Python ↔ R), verdicts confirmed identical
journal/quick-win.md saved — the transcript of what was asked, dropped, and produced
The starter repo for Weather and the Demand for Urban Mobility cloned

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
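Self-certifying is easier with a small check for the artifacts the gate names. The sketch below is not what make check-a1 runs in the course kit; it is a hypothetical helper that assumes the three files sit under one root folder, which may not match your layout. It reports any artifact that is missing or empty.

```python
from pathlib import Path

# Artifacts named in the A1 gate; their location under one root is an assumption.
REQUIRED = ["cleaned_summary.md", "plot.png", "journal/quick-win.md"]


def missing_artifacts(root="."):
    """List required files that are absent or zero bytes under root."""
    base = Path(root)
    return [
        name
        for name in REQUIRED
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    gaps = missing_artifacts(".")
    print("all artifacts present" if not gaps else f"missing or empty: {gaps}")
```

A file existing is the weakest possible evidence; the check only tells you what is still absent. Whether the counts inside cleaned_summary.md are right remains your call.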

Pitfalls & Gotchas

[both]

Accepting the cleaned file without reading the counts. The whole value of the quick win is the receipt — “82 negative fares, 98 zero-distance paid trips” — not the tidy output. A cleaning you cannot describe rule-by-rule is a cleaning you cannot defend in a methods section, and the agent will produce a confident, plausible, undocumented one if you let it.
[both]

Pasting rows of the CSV into the prompt instead of pointing at the file. It burns context, it loses provenance, and it caps the agent at whatever you happened to copy. Point at the path; let the agent read row 4,217 itself.
[both]

Treating day one as proof the agent is trustworthy. It is not — it is proof the loop works. Agents misread schemas and invent joins; you watched a clean run, not a guaranteed one. The trust is built across the next five units, not asserted here.
[both]

Clicking through approval prompts without reading them. The pause before a write is the only place you stay in control on day one. An approval you grant reflexively is a file you did not actually authorize — and on a real project that is how data/raw/ gets edited “just this once.”

Check Your Bearings

A1 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

Field journal

Record the quick win: which problems the agent found in the mangled file, the count it reported for each, and which language each pass used.

Parity note · as of June 2026

Day one is genuine parity. Both tools install with one command, authenticate through the browser, open a session inside your working directory, and run the same prompt → act → observe → fix loop against the same mangled file to the same cleaned, counted result. The differences are dialect: the login flow, the banner wording, the resume command, and how the pre-write pause is framed — an approval prompt on one side, an approval mode layered over an OS sandbox on the other. Those surfaces diverge more as the course goes on; the loop underneath does not.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading
agent: half a day to install and test the 9-line block; ~20 s per run thereafter

the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do
agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes
agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will
agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches
agent: ~10 min to write the check or the objective; the night itself belongs to the machine

the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will
agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for
agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped
agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission
agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

the wall — the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work
agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

you, working an order of magnitude faster — but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong
agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

the lab manual nobody writes — the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down
agent: written once in an hour; reloaded free at the start of every session thereafter

the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots
agent: an afternoon — most of it download wall-clock, not attention

the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time
agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over
agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

the whole lab, orchestrated — the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits
agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.
