Get the data
Every lesson analyses one fixed slice of NYC yellow-taxi trips + NYC weather (~170 MB, all public, no API keys). Get it three ways — pick whichever suits you.
1. One command (recommended)

Python 3.9+, nothing to pip install. Clone the starter kit and run it. The script downloads the slice into ./data/ and verifies every file against the pinned course checksums, so you have byte-for-byte the data the figures were drawn from. Safe to re-run.

    git clone https://github.com/junwei-lu/agentic-datascience-course-kit.git
    cd agentic-datascience-course-kit
    python3 get_data.py

Don’t want to clone? Grab just the script and run it anywhere:

    curl -O https://junwei-lu.github.io/agentic-datascience-course/get_data.py
    python3 get_data.py

The --check option verifies an existing copy.

2. No download: query it in place

If you have DuckDB, read the Parquet straight from its public URL. DuckDB range-reads, pulling only the columns and rows it needs, which is perfect for the read-only exploration lessons (and exactly the C3 warehouse idea).

    duckdb -c "SELECT count(*) FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-03.parquet'"

3. Let the agent do it

The course's own subject. The starter kit ships a CLAUDE.md/AGENTS.md that already knows this dataset. Open the folder in Claude Code or Codex and ask: "Set up the course data, then verify it's intact."
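The checksum verification described in option 1 can be sketched roughly as follows. This is a minimal illustration, not the kit's actual get_data.py: the use of SHA-256, the `verify` helper, and the shape of the `expected` manifest (file name mapped to hex digest) are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large Parquet files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(data_dir, expected):
    """Return the names of files that are missing or whose digest differs.

    `expected` is a hypothetical manifest: {file name: pinned hex digest}.
    An empty result means every listed file matched.
    """
    bad = []
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

Calling `verify("./data", manifest)` with a manifest of pinned digests would return an empty list when the local copy is intact, which is also why re-running such a check is harmless.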
What lands in ./data/
| File | What it is | Used by |
|---|---|---|
| yellow_tripdata_2024-02.parquet | trips · Feb 2024 | Unit A cleaning · Unit C drift pair |
| yellow_tripdata_2024-03.parquet | trips · Mar 2024 | Unit C drift pair · the zone-hour panel |
| yellow_tripdata_2024-06.parquet | trips · Jun 2024 | a clean summer month (Units D–F) |
| taxi_zone_lookup.csv | LocationID → zone/borough | the C3 warehouse join |
| taxi_zones.zip | zone geometry (shapefile) | the Atlas of Agents map |
| weather_hourly_nyc.json | Open-Meteo hourly weather | the demand model |
The three trip files total ~10.1 M raw trip rows. One detail is a real lesson, not a trap: the airport_fee column
is lower-case in February but Airport_fee from March on, so any query spanning months must
normalise the casing. Unit C walks through catching it.
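One way to handle the casing difference is to map every column name to a lower-case canonical form before combining months. The sketch below is an illustration under stated assumptions, not the approach Unit C teaches: `normalise_columns` is a hypothetical helper, and the column lists are abbreviated stand-ins for the real schemas.

```python
def normalise_columns(columns):
    """Map each original column name to a lower-case canonical name.

    Raises ValueError if two columns collapse to the same canonical name,
    since silently merging them would hide a real schema problem.
    """
    mapping = {}
    seen = set()
    for name in columns:
        canonical = name.lower()
        if canonical in seen:
            raise ValueError(f"columns collide after lower-casing: {canonical!r}")
        seen.add(canonical)
        mapping[name] = canonical
    return mapping


# Abbreviated, illustrative column lists (assumption: the real files have many more columns).
feb_cols = ["VendorID", "fare_amount", "airport_fee"]
mar_cols = ["VendorID", "fare_amount", "Airport_fee"]

feb_map = normalise_columns(feb_cols)
mar_map = normalise_columns(mar_cols)

# After renaming, both months expose the same set of column names.
same_schema = set(feb_map.values()) == set(mar_map.values())
```

Lower-casing everything also renames columns such as VendorID, which is a deliberate simplification here; a narrower fix would rename only the airport-fee column.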
Sources: NYC TLC Trip Record Data (public, redistributable); weather by Open-Meteo.com (CC BY 4.0). The kit is part of the course materials.