Get the data

Every lesson analyses one fixed slice of NYC yellow-taxi trips + NYC weather (~170 MB, all public, no API keys). Get it three ways — pick whichever suits you.

  1. One command (recommended)

    Requires Python 3.9+ and nothing to pip install. Clone the starter kit and run the script: it downloads the slice into ./data/ and verifies every file against the pinned course checksums, so you have byte-for-byte the data the figures were drawn from. The script is safe to re-run.

    git clone https://github.com/junwei-lu/agentic-datascience-course-kit.git
    cd agentic-datascience-course-kit
    python3 get_data.py

    Don’t want to clone? Grab just the script and run it anywhere:

    curl -O https://junwei-lu.github.io/agentic-datascience-course/get_data.py
    python3 get_data.py

    The --check flag verifies an existing copy.

  2. No download: query it in place

    If you have DuckDB, read the Parquet straight from its public URL. DuckDB range-reads, pulling only the columns and rows it needs — perfect for the read-only exploration lessons (and exactly the C3 warehouse idea).

    duckdb -c "SELECT count(*) FROM 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-03.parquet'"

  3. Let the agent do it

    The course's own subject. The starter kit ships a CLAUDE.md / AGENTS.md that already knows this dataset. Open the folder in Claude Code or Codex and ask:

    Set up the course data, then verify it's intact.
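
The in-place query from option 2 can also be driven from Python. The sketch below assumes the duckdb Python package is installed and that your DuckDB build can read over HTTPS (recent versions usually load the httpfs extension on demand). The helper names build_count_query and count_rows are hypothetical, not part of the starter kit.

```python
MARCH_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_2024-03.parquet"
)


def build_count_query(url):
    """Return a row-count query that reads the Parquet file at `url` in place."""
    # Double any single quotes so the URL stays a valid SQL string literal.
    safe_url = url.replace("'", "''")
    return f"SELECT count(*) FROM '{safe_url}'"


def count_rows(url):
    """Run the count remotely; needs network access and the duckdb package."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()  # in-memory database, nothing is written to disk
    try:
        return con.execute(build_count_query(url)).fetchone()[0]
    finally:
        con.close()
```

Calling count_rows(MARCH_URL) should return the same number as the duckdb -c one-liner in option 2.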

What lands in ./data/

File                             What it is                 Used by
yellow_tripdata_2024-02.parquet  trips · Feb 2024           Unit A cleaning · Unit C drift pair
yellow_tripdata_2024-03.parquet  trips · Mar 2024           Unit C drift pair · the zone-hour panel
yellow_tripdata_2024-06.parquet  trips · Jun 2024           a clean summer month (Units D–F)
taxi_zone_lookup.csv             LocationID → zone/borough  the C3 warehouse join
taxi_zones.zip                   zone geometry (shapefile)  the Atlas of Agents map
weather_hourly_nyc.json          Open-Meteo hourly weather  the demand model
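
For a sense of what verifying files against pinned checksums involves, here is a minimal sketch. It assumes SHA-256 digests, which this page does not state, and the expected mapping is a placeholder you would fill in yourself; the real pinned values live in the starter kit. The function names are hypothetical and this is not the kit's own code.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large Parquet files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(data_dir, expected):
    """Return {filename: 'ok' | 'missing' | 'mismatch'} for each expected file.

    `expected` maps file names to hex digests (placeholder values here;
    SHA-256 is an assumption about the algorithm).
    """
    results = {}
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) == want.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

In practice, python3 get_data.py --check is the supported way to confirm a copy is intact; the sketch only illustrates the idea.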

~10.1 M raw trip rows. One detail is a real lesson, not a trap: the airport_fee column is lower-case in February but Airport_fee from March on — any query spanning months must normalise the casing. Unit C walks through catching it.
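
One way to handle the casing difference is to rename any case variant of a known column to a single canonical spelling before combining months. The sketch below is an illustration only: the helper name casing_fixes is hypothetical, and choosing the lower-case airport_fee as canonical is an assumption, not necessarily what Unit C does.

```python
def casing_fixes(columns, canonical=("airport_fee",)):
    """Map case variants of the canonical names to their canonical spelling.

    Columns that already match, or that are not listed in `canonical`,
    are left out of the mapping and therefore stay untouched.
    """
    by_lower = {name.lower(): name for name in canonical}
    fixes = {}
    for col in columns:
        target = by_lower.get(col.lower())
        if target is not None and col != target:
            fixes[col] = target
    return fixes


# March-style header: only the mis-cased column is renamed.
march_columns = ["VendorID", "fare_amount", "Airport_fee"]
print(casing_fixes(march_columns))
```

With pandas, the resulting mapping can be passed to DataFrame.rename(columns=...) for each month before concatenating.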

Sources: NYC TLC Trip Record Data (public, redistributable); weather by Open-Meteo.com (CC BY 4.0). The kit is part of the course materials.