The Pain
The research question has been answerable in principle since week two: does weather move demand for taxis? In practice it has been answerable by nobody, because the answer lives in five systems that do not speak to each other. The trips are a heap of monthly Parquet files. The weather is an API that returns hourly JSON if you ask it in exactly the right dialect of date range and forget a column you needed. The events are a city open-data portal with a query language someone designed in 2013. The holidays are a library. The zones are a lookup table and, when you want a map, a shapefile.
A lab would hand this to the data engineer — the person who knows that the weather archive caps at a date range, that the events portal paginates at a thousand rows, that the timestamps come back in three different timezones and one of them is lying. She would build the seams once, cache every raw response so the joins reproduce in February when the API has quietly changed its mind, and hand you a single table you could actually regress on. Then she would go on sabbatical, and the seams would rot, and the next person would rebuild them slightly wrong.
You have spent two days writing glue. The glue works until it doesn’t — a rate limit here, a renamed field there — and every time it breaks, the analysis stops while you re-learn a system you touch twice a year. The data is all there. It is simply not anywhere you can use it.
Why / When
Research data lives in systems — databases, APIs, the open web — and the obstacle is rarely the analysis; it is the reach. MCP (the Model Context Protocol) is a standard socket between an agent and those systems: you register a server once, and its operations appear to the agent as native tools, the way a built-in file-read does. A DuckDB server means the agent queries the warehouse directly instead of shelling out and parsing text; a database server means credentials are held by the connection, not pasted into a prompt.
The honest boundary matters as much as the capability. The agent can
already run a CLI through the shell — B2 had it download 24 months with
plain curl and query DuckDB at a shell prompt. MCP earns its keep
exactly where the shell strains: persistent connections the agent
returns to across many calls without re-establishing state,
credentialed systems where you want the secret in the connection and
not the transcript, and non-CLI services that have no command-line at
all. Where a single shell line already reaches the system, a protocol
around it is ceremony.
In the pipeline, this is the data-acquisition and integration stage, and the role it absorbs is the data engineer: the seams built once, the raw responses cached so the joins reproduce, the warehouse and the analysis panel that turn five systems into one table you can regress on.
Contrary winds
Not for: a system you already reach in one shell line — the agent can run `duckdb` or `curl` directly (B2's lesson); wiring a protocol around a CLI you already have is plumbing nobody asked for.
Mechanics
The shared model first, then registration in each dialect, then the three things the project actually builds: the warehouse, the enrichment joins, and the panel.
The MCP model, and the honest sidebar
An MCP server is a process that exposes a set of operations — query this database, fetch this URL, screenshot this page — over a uniform protocol. The agent discovers them at startup and calls them like any other tool; you register the server once and it is there every session. That is the whole idea, and the discipline is knowing when not to reach for it:
| Reach for MCP when… | Stay with the shell when… |
|---|---|
| the connection is persistent — many calls, shared state | one call, one CLI line does it |
| the system is credentialed — keep the secret in the connection | the data is open, no auth |
| the service is non-CLI — a browser, a SaaS API, a live DB | a curl or duckdb invocation already reaches it |
DuckDB sits right on the line, and it is instructive that it does. The
agent can run duckdb warehouse.duckdb -c "…" at a shell prompt all day.
A DuckDB MCP server earns its place only because the warehouse is a
persistent connection the agent returns to dozens of times in a
session — keeping one handle open beats re-opening the database on every
query and re-parsing text output each time. If your interaction is one
query, the shell is the right tool and the protocol is overhead.
Registering a server
The server you run is identical; the registration surface differs.
Claude Code
Register a server in .mcp.json at the repo root — committed, so it
arrives with the clone — or add it from the command line with
claude mcp add:
{ "mcpServers": { "duckdb": { "command": "uvx", "args": ["mcp-server-motherduck", "--db-path", "warehouse.duckdb"] } }}A project-scoped server registers per project, which is the discipline
that matters: a server added globally loads its tools into every session
you ever run, paying context and startup cost on projects that never
touch a warehouse. Register the DuckDB server in this repo’s .mcp.json,
not your user config, and the next project starts clean.
Codex
Register a server under [mcp_servers] in config.toml — the
project-layer team config, committed with the repo — or add it from the
command line with codex mcp add:
[mcp_servers.duckdb]command = "uvx"args = ["mcp-server-motherduck", "--db-path", "warehouse.duckdb"]Keep the registration in the project-layer config rather than your
personal ~/.codex/config.toml: a server declared globally loads its
tools into every session, paying context and startup cost on projects
that never open a warehouse. Project-scoped registration keeps the next
session you start somewhere else clean.
The warehouse
With the DuckDB server registered, the agent builds the warehouse by
querying the connection directly. The first move is reconnaissance, not
construction: have it explore INFORMATION_SCHEMA so it learns the real
schema from the files instead of assuming one — the B2 plan-first habit,
applied to a live system. Then it materializes the raw table straight
from the TLC CDN with read_parquet — the same public slice the kit
mirrors, which you can also read in place over DuckDB-over-HTTP
(Get the data) — and registers C2’s cleaned table
as a view, so the contracted cleaning SOP is the only path into analysis:
-- 1. Learn the schema from the source, don't assume it.DESCRIBE SELECT * FROM read_parquet('data/raw/yellow_2024-03.parquet');
-- 2. Materialize 24 months; union_by_name absorbs the schema drift C2 maps.CREATE TABLE trips_raw AS SELECT * FROM read_parquet('data/raw/yellow_*.parquet', union_by_name = true);
-- 3. Register C2's cleaning SOP as the only path into analysis.CREATE VIEW trips_clean AS SELECT * FROM standardize(trips_raw);union_by_name = true is the quiet hero: it aligns columns across 24
files that do not all agree on order or presence — a second layer under
C2’s rename map against exactly the casing drift that cost eleven days in
that lesson. Belt, and suspenders.
Enrichment joins, cached raw then derived
The trips are now one table; the question needs four more sources joined
to them. Two come from live systems — Open-Meteo for hourly weather,
the Socrata portal for permitted events — and the discipline for both
is the same and non-negotiable: cache the raw response, then derive.
Every API payload is checksummed into data/raw/api/ before any
transform touches it, because the API you query in June is not the API
that answers in February, and a join you cannot reproduce is a result you
cannot defend — these joins get interrogated by the referee in D4.
Python
import hashlib, json, pathlib, httpx
RAW = pathlib.Path("data/raw/api")
def fetch_weather(start: str, end: str) -> dict: # One call per archive window; the response is the artifact of record. r = httpx.get("https://archive-api.open-meteo.com/v1/archive", params={ "latitude": 40.71, "longitude": -74.01, "start_date": start, "end_date": end, "hourly": "temperature_2m,precipitation,snowfall,wind_speed_10m", "timezone": "America/New_York", # ask for local time explicitly }) r.raise_for_status() body = r.content # Checksum the raw bytes into data/raw/api/ BEFORE any parsing. digest = hashlib.sha256(body).hexdigest()[:16] (RAW / f"weather_{start}_{end}.{digest}.json").write_bytes(body) return json.loads(body) # derive only from what we just cachedThis block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).
R
library(httr2); library(jsonlite); library(digest)
fetch_weather <- function(start, end) { # One call per archive window; the response is the artifact of record. resp <- request("https://archive-api.open-meteo.com/v1/archive") |> req_url_query(latitude = 40.71, longitude = -74.01, start_date = start, end_date = end, hourly = "temperature_2m,precipitation,snowfall,wind_speed_10m", timezone = "America/New_York") |> # ask for local time explicitly req_perform() body <- resp_body_raw(resp) # Checksum the raw bytes into data/raw/api/ BEFORE any parsing. digest16 <- substr(digest(body, algo = "sha256", serialize = FALSE), 1, 16) writeBin(body, file.path("data/raw/api", sprintf("weather_%s_%s.%s.json", start, end, digest16))) fromJSON(rawToChar(body)) # derive only from what we just cached}The events join carries one more trap worth naming: Socrata paginates,
and the holidays come from a library, so weather_hourly, events,
holidays, and zones each land as a cached-raw-then-derived table
before any of them touches the panel. The rule is uniform — nothing
enters the warehouse that was not first written to data/raw/api/ under
its checksum.
Building the analysis panel, and the DST that everyone gets wrong
The panel is a zone × hour grid: one row per zone per hour across the whole window, demand attached, weather and events and holidays joined on. Two properties make it honest, and both are easy to get wrong.
First, zero-demand cells are real data. An hour in which a zone saw
no pickups is a 0, not a missing row — and a 3 a.m. outer-borough zone
is zero far more often than not. Build the grid as a CROSS JOIN of
zones and an hour spine, LEFT JOIN the demand onto it, and coalesce
the nulls to zero; an inner join would silently drop exactly the
low-demand cells a demand model most needs to see.
Second — DST, and not the trap you were warned about. The folklore says spring-forward creates phantom timestamps. It does not, at least not here: the TLC meters got the 2024-03-10 spring-forward right, and zero trips are stamped inside the 02:00 hour that the local clock skips. The real trap is two-headed:
-
The hour spine. America/New_York has 743 local hours in March 2024, not 744 — the 02:00 hour does not exist. A naive spine of
31 × 24invents a phantom hour no trip can ever fill, and your panel carries a row of structural zeros that are not zeros, they are nothing. Build the spine in local time and it self-corrects: 743 hours, matching the data. -
Gap-spanning durations. This is the one that bites the duration workstream. A trip from 01:03 to 03:01 on spring-forward night reads 118 clock-minutes by naive subtraction — but only 58 minutes actually elapsed; the 02:00 hour never happened. Compute durations from wall-clock fields and every gap-spanning trip is overstated by exactly sixty minutes. Compute them in UTC (or from epoch differences) and they are true. The panel’s spine lives in local time — hour-of-day effects are the research question — but durations live in UTC. Each timezone job gets its own correct answer; using one answer for both is the silent bug.
the numbers behind this figure
dst_window 35 rows
SELECT ts_local, ts_utc, sum(pickups) AS pickups
FROM panel_zone_hour
WHERE ts_local >= TIMESTAMP '2024-03-09 12:00'
AND ts_local <= TIMESTAMP '2024-03-10 23:00'
GROUP BY 1, 2 ORDER BY 1 phantom_hour_trips count = 0
SELECT count(*) FROM trips_raw WHERE tpep_pickup_datetime >= '2024-03-10 02:00' AND tpep_pickup_datetime < '2024-03-10 03:00' honesty note March has 743 panel hours, not 744. No trips were stamped inside the phantom hour — the meters got DST right even though the analyst often doesn't.
Watch the panel materialize, the snowstorm and all. The heatmap below is
one week of the real panel — Manhattan, hour × day — straight from
analysis_panel with the zero cells included. The trench cut through the
middle of the week is the February 13 snowstorm, which dropped 19 cm and
cut Manhattan demand by 45 percent. (Rain, counterintuitively, cuts the
other way: heavy-rain hours run about 62 percent above baseline — when
it pours, people who would have walked take a cab. The weather scatter in
Unit D makes that case in full.)
the numbers behind this figure
heat 168 rows
SELECT date_trunc('day', ts_local) AS day, hour(ts_local) AS hh,
sum(pickups) AS pickups
FROM panel_zone_hour
WHERE borough = 'Manhattan'
AND ts_local >= TIMESTAMP '2024-02-12'
AND ts_local < TIMESTAMP '2024-02-19'
GROUP BY 1, 2 ORDER BY 1, 2 snow_feb13_cm value = 19.11
Sources without an API — the in-app browser
Most of the project’s sources have an API or a CLI. The day one does not — a permit calendar that exists only as a rendered web page, a portal that hides its data behind a form — you need the agent to see a page, not fetch it. Here the tools genuinely diverge, so this is a spotlight, not a tab: one tool owns the capability and the other reaches it by a labeled workaround.
Codex Your tool
In-app browser / computer use
Codex can drive a real browser inside the session — load a page, read
what renders, fill a form, click through, and capture what it sees
(Appshots). For a source that publishes only as HTML — a borough permit
calendar with no JSON endpoint — the agent navigates it the way you
would, extracts the rows, and (the C3 discipline holds) writes the raw
captured page into data/raw/api/ under a checksum before deriving an
events supplement from it. The browsing is native: no separate server
to register, the page is just another surface the agent can act on.
Nearest equivalent — Claude Code
Claude Code reaches the same capability through MCP itself: register the
Playwright MCP server and the agent gains tools to open a page,
read the rendered DOM, screenshot it, and interact. It is the same job —
see a page that has no API, extract the rows, cache the raw capture under
a checksum, derive the events supplement — reached through a server you
register rather than a capability built into the session. The seam is the
registration: one more entry in .mcp.json, and the agent that builds
your warehouse can also read the web pages your warehouse needs.
Guided Run — Building the Warehouse
claudeField Assignment
Artifact make check-c3 passes
Build the warehouse, wire the enrichment, and materialize the analysis panel — all of it under the B3 pipeline profile and the C2 contracts, so every write that builds the panel passes the gate that protects it.
Claude Code
- Register the DuckDB server in
.mcp.json(project-scoped) and confirm its tools load at session start. - Have the agent explore
INFORMATION_SCHEMA, then buildwarehouse.duckdbfrom the TLC CDN viaread_parquetand register C2’strips_cleanas a view. - Build
weather_hourly,events,holidays, andzones— each cached-raw-then-derived, every API response checksummed intodata/raw/api/before any transform. - Build
analysis_panel: a complete zone × hour grid, zero-cells included viaLEFT JOIN+coalesce, with the DST handling above — a 743-hour local spine for March, durations computed in UTC. make check-c3.
Codex
- Register the DuckDB server under
[mcp_servers]inconfig.toml(project-scoped) and confirm its tools load at session start. - Have the agent explore
INFORMATION_SCHEMA, then buildwarehouse.duckdbfrom the TLC CDN viaread_parquetand register C2’strips_cleanas a view. - Build
weather_hourly,events,holidays, andzones— each cached-raw-then-derived, every API response checksummed intodata/raw/api/before any transform. - Build
analysis_panel: a complete zone × hour grid, zero-cells included viaLEFT JOIN+coalesce, with the DST handling above — a 743-hour local spine for March, durations computed in UTC. make check-c3.
make check-c3 validates schemas, zero-cell completeness, the DST hour
counts (743 for March, 745 for November), and that every enrichment table
has a cached raw response under its checksum. This panel is what D1–D4
estimate on — every elasticity, every event study, every robustness spec
reads from the table you build here, which is why its zero-cells and its
DST handling have to be right exactly once.
make check-c3advances C3The agent explores INFORMATION_SCHEMA first — it should learn the schema, not assume it.
Every API response checksummed into data/raw/api/ BEFORE any transform — cache-raw-then-derive.
An hour with no pickups is a real 0, not a missing row; left-join the grid, coalesce to 0.
March has 743 panel hours, not 744; a gap-spanning trip's clock minutes overstate its true minutes.
C3 builds on C2's gate — the panel is only worth plumbing because every write passed it.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
〜〜
Deriving from an API response without caching the raw payload first. The Open-Meteo archive and the Socrata portal both revise — silently, on their own schedule — and a join you built against last month’s response cannot be reproduced against this month’s. Checksum the raw bytes into
data/raw/api/before any parse; the cached payload is the artifact of record the referee will ask for in D4. - [both]
〜〜
Using one timezone answer for two timezone jobs. A local-hour spine is correct for hour-of-day demand and wrong for trip durations; a UTC duration is correct for elapsed time and wrong for “what hour was this.” The gap-spanning trip that reads 118 clock-minutes but elapsed 58 is the classic silent bug — the panel’s DST checks exist to catch exactly it.
- [both]
Building the panel with an inner join and losing the zeros. An hour a zone saw no pickups is a
0, and a demand model that never sees its zeros is estimating on a truncated sample without knowing it. Grid first, demand left-joined, nulls coalesced to zero — the completeness is the data, not a formatting nicety. - [both]
〜〜
MCP servers with write access to systems you care about. A server that can write your warehouse is a server that can corrupt it; scope its credentials per B3’s least-privilege profiles, and prefer a read-only connection for any server whose only job is to be queried. The protocol’s convenience is not a reason to hand it your keys.
- [CC]
A globally registered server piles its tools into every session. The DuckDB or Playwright server you add to your user config loads — and costs context and startup — on every project, including the ones with no warehouse and no web sources. Register servers per project in the repo’s
.mcp.json, and the next project starts clean.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)Field journal
Parity note
MCP is a genuine parity feature: both tools speak the same protocol and
register servers the same way in substance, differing only in surface —
.mcp.json / claude mcp add on one side, [mcp_servers] in
config.toml / codex mcp add on the other. The asymmetry is at the
edge of the lesson, not its center: seeing a web page that has no API is
native to Codex (its in-app browser and computer use) and reached in
Claude Code through MCP itself (the Playwright server) — same job, one
built in and one registered. Both converge quickly; the warehouse and
the enrichment joins, which are the milestone, are identical either way.