C3 intermediate ~60 min

Plugging Into Data Systems

Absorbs: the data engineer

Advances C3

The Pain

The research question has been answerable in principle since week two: does weather move demand for taxis? In practice it has been answerable by nobody, because the answer lives in five systems that do not speak to each other. The trips are a heap of monthly Parquet files. The weather is an API that returns hourly JSON if you ask it in exactly the right dialect of date range and forget a column you needed. The events are a city open-data portal with a query language someone designed in 2013. The holidays are a library. The zones are a lookup table and, when you want a map, a shapefile.

A lab would hand this to the data engineer — the person who knows that the weather archive caps at a date range, that the events portal paginates at a thousand rows, that the timestamps come back in three different timezones and one of them is lying. She would build the seams once, cache every raw response so the joins reproduce in February when the API has quietly changed its mind, and hand you a single table you could actually regress on. Then she would go on sabbatical, and the seams would rot, and the next person would rebuild them slightly wrong.

You have spent two days writing glue. The glue works until it doesn’t — a rate limit here, a renamed field there — and every time it breaks, the analysis stops while you re-learn a system you touch twice a year. The data is all there. It is simply not anywhere you can use it.

Why / When

Research data lives in systems — databases, APIs, the open web — and the obstacle is rarely the analysis; it is the reach. MCP (the Model Context Protocol) is a standard socket between an agent and those systems: you register a server once, and its operations appear to the agent as native tools, the way a built-in file-read does. A DuckDB server means the agent queries the warehouse directly instead of shelling out and parsing text; a database server means credentials are held by the connection, not pasted into a prompt.

The honest boundary matters as much as the capability. The agent can already run a CLI through the shell — B2 had it download 24 months with plain curl and query DuckDB at a shell prompt. MCP earns its keep exactly where the shell strains: persistent connections the agent returns to across many calls without re-establishing state, credentialed systems where you want the secret in the connection and not the transcript, and non-CLI services that have no command-line at all. Where a single shell line already reaches the system, a protocol around it is ceremony.

In the pipeline, this is the data-acquisition and integration stage, and the role it absorbs is the data engineer: the seams built once, the raw responses cached so the joins reproduce, the warehouse and the analysis panel that turn five systems into one table you can regress on.

Contrary winds

Not for: a system you already reach in one shell line — the agent can run `duckdb` or `curl` directly (B2's lesson); wiring a protocol around a CLI you already have is plumbing nobody asked for.

Mechanics

The shared model first, then registration in each dialect, then the three things the project actually builds: the warehouse, the enrichment joins, and the panel.

An MCP server is a process that exposes a set of operations — query this database, fetch this URL, screenshot this page — over a uniform protocol. The agent discovers them at startup and calls them like any other tool; you register the server once and it is there every session. That is the whole idea, and the discipline is knowing when not to reach for it:

Reach for MCP when…	Stay with the shell when…
the connection is persistent — many calls, shared state	one call, one CLI line does it
the system is credentialed — keep the secret in the connection	the data is open, no auth
the service is non-CLI — a browser, a SaaS API, a live DB	a `curl` or `duckdb` invocation already reaches it

DuckDB sits right on the line, and it is instructive that it does. The agent can run duckdb warehouse.duckdb -c "…" at a shell prompt all day. A DuckDB MCP server earns its place only because the warehouse is a persistent connection the agent returns to dozens of times in a session — keeping one handle open beats re-opening the database on every query and re-parsing text output each time. If your interaction is one query, the shell is the right tool and the protocol is overhead.

Registering a server

The server you run is identical; the registration surface differs.

Claude Code

Register a server in .mcp.json at the repo root — committed, so it arrives with the clone — or add it from the command line with claude mcp add:

{
  "mcpServers": {
    "duckdb": {
      "command": "uvx",
      "args": ["mcp-server-motherduck", "--db-path", "warehouse.duckdb"]
    }
  }
}

A project-scoped server registers per project, which is the discipline that matters: a server added globally loads its tools into every session you ever run, paying context and startup cost on projects that never touch a warehouse. Register the DuckDB server in this repo’s .mcp.json, not your user config, and the next project starts clean.

Codex

Register a server under [mcp_servers] in config.toml — the project-layer team config, committed with the repo — or add it from the command line with codex mcp add:

[mcp_servers.duckdb]
command = "uvx"
args = ["mcp-server-motherduck", "--db-path", "warehouse.duckdb"]

Keep the registration in the project-layer config rather than your personal ~/.codex/config.toml: a server declared globally loads its tools into every session, paying context and startup cost on projects that never open a warehouse. Project-scoped registration keeps the next session you start somewhere else clean.

The warehouse

With the DuckDB server registered, the agent builds the warehouse by querying the connection directly. The first move is reconnaissance, not construction: have it explore INFORMATION_SCHEMA so it learns the real schema from the files instead of assuming one — the B2 plan-first habit, applied to a live system. Then it materializes the raw table straight from the TLC CDN with read_parquet — the same public slice the kit mirrors, which you can also read in place over DuckDB-over-HTTP (Get the data) — and registers C2’s cleaned table as a view, so the contracted cleaning SOP is the only path into analysis:

-- 1. Learn the schema from the source, don't assume it.
DESCRIBE SELECT * FROM read_parquet('data/raw/yellow_2024-03.parquet');

-- 2. Materialize 24 months; union_by_name absorbs the schema drift C2 maps.
CREATE TABLE trips_raw AS
  SELECT * FROM read_parquet('data/raw/yellow_*.parquet', union_by_name = true);

-- 3. Register C2's cleaning SOP as the only path into analysis.
CREATE VIEW trips_clean AS SELECT * FROM standardize(trips_raw);

union_by_name = true is the quiet hero: it aligns columns across 24 files that do not all agree on order or presence — a second layer under C2’s rename map against exactly the casing drift that cost eleven days in that lesson. Belt, and suspenders.

Enrichment joins, cached raw then derived

The trips are now one table; the question needs four more sources joined to them. Two come from live systems — Open-Meteo for hourly weather, the Socrata portal for permitted events — and the discipline for both is the same and non-negotiable: cache the raw response, then derive. Every API payload is checksummed into data/raw/api/ before any transform touches it, because the API you query in June is not the API that answers in February, and a join you cannot reproduce is a result you cannot defend — these joins get interrogated by the referee in D4.

Python

import hashlib, json, pathlib, httpx

RAW = pathlib.Path("data/raw/api")

def fetch_weather(start: str, end: str) -> dict:
    # One call per archive window; the response is the artifact of record.
    r = httpx.get("https://archive-api.open-meteo.com/v1/archive", params={
        "latitude": 40.71, "longitude": -74.01, "start_date": start,
        "end_date": end, "hourly": "temperature_2m,precipitation,snowfall,wind_speed_10m",
        "timezone": "America/New_York",      # ask for local time explicitly
    })
    r.raise_for_status()
    body = r.content
    # Checksum the raw bytes into data/raw/api/ BEFORE any parsing.
    digest = hashlib.sha256(body).hexdigest()[:16]
    (RAW / f"weather_{start}_{end}.{digest}.json").write_bytes(body)
    return json.loads(body)        # derive only from what we just cached

This block is orchestration, not statistics — it’s the same in R. Ask the agent to translate (Lesson A1).

R

library(httr2); library(jsonlite); library(digest)

fetch_weather <- function(start, end) {
  # One call per archive window; the response is the artifact of record.
  resp <- request("https://archive-api.open-meteo.com/v1/archive") |>
    req_url_query(latitude = 40.71, longitude = -74.01,
                  start_date = start, end_date = end,
                  hourly = "temperature_2m,precipitation,snowfall,wind_speed_10m",
                  timezone = "America/New_York") |>    # ask for local time explicitly
    req_perform()
  body <- resp_body_raw(resp)
  # Checksum the raw bytes into data/raw/api/ BEFORE any parsing.
  digest16 <- substr(digest(body, algo = "sha256", serialize = FALSE), 1, 16)
  writeBin(body, file.path("data/raw/api",
           sprintf("weather_%s_%s.%s.json", start, end, digest16)))
  fromJSON(rawToChar(body))          # derive only from what we just cached
}

The events join carries one more trap worth naming: Socrata paginates, and the holidays come from a library, so weather_hourly, events, holidays, and zones each land as a cached-raw-then-derived table before any of them touches the panel. The rule is uniform — nothing enters the warehouse that was not first written to data/raw/api/ under its checksum.

Building the analysis panel, and the DST that everyone gets wrong

The panel is a zone × hour grid: one row per zone per hour across the whole window, demand attached, weather and events and holidays joined on. Two properties make it honest, and both are easy to get wrong.

First, zero-demand cells are real data. An hour in which a zone saw no pickups is a 0, not a missing row — and a 3 a.m. outer-borough zone is zero far more often than not. Build the grid as a CROSS JOIN of zones and an hour spine, LEFT JOIN the demand onto it, and coalesce the nulls to zero; an inner join would silently drop exactly the low-demand cells a demand model most needs to see.

Second — DST, and not the trap you were warned about. The folklore says spring-forward creates phantom timestamps. It does not, at least not here: the TLC meters got the 2024-03-10 spring-forward right, and zero trips are stamped inside the 02:00 hour that the local clock skips. The real trap is two-headed:

The hour spine. America/New_York has 743 local hours in March 2024, not 744 — the 02:00 hour does not exist. A naive spine of 31 × 24 invents a phantom hour no trip can ever fill, and your panel carries a row of structural zeros that are not zeros, they are nothing. Build the spine in local time and it self-corrects: 743 hours, matching the data.
Gap-spanning durations. This is the one that bites the duration workstream. A trip from 01:03 to 03:01 on spring-forward night reads 118 clock-minutes by naive subtraction — but only 58 minutes actually elapsed; the 02:00 hour never happened. Compute durations from wall-clock fields and every gap-spanning trip is overstated by exactly sixty minutes. Compute them in UTC (or from epoch differences) and they are true. The panel’s spine lives in local time — hour-of-day effects are the research question — but durations live in UTC. Each timezone job gets its own correct answer; using one answer for both is the silent bug.

The night an hour vanished — Citywide hourly pickups around the 2024-03-10 spring-forward: by the local clock there is a hole at 02:00 (the hour does not exist, and contains 0 trips); in UTC the same night is seamless.

Watch the panel materialize, the snowstorm and all. The heatmap below is one week of the real panel — Manhattan, hour × day — straight from analysis_panel with the zero cells included. The trench cut through the middle of the week is the February 13 snowstorm, which dropped 19 cm and cut Manhattan demand by 45 percent. (Rain, counterintuitively, cuts the other way: heavy-rain hours run about 62 percent above baseline — when it pours, people who would have walked take a cab. The weather scatter in Unit D makes that case in full.)

One week of the panel: Manhattan, hour × day — Demand by hour × day for Manhattan, week of 2024-02-12, straight from panel_zone_hour (zero cells included). The Feb 13 snowstorm (19.1 cm) cuts a visible trench through the week.

Sources without an API — the in-app browser

Most of the project’s sources have an API or a CLI. The day one does not — a permit calendar that exists only as a rendered web page, a portal that hides its data behind a form — you need the agent to see a page, not fetch it. Here the tools genuinely diverge, so this is a spotlight, not a tab: one tool owns the capability and the other reaches it by a labeled workaround.

Codex Your tool

In-app browser / computer use

Codex can drive a real browser inside the session — load a page, read what renders, fill a form, click through, and capture what it sees (Appshots). For a source that publishes only as HTML — a borough permit calendar with no JSON endpoint — the agent navigates it the way you would, extracts the rows, and (the C3 discipline holds) writes the raw captured page into data/raw/api/ under a checksum before deriving an events supplement from it. The browsing is native: no separate server to register, the page is just another surface the agent can act on.

Nearest equivalent — Claude Code

Claude Code reaches the same capability through MCP itself: register the Playwright MCP server and the agent gains tools to open a page, read the rendered DOM, screenshot it, and interact. It is the same job — see a page that has no API, extract the rows, cache the raw capture under a checksum, derive the events supplement — reached through a server you register rather than a capability built into the session. The seam is the registration: one more entry in .mcp.json, and the agent that builds your warehouse can also read the web pages your warehouse needs.

Watch this space as of 2026-06 Browser access is converging fast on both roadmaps; recheck quarterly.

Guided Run — Building the Warehouse

Field Terminal — session: c3-warehouse Claude Code

claude

The simulator needs JavaScript. The full transcript of this run is described in the lesson text above — nothing below is required reading.

Field Assignment

Artifact make check-c3 passes

Build the warehouse, wire the enrichment, and materialize the analysis panel — all of it under the B3 pipeline profile and the C2 contracts, so every write that builds the panel passes the gate that protects it.

Claude Code

Register the DuckDB server in .mcp.json (project-scoped) and confirm its tools load at session start.
Have the agent explore INFORMATION_SCHEMA, then build warehouse.duckdb from the TLC CDN via read_parquet and register C2’s trips_clean as a view.
Build weather_hourly, events, holidays, and zones — each cached-raw-then-derived, every API response checksummed into data/raw/api/ before any transform.
Build analysis_panel: a complete zone × hour grid, zero-cells included via LEFT JOIN + coalesce, with the DST handling above — a 743-hour local spine for March, durations computed in UTC.
make check-c3.

Codex

Register the DuckDB server under [mcp_servers] in config.toml (project-scoped) and confirm its tools load at session start.
Have the agent explore INFORMATION_SCHEMA, then build warehouse.duckdb from the TLC CDN via read_parquet and register C2’s trips_clean as a view.
Build weather_hourly, events, holidays, and zones — each cached-raw-then-derived, every API response checksummed into data/raw/api/ before any transform.
Build analysis_panel: a complete zone × hour grid, zero-cells included via LEFT JOIN + coalesce, with the DST handling above — a 743-hour local spine for March, durations computed in UTC.
make check-c3.

make check-c3 validates schemas, zero-cell completeness, the DST hour counts (743 for March, 745 for November), and that every enrichment table has a cached raw response under its checksum. This panel is what D1–D4 estimate on — every elasticity, every event study, every robustness spec reads from the table you build here, which is why its zero-cells and its DST handling have to be right exactly once.

Milestone gate · make check-c3advances C3

warehouse.duckdb built via read_parquet from the TLC CDN; C2's cleaned tables registered as views
The agent explores INFORMATION_SCHEMA first — it should learn the schema, not assume it.
weather_hourly, events, holidays, zones enrichment tables built from cached-raw API responses
Every API response checksummed into data/raw/api/ BEFORE any transform — cache-raw-then-derive.
analysis_panel is a complete zone × hour grid with zero-demand cells INCLUDED
An hour with no pickups is a real 0, not a missing row; left-join the grid, coalesce to 0.
DST handled as duration-vs-clock, not as phantom timestamps
March has 743 panel hours, not 744; a gap-spanning trip's clock minutes overstate its true minutes.
All four hook contracts from C2 still pass over the enriched warehouse
C3 builds on C2's gate — the panel is only worth plumbing because every write passed it.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

[both] 〜〜

Deriving from an API response without caching the raw payload first. The Open-Meteo archive and the Socrata portal both revise — silently, on their own schedule — and a join you built against last month’s response cannot be reproduced against this month’s. Checksum the raw bytes into data/raw/api/ before any parse; the cached payload is the artifact of record the referee will ask for in D4.
[both] 〜〜

Using one timezone answer for two timezone jobs. A local-hour spine is correct for hour-of-day demand and wrong for trip durations; a UTC duration is correct for elapsed time and wrong for “what hour was this.” The gap-spanning trip that reads 118 clock-minutes but elapsed 58 is the classic silent bug — the panel’s DST checks exist to catch exactly it.
[both]

Building the panel with an inner join and losing the zeros. An hour a zone saw no pickups is a 0, and a demand model that never sees its zeros is estimating on a truncated sample without knowing it. Grid first, demand left-joined, nulls coalesced to zero — the completeness is the data, not a formatting nicety.
[both] 〜〜

MCP servers with write access to systems you care about. A server that can write your warehouse is a server that can corrupt it; scope its credentials per B3’s least-privilege profiles, and prefer a read-only connection for any server whose only job is to be queried. The protocol’s convenience is not a reason to hand it your keys.
[CC]

A globally registered server piles its tools into every session. The DuckDB or Playwright server you add to your user config loads — and costs context and startup — on every project, including the ones with no warehouse and no web sources. Register servers per project in the repo’s .mcp.json, and the next project starts clean.

Check Your Bearings

C3 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.

(noted in your field journal as an override)

The interactive check needs JavaScript — without it this section shows only the quiz cover. The lesson text above is complete without the quiz; answers and journal recording require JavaScript.

Field journal

Record how the panel handles DST: how many hours March has, and the one gap-spanning trip whose clock duration would have lied to you.

as of June 2026

MCP is a genuine parity feature: both tools speak the same protocol and register servers the same way in substance, differing only in surface — .mcp.json / claude mcp add on one side, [mcp_servers] in config.toml / codex mcp add on the other. The asymmetry is at the edge of the lesson, not its center: seeing a web page that has no API is native to Codex (its in-app browser and computer use) and reached in Claude Code through MCP itself (the Playwright server) — same job, one built in and one registered. Both converge quickly; the warehouse and the enrichment joins, which are the milestone, are identical either way.

Feature-parity matrix

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Your position

Positions

the data manager

Position vacant — engaged at C2

write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter
the methodologist

Position vacant — engaged at C1

the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
the data engineer

Position vacant — engaged at C3

MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
the RA pool

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the overnight RA

Position vacant — engaged at D3

/loop supervision + Goal Mode runs over background estimation

est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine
the adviser

Position vacant — engaged at D1

parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
the referee

Position vacant — engaged at D4

contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
the lab manager

Position vacant — engaged at E2

scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
the reproducibility checker

Position vacant — engaged at E1

headless invocation + the fresh-clone replication self-test + CI gates

est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
the the wall — the unstaffed midnight hours between a raw file and a first plot

Position vacant — engaged at A1

the bare agent loop (prompt → act → observe → fix), zero configuration

est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free
the you, working an order of magnitude faster — but only if you direct the work

Position vacant — engaged at A2

the command surface + five prompting patterns + context hygiene

est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
the the lab manual nobody writes — the institutional knowledge that lives in your head

Position vacant — engaged at B1

instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter
the careful senior who plans before touching data

Position vacant — engaged at B2

repo scaffold + pinned environments + read-only Plan mode reconnaissance

est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention
the the lab whose members don't overwrite each other

Position vacant — engaged at D2

git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
the the onboarding the lab never has to repeat

Position vacant — engaged at E3

lab-kit — the whole methodology packaged as a one-command install

est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
the the whole lab, orchestrated — the PI who designs the system instead of doing the work

Position vacant — engaged at F1

the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson	Role	Est. human-RA	Agent (yours when measured)
A1	the wall — the unstaffed midnight hours between a raw file and a first plot	an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work	~10 minutes for the quick win, plus the same task re-run in the other language for free
A2	you, working an order of magnitude faster — but only if you direct the work	the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong	~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1	the lab manual nobody writes — the institutional knowledge that lives in your head	~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down	written once in an hour; reloaded free at the start of every session thereafter
B2	careful senior who plans before touching data	~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots	an afternoon — most of it download wall-clock, not attention
B3	the data manager who guards the raw files — the person who says no near the master copies	permanent vigilance you cannot staff — one lapse at machine speed costs a month of re-downloads	two profiles configured once in minutes; the fence then holds every session, tired or not
C1	the methodologist — the one person who knows how the lab actually decides	the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do	an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2	data manager / QA who never sleeps	permanent vigilance — est. 2 weeks/year of load-checking and release-note reading	half a day to install and test the 9-line block; ~20 s per run thereafter
C3	the data engineer who wires the lab to its systems	days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes	register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1	the RA pool — and the adviser who critiques from outside	a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will	~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2	the lab whose members don't overwrite each other	the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time	two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3	overnight RA	one night shift per estimation batch — and the course runs several batches	~10 min to write the check or the objective; the night itself belongs to the machine
D4	an RA bench and the PI who keeps their results comparable	the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for	13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1	reproducibility checker	a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission	~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2	lab manager's standing chores	a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped	~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3	the onboarding the lab never has to repeat	six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over	~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1	the whole lab, orchestrated — the PI who designs the system instead of doing the work	each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits	the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed		0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.

Plugging Into Data Systems

The Pain

Why / When

Mechanics

The MCP model, and the honest sidebar

Registering a server

Claude Code

Codex

The warehouse

Enrichment joins, cached raw then derived

Python

R

Building the analysis panel, and the DST that everyone gets wrong

dst_window 35 rows

phantom_hour_trips count = 0

heat 168 rows

snow_feb13_cm value = 19.11

Sources without an API — the in-app browser

In-app browser / computer use

Guided Run — Building the Warehouse

Field Assignment

Claude Code

Codex

Pitfalls & Gotchas

Check Your Bearings

Ledger — C3

The Lab Roster

Your position

Positions

Running Totals

The Pain

Why / When

Mechanics

The MCP model, and the honest sidebar

Registering a server

✳ Claude Code

⬡ Codex

The warehouse

Enrichment joins, cached raw then derived

Python

R

Building the analysis panel, and the DST that everyone gets wrong

dst_window 35 rows

phantom_hour_trips count = 0

heat 168 rows

snow_feb13_cm value = 19.11

Sources without an API — the in-app browser

Guided Run — Building the Warehouse

✳ Claude Code

⬡ Codex

Pitfalls & Gotchas

Parity note

Claude Code

Codex

Claude Code

Codex