D2 intermediate ~45 min

One Repo, Many Hands: Worktrees

Absorbs: the lab whose members don't overwrite each other

Advances D2

The Pain

You had two clean ideas and one afternoon, so you ran them at once. One agent was refining the demand-elasticity prep; the other was building the trip-duration robustness checks. Same repository, because it was faster, and what could collide — they were working on different parts of the analysis. For twenty minutes nothing did.

Then both reached for src/build_panel.py, because both needed the panel and neither knew the other existed. The first agent rewrote the zone join; the second, a beat later, rewrote the same function for duration weighting and saved over it. Git, asked to hold two incompatible edits to one file from two writers who never spoke, did the only thing it could: it thrashed. You got a merge conflict you did not author, in a file neither agent had finished, while a third write (a results file from the first run) landed on top of the second run’s half-written output. You spent the rest of the afternoon doing neither analysis: you were disentangling them, reading diffs to reconstruct which agent meant what, and you got it wrong once and had to do it twice.

A real lab does not seat two researchers at one desk and one keyboard. Each gets their own workspace, their own copy of the shared materials, and the work is combined deliberately, by someone whose job is to combine it — not by collision. The parallelism was never the problem. The shared desk was.

Why / When

The moment two agents work the same repository at once, they trample each other: half-written transforms collide, results overwrite, git thrashes on edits no single author made. The mechanism a lab uses to prevent this is the same one a version-control system already offers — a separate working directory per worker, backed by the same shared history, combined through ordinary review. Each agent gets its own checkout; nobody writes where anybody else is writing; merges happen on purpose, by a human acting as referee.

This is the load-bearing mechanic for everything at scale that follows: the overnight runs in D3 each want their own tree so a 3 a.m. failure is a deletable directory, and the fleets in D4 want isolation per code-touching variant. It earns its own lesson because getting it wrong is not a style problem — it is the lost afternoon in the Pain vignette, and it scales with the number of hands. The discipline accelerates nothing on its own; what it does is make parallelism safe, which is the only thing that makes parallelism worth doing.

Contrary winds

Not for: agents that only ever write results, never code, under a shared contract. A manifest with one-file-per-run rules lets them share one tree safely, and a worktree each is then just ceremony (the D4 pattern).

Mechanics

Field note

There is nothing language-specific here: worktrees are a git mechanic, and the agents inside them may write Python or R without changing a word of this page. That is why this page declares no R variants.

The mechanic

A git worktree is a second working directory attached to the same repository: one shared object store and history, but each worktree has its own checked-out files and its own branch (git refuses to check out one branch in two worktrees at once). You create one with a single command, and the agent (or session) rooted there leaves another worktree’s files alone in normal operation, because they live in a different directory:

one desk per worker
# from the main checkout, give each workstream its own tree + branch
git worktree add ../weather-mobility-w1 -b w1-elasticity
git worktree add ../weather-mobility-w2 -b w2-duration
git worktree list # the main tree plus the two new desks
# … work happens in each independently; combine through review:
git switch main && git merge w1-elasticity # deliberate, reviewed

This beats the obvious alternative, copying the whole folder twice, on every axis that matters. The worktrees share history, so a commit made in one is immediately visible from the others and there is no re-syncing. They are cheap, sharing the object store rather than duplicating it. And they force a disciplined merge, because combining work means a real git merge a human reviews, not a file copy nobody audited. The one case where you do not need them is the Not for case above: agents that only write results under a contract never touch shared code, so they can share one tree. There the manifest, not the worktree, is doing the isolation (D4).
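For a project with more than two workstreams, the same command can be generated rather than typed. The sketch below only builds and prints the argument lists; the function name `worktree_add_argv` and the `(workstream, topic)` pairs are illustrative assumptions, and the path and branch layout simply copy the example above.

```python
import shlex


def worktree_add_argv(repo_name: str, workstream: str, topic: str) -> list[str]:
    """Build `git worktree add <path> -b <branch>` for one workstream.

    The sibling-directory path and the `<workstream>-<topic>` branch name
    mirror the example in the text; they are conventions, not git rules.
    """
    path = f"../{repo_name}-{workstream}"
    branch = f"{workstream}-{topic}"
    return ["git", "worktree", "add", path, "-b", branch]


# Hypothetical workstream table for this project.
workstreams = [("w1", "elasticity"), ("w2", "duration")]
commands = [worktree_add_argv("weather-mobility", ws, topic) for ws, topic in workstreams]

for argv in commands:
    # Dry run: print the command instead of executing it.
    print(shlex.join(argv))
```

To create the trees for real, each `argv` could be passed to `subprocess.run(argv, check=True)` from the main checkout. Keeping the builder separate from the runner makes the plan reviewable before anything touches the repository.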

System Player film: Worktree Collision Counterfactual
Two agents writing one shared working tree overwrite each other's transforms and thrash git into a corrupt state; replayed with one git worktree per agent, the writes never touch and the branches merge clean.

Step 1 of 7.

You want two agents working at once — A on the cleaning transforms, B on the figures. The obvious move is to point both at the same checkout and let them go. One repo, one working tree, two sets of hands.

The two tools reach the same isolation by different routes — one through local primitives you compose, one through a managed multitasking model — so this is a dual treatment, not a tab. Neither hides; the contrast is instructive.

Claude Code

A session per worktree, and agents spawned into one

The recommended pattern is one claude session per worktree: open the elasticity tree in one, the duration tree in another, and they cannot collide because each is rooted in a different directory. The desktop app runs these as parallel sessions across worktrees side by side.

Beyond hand-driven sessions, a subagent or workflow can be spawned directly into a fresh worktree — the orchestrator creates the tree, runs the agent there, and auto-cleans the worktree if the agent left it unchanged, so a survey that produced nothing leaves no litter. This is the primitive D4’s fleet stands on: each code-touching variant gets its own worktree, created and disposed of by the workflow, isolated from the others by construction. The composition — worktree, plus agent, plus auto-cleanup — is something you assemble from pieces, which is the Claude Code shape throughout.
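What "left it unchanged" could mean is worth pinning down if you assemble this yourself. The sketch below is one plausible reading, not Claude Code's actual logic: a tree counts as unchanged when `git status --porcelain` (run inside it) prints nothing and the branch has no commits beyond its base, for example as counted by `git rev-list --count main..HEAD`. The function names are hypothetical.

```python
from typing import Optional


def worktree_is_unchanged(status_porcelain: str, commits_ahead: int) -> bool:
    """True when there are no uncommitted or untracked files and no new commits.

    status_porcelain: raw output of `git status --porcelain` in the worktree.
    commits_ahead:    number of commits on the branch that its base lacks.
    """
    has_local_edits = any(line.strip() for line in status_porcelain.splitlines())
    return not has_local_edits and commits_ahead == 0


def cleanup_argv(path: str, status_porcelain: str, commits_ahead: int) -> Optional[list[str]]:
    """Return the removal command for an unchanged tree, or None to keep it."""
    if worktree_is_unchanged(status_porcelain, commits_ahead):
        return ["git", "worktree", "remove", path]
    return None


# A survey that wrote nothing: safe to dispose of.
print(cleanup_argv("../weather-mobility-w1", "", 0))
# An agent that edited a tracked file: keep the tree for review.
print(cleanup_argv("../weather-mobility-w2", " M src/build_panel.py\n", 0))
```

The conservative choice here is to keep anything that differs at all, so that cleanup never deletes work a referee has not yet seen.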

Codex

Parallel threads natively, and cloud tasks as the managed worktree

The desktop app’s core model is multitasking: it runs parallel threads, each backed by its own worktree natively — starting a second thread on a second workstream gives it an isolated checkout without your asking, because that is the app’s central abstraction rather than a pattern you assemble. CLI users get the same isolation the ordinary way: a plain git worktree per thread.

The managed analogue goes one step further. A cloud task runs in an isolated cloud environment — its own container, its own branch — and returns a diff when it finishes. That is a worktree you never have to create, clean, or even keep on your laptop: the isolation is the service’s, and what comes back is reviewable exactly like a pull request. The trade is that the isolation is real but opaque — you review the returned diff, you do not watch the desk — which is the managed-delegation shape throughout.
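Reviewing a returned diff starts with knowing what it touched. A minimal sketch, assuming the diff arrives as ordinary `git diff` text: it counts added and removed lines per file so the referee can see the footprint before reading line by line. The sample diff is invented for illustration, and paths containing " b/" would confuse the simple header pattern used here.

```python
import re

# Matches the per-file header that git writes, e.g. "diff --git a/x.py b/x.py".
DIFF_HEADER = re.compile(r"^diff --git a/(.+) b/(.+)$")


def summarize_diff(diff_text: str) -> dict[str, tuple[int, int]]:
    """Map each touched file to (lines_added, lines_removed)."""
    counts: dict[str, list[int]] = {}
    current = None
    in_hunk = False
    for line in diff_text.splitlines():
        header = DIFF_HEADER.match(line)
        if header:
            current = header.group(2)
            counts[current] = [0, 0]
            in_hunk = False  # the ---/+++ lines that follow are not content
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and current is not None:
            if line.startswith("+"):
                counts[current][0] += 1
            elif line.startswith("-"):
                counts[current][1] += 1
    return {path: (added, removed) for path, (added, removed) in counts.items()}


SAMPLE_DIFF = """\
diff --git a/src/build_panel.py b/src/build_panel.py
index 1111111..2222222 100644
--- a/src/build_panel.py
+++ b/src/build_panel.py
@@ -1,2 +1,3 @@
 def build_panel(trips, zones):
-    return trips.merge(zones, on='zone')
+    panel = trips.merge(zones, on='zone_id')
+    return panel
"""

print(summarize_diff(SAMPLE_DIFF))
```

A summary like this does not replace reading the diff; it tells you where to start and flags files you did not expect the task to touch.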

Translation guide

Intent | Claude Code | Codex
two workstreams at once, locally | one session per worktree (desktop: parallel sessions across worktrees) | parallel threads, each its own worktree natively (CLI: git worktree per thread)
an agent isolated for a code-touching task | subagent/workflow spawned into a fresh worktree, auto-cleaned if unchanged | a cloud task: isolated container + branch, returns a reviewable diff
combining the work | deliberate git merge, human as referee | review the returned diff like a PR, then merge

Worktree discipline for analysis projects

The mechanic is cheap; the discipline is what keeps it honest, and it is the same in either tool:

  • Branch by workstream. w1-elasticity, w2-duration — the branch name says which analysis it carries, so the merge referee knows what they are combining before they read a line.
  • Know what merges and what never does. Code and specs merge — they are the shared methodology. Scratch outputs do not: a results file is owned by a contract (D4), not reconciled by a git merge. Merging two agents’ results/ is how you get a file that is neither run, and it is the second collision in the Pain vignette.
  • Keep results/ out of worktree merges. The D4 results contract governs result files; the git merge governs code. Conflating them re-creates exactly the overwrite you used worktrees to prevent.
  • The human is the merge referee. Parallelism is safe only because someone deliberately decides what combines. The tool isolates; you reconcile. Never let a merge happen by collision instead of by decision.
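The split between what merges and what never does can be checked mechanically before the referee approves anything. A minimal sketch, assuming the changed paths come from something like `git diff --name-only main...w1-elasticity`; the function name and the `specs/duration.yaml` path are hypothetical, and `results` is taken from the rule above.

```python
from pathlib import PurePosixPath


def split_merge_paths(changed_paths: list[str], contract_dir: str = "results") -> tuple[list[str], list[str]]:
    """Separate paths a git merge may reconcile from paths a contract owns."""
    mergeable: list[str] = []
    contract_owned: list[str] = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Only the top-level directory decides ownership.
        if parts and parts[0] == contract_dir:
            contract_owned.append(raw)
        else:
            mergeable.append(raw)
    return mergeable, contract_owned


changed = ["src/build_panel.py", "specs/duration.yaml", "results/run_01.csv"]
mergeable, contract_owned = split_merge_paths(changed)
print("merge:", mergeable)
print("leave to the contract:", contract_owned)
```

If `contract_owned` is non-empty, the branch is carrying outputs into the merge and the referee can stop before the overwrite happens.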

Guided Run — One Repo, Many Hands: a desk per workstream

Field Terminal — session: d2-worktrees Claude Code
git worktree add ../weather-mobility-w1 -b w1-elasticity

Field Assignment

Artifact: make check-d2 passes (both branches merged, history linear per workstream, zero collisions)

Run the project’s two workstreams concurrently — and prove they never touched each other.

  1. Give each workstream its own worktree and branch: w1-elasticity for the demand-elasticity prep, w2-duration for the trip-duration robustness prep.
  2. Run both at the same time — per your tool below — one agent refining the elasticity prep, one building the duration prep, each rooted in its own tree.
  3. Merge both branches back to main deliberately, as the referee: code and specs merge; no results/ file is reconciled by the merge.
  4. Demonstrate zero collisions: each workstream’s history is linear, and no file was overwritten across trees.
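One way to gather evidence for step 4 is to read `git worktree list --porcelain`, which prints one blank-line-separated record per tree with `worktree`, `HEAD`, and `branch` lines. The sketch below parses that text and checks that no two trees share a directory or a branch. The paths and commit hashes in the sample are placeholders, and the helper names are illustrative.

```python
def parse_worktrees(porcelain: str) -> list[dict[str, str]]:
    """Turn `git worktree list --porcelain` output into one dict per tree."""
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value  # flags such as "detached" get an empty value
    if current:
        records.append(current)
    return records


def desks_are_separate(records: list[dict[str, str]]) -> bool:
    """True when every tree has a distinct directory and a distinct branch."""
    paths = [r["worktree"] for r in records]
    branches = [r["branch"] for r in records if "branch" in r]
    return len(set(paths)) == len(paths) and len(set(branches)) == len(branches)


SAMPLE = """\
worktree /work/weather-mobility
HEAD 1111111111111111111111111111111111111111
branch refs/heads/main

worktree /work/weather-mobility-w1
HEAD 2222222222222222222222222222222222222222
branch refs/heads/w1-elasticity

worktree /work/weather-mobility-w2
HEAD 3333333333333333333333333333333333333333
branch refs/heads/w2-duration
"""

trees = parse_worktrees(SAMPLE)
print(len(trees), desks_are_separate(trees))
```

This shows the desks are distinct; it does not by itself show that no agent wrote outside its own desk, which is what the per-branch history is for.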

Claude Code

Open one claude session per worktree (or spawn a worktree-isolated agent per workstream). Let them run simultaneously; confirm neither session can see the other’s working files. Merge w1-elasticity then w2-duration into main, reviewing each diff as the referee.

Codex

Run the two workstreams as parallel threads (each its own worktree), or hand one to a cloud task and review its returned diff. Confirm the threads never share a working directory. Merge both branches into main, reviewing each diff — the cloud task’s exactly like a PR from a new student.

make check-d2 verifies both branches merged cleanly and that each workstream’s history is linear — no cross-tree overwrite, no merge you did not author. This is the mechanic D3 runs its overnight jobs inside and D4 fans its fleet across.
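The real contents of make check-d2 are not shown here, so the following is only a sketch of what a linearity check could look like. It assumes "linear" means that a workstream's own commits form a single chain with no merge commits, and that the input is the text of `git log --format="%H %P" main..w1-elasticity`, where each line is a commit hash followed by its parent hashes. The short hashes in the demo are placeholders.

```python
def history_is_linear(log_text: str) -> bool:
    """Check that the listed commits form one unbranched chain, newest first."""
    rows = [line.split() for line in log_text.splitlines() if line.strip()]
    for i, row in enumerate(rows):
        parents = row[1:]
        if len(parents) > 1:
            return False  # a merge commit inside the workstream
        if i + 1 < len(rows) and parents != [rows[i + 1][0]]:
            return False  # this commit does not sit directly on the next one
    return True


linear_log = "c3 c2\nc2 c1\nc1 c0\n"
merged_log = "m1 c2 d1\nc2 c1\n"
print(history_is_linear(linear_log), history_is_linear(merged_log))
```

The last commit in the range is allowed any single parent, because that parent lives on main rather than in the workstream.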

Milestone gate · make check-d2 · advances D2
  1. Branch by workstream so the merge referee knows what they're combining.

  2. Merge code and specs; let the contract own the outputs.

Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.

Pitfalls & Gotchas

  • [both]

    “Two agents in one tree, just this once.” The collision does not happen the afternoon you decide it is fine — it happens the afternoon it matters, on the file you cared about, and you spend the day reconstructing which writer meant what. For a result you intend to publish, an unaudited overwrite is not an inconvenience; it is a number you can no longer explain. The worktree is one command and it is the difference.

  • [both]

    Merging scratch outputs. Results belong to contracts (D4), not to git merges: reconciling two agents’ results/ produces a file that is neither run. Merge code and specs; let the contract own the outputs.

  • [CC]

    Worktree-spawned agents that mutate global state — installed packages, shared caches, a global config — escape the isolation the worktree gave them, because that state lives outside any tree. Keep environments per-worktree (B2’s pinned lockfile, restored inside each tree) so the isolation is real and not just file-deep.

  • [CX]

    Cloud-task isolation is real but opaque: you do not watch the work, you receive a diff. Review that diff like a pull request from a new student — line by line, asking what it touched and why — not like a trusted teammate’s. Opaque isolation only protects you if you read what comes back.
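The per-worktree environment advice in the [CC] pitfall can be spot-checked by comparing the pinned lockfile across trees. A minimal sketch, assuming each tree holds its own copy of the lockfile; the filename `requirements.lock` and the package line are placeholders, since the name of the lockfile from B2 is not given here. The demo builds two throwaway directories instead of touching a real repository.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def lockfile_digests(worktrees: list[Path], lockfile: str = "requirements.lock") -> dict[str, Optional[str]]:
    """SHA-256 of each tree's lockfile, or None where the file is missing."""
    digests: dict[str, Optional[str]] = {}
    for tree in worktrees:
        candidate = Path(tree) / lockfile
        if candidate.is_file():
            digests[str(tree)] = hashlib.sha256(candidate.read_bytes()).hexdigest()
        else:
            digests[str(tree)] = None
    return digests


def environments_agree(digests: dict[str, Optional[str]]) -> bool:
    """True when every tree has a lockfile and all of them are identical."""
    values = list(digests.values())
    return None not in values and len(set(values)) == 1


with tempfile.TemporaryDirectory() as root:
    w1 = Path(root) / "weather-mobility-w1"
    w2 = Path(root) / "weather-mobility-w2"
    for tree in (w1, w2):
        tree.mkdir()
        (tree / "requirements.lock").write_text("example-package==1.0.0\n")
    agree_when_identical = environments_agree(lockfile_digests([w1, w2]))

    # Simulate one agent bumping a pin inside its own tree.
    (w2 / "requirements.lock").write_text("example-package==1.0.1\n")
    agree_after_drift = environments_agree(lockfile_digests([w1, w2]))

print(agree_when_identical, agree_after_drift)
```

Matching lockfiles show the trees were pinned alike; they do not prove an agent left shared caches or global config alone, so this complements the discipline rather than replacing it.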

Check Your Bearings

D2 · 4 questions · unlimited retries, no timer

This check opens when the guided simulation above is complete — the questions assume you have seen the run.


Field journal

as of June 2026

Parity note

This is a Tier-2 split: both tools deliver true per-worker isolation, but by different primitives. Claude Code composes it from local pieces — a session or a spawned agent per git worktree, with auto-cleanup of unchanged trees — the assemble-it-yourself shape. Codex makes parallel worktree-backed threads the desktop app’s native model and offers cloud tasks as a managed worktree you never create, clean, or hold locally, returning a reviewable diff — the managed-delegation shape. The underlying git worktree is identical and available to both via the CLI; the asymmetry is in how much the tool manages for you, and it is the same design philosophy that runs through D3 and D4.

Ledger — D2

The Lab Roster

Engraved positions, not portraits. A seat fills itself when its lesson is complete.

Positions

  • the data manager

    Position vacant — engaged at C2

    write-time contract hooks (PreToolUse/PostToolUse + the validation suite)

    est. human-RA: permanent vigilance — est. 2 weeks/year of load-checking and release-note reading agent: half a day to install and test the 9-line block; ~20 s per run thereafter

  • the methodologist

    Position vacant — engaged at C1

    the researcher skill library v1 (/clean-trips, /paper-summary, /demanding-adviser) — codified methodology, not macros

    est. human-RA: the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do agent: an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked

  • the data engineer

    Position vacant — engaged at C3

    MCP connections + the DuckDB warehouse, enrichment joins (weather/events/holidays), and the zone-hour analysis panel

    est. human-RA: days of bespoke glue per source — credentials, retries, schema spelunking, timezone forensics — re-debugged every time a source changes agent: register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication

  • the RA pool

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the overnight RA

    Position vacant — engaged at D3

    /loop supervision + Goal Mode runs over background estimation

    est. human-RA: one night shift per estimation batch — and the course runs several batches agent: ~10 min to write the check or the objective; the night itself belongs to the machine

  • the adviser

    Position vacant — engaged at D1

    parallel subagents with report contracts (EDA + scholarship fleets) + the isolated adviser

    est. human-RA: a week of breadth EDA across boroughs and slices, plus a literature pass — and no honest outside critic you can summon at will agent: ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes

  • the referee

    Position vacant — engaged at D4

    contracted fleet fan-out (results contract + provenance) and an isolated adversarial referee

    est. human-RA: the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for agent: 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass

  • the lab manager

    Position vacant — engaged at E2

    scheduled/cloud agents — the monthly-ingest routine, stopping at a human-approved PR

    est. human-RA: a recurring monthly chore nobody owns — check the CDN, pull, contract, append, re-estimate — reliably skipped agent: ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate

  • the reproducibility checker

    Position vacant — engaged at E1

    headless invocation + the fresh-clone replication self-test + CI gates

    est. human-RA: a clean-room rebuild every few weeks — dull, exacting, and the first thing dropped at submission agent: ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter

  • the wall: the unstaffed midnight hours between a raw file and a first plot

    Position vacant — engaged at A1

    the bare agent loop (prompt → act → observe → fix), zero configuration

    est. human-RA: an evening or two per messy file — defensive parsing rewritten from scratch each project, rules forgotten by the time they work agent: ~10 minutes for the quick win, plus the same task re-run in the other language for free

  • you, working an order of magnitude faster, but only if you direct the work

    Position vacant — engaged at A2

    the command surface + five prompting patterns + context hygiene

    est. human-RA: the slow tax of an undriven session — drifted answers on long investigations, re-runs to find where it went wrong agent: ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts

  • the lab manual nobody writes: the institutional knowledge that lives in your head

    Position vacant — engaged at B1

    instruction files (CLAUDE.md / AGENTS.md) + auto-memory + the A/B demonstration

    est. human-RA: ~30 min re-onboarding every new RA, every time — plus the afternoons lost to landmines no one wrote down agent: written once in an hour; reloaded free at the start of every session thereafter

  • the careful senior who plans before touching data

    Position vacant — engaged at B2

    repo scaffold + pinned environments + read-only Plan mode reconnaissance

    est. human-RA: ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots agent: an afternoon — most of it download wall-clock, not attention

  • the lab whose members don't overwrite each other

    Position vacant — engaged at D2

    git worktrees — one isolated checkout per agent/session/thread, combined through a deliberate merge

    est. human-RA: the lost afternoon disentangling two agents' colliding edits — and the redo when you reconstruct it wrong the first time agent: two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end

  • the onboarding the lab never has to repeat

    Position vacant — engaged at E3

    lab-kit — the whole methodology packaged as a one-command install

    est. human-RA: six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over agent: ~half a day to package and smoke-test the kit once; each new member is one install and one prompt

  • the whole lab, orchestrated: the PI who designs the system instead of doing the work

    Position vacant — engaged at F1

    the research loop (/loop ↔ Goal Mode / @codex) orchestrating fleet → referee → headless re-run → regenerated report, under report-don't-act guardrails, a hard budget cap, and a human gate on substantive decisions only

    est. human-RA: each revision is a serialized chain — re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract — correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits agent: the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended

Running Totals

Lesson | Role | Est. human-RA | Agent (yours when measured)
A1 | the wall: the unstaffed midnight hours between a raw file and a first plot | an evening or two per messy file; defensive parsing rewritten from scratch each project, rules forgotten by the time they work | ~10 minutes for the quick win, plus the same task re-run in the other language for free
A2 | you, working an order of magnitude faster, but only if you direct the work | the slow tax of an undriven session: drifted answers on long investigations, re-runs to find where it went wrong | ~30 min to learn; thereafter a first-look on one month (3.5M rows) in minutes, with receipts
B1 | the lab manual nobody writes: the institutional knowledge that lives in your head | ~30 min re-onboarding every new RA, every time, plus the afternoons lost to landmines no one wrote down | written once in an hour; reloaded free at the start of every session thereafter
B2 | careful senior who plans before touching data | ~1 week at project start (setup, download babysitting, plan review) + the joins redone when structure rots | an afternoon, most of it download wall-clock, not attention
B3 | the data manager who guards the raw files: the person who says no near the master copies | permanent vigilance you cannot staff; one lapse at machine speed costs a month of re-downloads | two profiles configured once in minutes; the fence then holds every session, tired or not
C1 | the methodologist: the one person who knows how the lab actually decides | the judgment lives in one head; transferring it to a new RA costs weeks of shadowing, and leaves when they do | an afternoon to author three SKILL.md files in both dialects; zero cost per session until invoked
C2 | data manager / QA who never sleeps | permanent vigilance, est. 2 weeks/year of load-checking and release-note reading | half a day to install and test the 9-line block; ~20 s per run thereafter
C3 | the data engineer who wires the lab to its systems | days of bespoke glue per source (credentials, retries, schema spelunking, timezone forensics), re-debugged every time a source changes | register the server once; the agent explores INFORMATION_SCHEMA and builds the panel in a guided session, raw cached for replication
D1 | the RA pool, and the adviser who critiques from outside | a week of breadth EDA across boroughs and slices, plus a literature pass, and no honest outside critic you can summon at will | ~20 min to write the agent definition + report contract; the fleet runs in parallel; the isolated adviser critiques in minutes
D2 | the lab whose members don't overwrite each other | the lost afternoon disentangling two agents' colliding edits, and the redo when you reconstruct it wrong the first time | two commands to create the worktrees; the parallelism runs free; one reviewed merge at the end
D3 | overnight RA | one night shift per estimation batch, and the course runs several batches | ~10 min to write the check or the objective; the night itself belongs to the machine
D4 | an RA bench and the PI who keeps their results comparable | the curve is ~2 days of serialized edit-and-fit; the suspicious read of the robustness table is the rarer, senior hour nobody has time for | 13 lanes fanned out under the cap finish in an afternoon; the referee files its evidenced finding in one isolated pass
E1 | reproducibility checker | a clean-room rebuild every few weeks: dull, exacting, and the first thing dropped at submission | ~20 min to wire scripts/replicate.sh and the gate workflow; the verdict returns in one headless run thereafter
E2 | lab manager's standing chores | a recurring monthly chore nobody owns (check the CDN, pull, contract, append, re-estimate), reliably skipped | ~30 min to define the routine + guardrails once; each month runs unattended and stops at the approval gate
E3 | the onboarding the lab never has to repeat | six weeks of per-member onboarding, rediscovered from scratch every time the lab turns over | ~half a day to package and smoke-test the kit once; each new member is one install and one prompt
F1 | the whole lab, orchestrated: the PI who designs the system instead of doing the work | each revision is a serialized chain (re-spec, re-estimate, re-table, rewrite the paragraph, re-read the abstract), correct only as of the last manual pass, on a Sunday; a real reviewer round is days of hand-carried edits | the loop runs two iterations to convergence in one supervised sitting; the human stands at exactly one gate (approve dropping the post-treatment control) while the mechanical fixes proceed unattended
Positions absorbed 0 of 16

The honest column: every place a human had to step in lives in the Field Journal’s failure log. Your measured hours there override these estimates here.