The Pain
There is a person, in a well-run lab, whose entire job is to stand between everyone else and the raw data. They keep the master copies under lock, hand out working copies, and say no a great deal: no, you cannot edit the original; no, that script does not get to run against the production tables; no, you may not paste the database password into a shared notebook. They are not popular on the afternoons they say it. They are indispensable on the mornings they were right.
You do not have that person. You have a new collaborator who works at
machine speed, takes instructions literally, and has — until you decide
otherwise — exactly the same reach into your filesystem and your shell as
you do. Most of the time that reach is a gift. The exception is the one
that matters: the afternoon it reasons its way to the cleanest fix is to
rebuild this table from scratch, and DROP TABLE trips_raw is a single
confident command away from a month of re-downloads. Or the moment it
decides a raw Parquet would be tidier with the duplicates removed, and
overwrites the one file in the project that cannot be regenerated. Speed
is the multiplier on both the help and the harm. The data manager’s no
was never about distrust. It was about making the expensive mistakes
impossible rather than merely unlikely — and that is a thing you can
configure before the stakes arrive, instead of a person you must hire.
Why / When
A safety model is the set of rules that decide, before any action runs, whether the agent may do it outright, must ask you first, or is forbidden. It is the configured form of the data manager’s judgment: named risk levels you switch between deliberately, so the agent’s reach matches the task instead of always matching yours.
The discipline belongs at the front of the project, before Unit C hands the agent the power to modify data and run pipelines. The threat model is specific and real: destructive SQL against the warehouse, overwrites of irreplaceable raw files, an API loop that runs unbounded overnight, credentials that leak into a transcript and live there forever. None of these require malice — only literal-mindedness at speed. You configure the fence now, while the only thing it costs is five minutes, rather than later, while it costs a month.
The honest non-use case is the throwaway: a read-only look at public data on a clone you will delete. There is nothing there to protect, and the friction of a permission prompt buys you nothing. Everywhere the data is real, the fence earns its keep on the first afternoon you would otherwise have lost.
Contrary winds
Not for: reading-only exploration of public data on a throwaway clone, where there is nothing destructive to fence and the friction buys you nothing.
Mechanics
Both tools implement the same idea — name your risk levels, switch deliberately — through different surfaces. The shared principle first, the dialects in tabs, then the two profiles you will use all course, and the refusal that proves the fence is real.
The two safety models
Underneath both tools sit two layers that compose. The decision layer asks, for each proposed action, allowed, ask, or denied? — and the sandbox layer is the operating system enforcing a boundary the agent cannot reason its way around, so that even a misconfigured decision rule cannot reach outside the fence. Decision rules are policy; the sandbox is physics. Defense in depth means both.
Claude Code
Claude Code’s decision layer is permission rules, evaluated
deny → ask → allow: a deny rule wins outright, an ask rule pauses for
your approval, an allow rule proceeds silently. Rules match tools and
their arguments, so you can allow Bash(duckdb:*) reads while denying
Bash(rm:*) and asking on everything else.
Stacked on top are permission modes — session-wide postures like the default ask-on-write, a read-only plan posture, and an accept-edits posture for trusted bulk work — and beneath everything, the OS sandbox that confines file and network access at the operating-system level regardless of what the rules say.
The rules live in settings.json, and the layering is the point: user,
project, and local files merge, so a team can commit a project-layer
deny that an individual cannot silently loosen.
{ "permissions": { "deny": [ "Write(./data/raw/**)", "Bash(rm:*)", "Bash(duckdb:* DROP *)" ], "ask": [ "Write(./data/processed/**)", "Write(./results/**)" ], "allow": [ "Bash(duckdb:* SELECT *)", "Read(./**)" ] }}Codex
Codex’s decision layer is approval modes, three named postures you switch between: read-only (the agent may read and run read-only commands, nothing else), auto (it works in the workspace and asks before stepping outside it), and full-access (it acts without prompting — reserved for work you are watching). The mode is the coarse dial; you turn it down for exploration and up only deliberately.
Beneath that sits the OS sandbox — Seatbelt on macOS, Landlock on Linux — confining file and network reach at the kernel level, so a read-only session physically cannot write even if a rule were misconfigured.
Both compose into named profiles in config.toml: a profile bundles
an approval mode and a sandbox policy under one name you select per
session, and the layered .codex/ team config lets a project commit
profiles every clone inherits.
[profiles.exploration]approval_policy = "read-only"sandbox_mode = "read-only" # Seatbelt / Landlock: no writes at all
[profiles.pipeline]approval_policy = "auto"sandbox_mode = "workspace-write"writable_roots = ["data/processed", "results"] # raw/ stays outBoth reduce to one practice: name your risk levels and switch between them deliberately. The names below are the ones the whole course uses.
Two named profiles
A research project needs exactly two postures, and naming them is what makes switching a deliberate act rather than a vague feeling of caution:
- exploration — read-only data, no writes, no network egress. This is the default posture for surveying the warehouse, drafting plans, and any session where you are thinking rather than building. It is also what B2’s plan-first reconnaissance runs under, and what the refusal demo below exercises.
- pipeline — writes scoped to
data/processed/andresults/only;data/raw/stays read-only even here, because raw is the one thing that cannot be regenerated. This is the posture for running transforms and estimation — the agent can build, but it cannot touch the source of truth or escape the directories it is allowed to rebuild.
Least privilege means defaulting to exploration and rising to
pipeline only for the span of work that genuinely writes, then falling
back. The fence you most need is the one between every profile and
data/raw/: in both profiles, raw is read-only, because the cheapest
month you will ever spend is the one you never have to re-download.
One discipline both tools share and both make easy to forget: fencing the
file tools but leaving the shell open. The shell is the universal escape
hatch — rm, psql, a stray > redirect — so a profile that denies
Write(./data/raw/**) but allows arbitrary shell has fenced the front
door and left the window open. Scope the shell too.
Credentials hygiene
The warehouse password, the API token, the cloud key: none of them belong
in a prompt, and none of them belong in config.toml or settings.json.
A secret pasted into a prompt does not vanish when the answer comes back —
it lives in the transcript, which is logged, synced, and quite possibly
committed, forever. The rule is unconditional: secrets come from the
environment, never from the conversation.
# Set in the shell that launches the agent — read from the environment,# never typed into a prompt, never written to a committed config file.export WAREHOUSE_DSN="$(security find-generic-password -s wm-warehouse -w)"# Scripts read os.environ / Sys.getenv — the agent passes the name, not# the value, and the value never enters a transcript.The transcript is the leak surface you forget about. A key typed once, in a moment of haste, is a key you must now rotate — so the discipline is to make typing it impossible, not to remember not to.
Which verdict does an action earn?
The decision layer reduces to three verdicts — allow, ask, deny — and the skill is knowing which a given action deserves before it runs. Walk a handful of real situations from the project; the navigator returns the verdict the situation earns and the reason behind it, in the surface your tool uses:
An action is about to run under your safety model. Which verdict should it earn? Walk a few real situations from the weather-mobility project — the rule is least privilege, deliberately applied, not a vague feeling of caution.
The demonstrated refusal
Safety you have read about is a claim; safety you have seen fire is a
fact. Below, the agent runs under the exploration profile and is asked
to do exactly the thing exploration forbids: drop trips_raw. Before the
refusal plays, you predict the verdict — allow, ask, or deny? Watch the
fence hold.
Guided Run — The Demonstrated Refusal
claude --permission-mode planField Assignment
Artifact both profiles committed; the refusal demonstrated and logged
Configure the two profiles, prove the fence holds by tripping it on purpose, and write down what you saw.
- Author both named profiles — exploration (read-only, no network
writes) and pipeline (writes scoped to
data/processed/andresults/, raw still read-only) — and commit them to the project layer so every clone inherits the fence. - Move your warehouse credential into the environment; confirm no secret appears in any committed config file or transcript.
Claude Code
Commit the permission rules to the project layer of settings.json: a
deny on writes to data/raw/** and on rm/DROP, an ask on writes
to data/processed/**, and allow on reads. Open a session under the
read-only posture and attempt DROP TABLE trips_raw — confirm the deny
rule fires before anything runs.
Codex
Commit both profiles to config.toml (and the layered .codex/ team
config). Start a session under the exploration profile and attempt
DROP TABLE trips_raw — confirm the read-only approval mode and the
sandbox both refuse, then check that pipeline still cannot write to
data/raw/.
- Run the demonstrated refusal in your tool — predict the verdict, watch
it fire — and log the incident in
journal/b3-refusal.md: what you asked, what the fence did, and which profile you were in.
The two profiles are the safety substrate the whole midgame runs on: C2 doubles the raw-data fence at the hook layer, and every later lesson assumes exploration-by-default with deliberate elevation to pipeline.
make check-b3advances B3This is the default posture: surveying, planning, thinking. It is also what B2's reconnaissance runs under.
Raw is the one thing that cannot be regenerated, so no profile may write it.
A key pasted into a prompt lives in the transcript forever. The agent passes the name, never the value.
Predict the verdict (allow / ask / deny), then watch the fence hold. Safety you have seen fire is safety you trust.
Check each item only once it is true of YOUR repo — the gate is self-certified, like the rest of your methodology.
Pitfalls & Gotchas
- [both]
〜〜
Living in the most permissive mode “to avoid friction.” Every prompt you wave through is vigilance you have rented back from the tool that was built to absorb it — and the one time it matters, you will wave through the one that mattered, at 2 a.m., on muscle memory. Default to exploration; rise to pipeline for the span of work that writes, then fall back. Friction at the dangerous moments is the product, not a bug.
- [both]
Fencing the file tools but not the shell. The shell is the universal escape hatch: a profile that denies writes to
data/raw/but allows arbitraryBashhas locked the door and left the window open —rm, a redirect, apsqlone-liner all walk straight through. Scope the shell with the same care as the file tools. - [both]
〜〜
A key pasted into a prompt lives in the transcript forever. The transcript is logged, synced, and often committed; a secret typed once in haste is a secret you must now rotate. The fix is not vigilance — it is making the paste impossible: secrets come from the environment, the agent passes the name and never the value.
- [CX]
Subagents inherit the parent’s sandbox overrides. A child dispatched from a session you elevated to pipeline does not quietly fall back to exploration — it carries the parent’s writable roots with it. This returns in D1, where subagent isolation gets its own treatment; for now, assume reach flows downhill and check before you delegate.
Check Your Bearings
This check opens when the guided simulation above is complete — the questions assume you have seen the run.
(noted in your field journal as an override)Field journal
Parity note
Safety models are a genuine parity feature: both tools compose a decision
layer over an OS sandbox and both reduce to naming risk levels and
switching deliberately — Claude Code through deny/ask/allow permission
rules plus permission modes in layered settings.json, Codex through
read-only/auto/full-access approval modes plus Seatbelt/Landlock,
bundled into named profiles in config.toml. The surfaces differ in
granularity (per-tool-and-argument rules on one side, coarser named modes
on the other), but the practice — exploration by default, deliberate
elevation, raw always read-only — is identical, and the OS sandbox under
both is the layer no misconfiguration can talk its way past.