Agent Sandbox Security

An agent with filesystem access and an open network can leak your credentials. Here's the two-layer egress and approval config that closes the gap.

Kimmo Nurmisto

Founder, Grolea · 7 min read

Give an agent a filesystem and an outbound network connection, and you have handed it everything it needs to read a credential and send it somewhere you never chose. That is the threat in one sentence, and if you run an agent fleet in production, you have probably already felt the edge of it. The question is not whether the exposure exists. It does. The question is which two dials you turn to close it.

This is an operator's threat model for agent sandbox security, not a generic DevSecOps checklist. The exposure pattern is specific to how agents run: long-lived, semi-autonomous, holding real secrets, acting on instructions that can come from a tool output or a web page you don't control. Below is what an unsandboxed agent can actually reach, and the two-layer config that takes the credential leak off the table.

The threat: what an unsandboxed agent can reach

Start with what your agent is holding. To do real work it needs secrets: an API key for the model provider, a token for your issue tracker, cloud credentials, maybe a database URL. Those usually live in environment variables or dotfiles, which is exactly where the agent's own process can read them. printenv, a peek at ~/.aws/credentials, a glance at an .env file in the repo it just checked out. Nothing exotic. The agent has filesystem access by design, and credentials sit on that filesystem.
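A quick way to see this for yourself is to list what the agent process could read, without printing any values. The sketch below is illustrative only: the name pattern and the candidate file lists are assumptions, not a complete catalogue, and should be adapted to your own conventions.

```python
import os
import re
from pathlib import Path

# Hypothetical name pattern; it will produce false positives and miss odd names.
SECRET_NAME = re.compile(
    r"(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL|DATABASE_URL)", re.IGNORECASE
)

# Illustrative candidates only: files under the home directory and the working tree.
HOME_FILES = [".aws/credentials", ".netrc", ".npmrc"]
TREE_FILES = [".env"]


def secret_env_names(env):
    """Return the names (never the values) of variables that look like secrets."""
    return sorted(name for name in env if SECRET_NAME.search(name))


def readable_credential_files(home, workdir):
    """Return candidate credential files that exist and are readable by this process."""
    candidates = [Path(home) / f for f in HOME_FILES]
    candidates += [Path(workdir) / f for f in TREE_FILES]
    return [str(p) for p in candidates if p.is_file() and os.access(p, os.R_OK)]
```

Run inside the sandbox as the agent's own user, `secret_env_names(os.environ)` and `readable_credential_files(Path.home(), Path.cwd())` give a rough first inventory of what a tricked agent could pick up.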

Now add the second half: an open outbound network. An agent that can read a secret and also make an arbitrary HTTPS request can move that secret off the box in a single step. It does not take a malicious model to get there. Any one of these is enough: a prompt-injected instruction buried in a fetched web page, a poisoned dependency, or a tool that returns "helpfully" formatted text the agent decides to act on. The agent was never compromised in the dramatic sense. It just did what an agent does, with reach it should never have had.

Operators know this is real because they keep running into it. The egress credential-firewall thread in the Hermes community filled with operators comparing notes on exactly this exposure, which is the kind of back-and-forth that only happens around a problem people have actually hit. The conversation that started there names the failure mode precisely: agent credential isolation is not a property you get for free by putting the agent in a container. A container with your production keys in its environment and the open internet on its other side is not isolated in any way that matters.

So the goal is concrete. Assume the agent can read every secret it has access to. Assume that an instruction it follows might be hostile. Then make it so that knowing a secret does the attacker no good, because the secret has nowhere to go and the dangerous action never executes. That is two separate controls, and you want both.

Layer one: the egress firewall

The first dial controls what leaves the sandbox. By default, an agent's network is wide open: it can reach any host on the internet. An egress firewall inverts that. You move to default-deny on outbound traffic and allow only the specific destinations the agent legitimately needs: your model provider's endpoint, your own APIs, the package registry, whatever the job actually requires. Everything else is refused at the network boundary.

The clean way to enforce this is an egress proxy. Route all the agent's outbound traffic through a proxy that holds the allowlist, and the agent never talks to the open internet directly. The proxy decides, per request, whether the destination is on the list. This is the pattern operators mean when they talk about an egress proxy for agents: the agent's reach is mediated, logged, and constrained to a set you approved in advance.
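The per-request decision can be sketched in a few lines. This is a minimal illustration of the default-deny check only, not a working proxy: the hostnames are placeholders, and the names `ALLOWLIST`, `is_allowed`, and `decide` are hypothetical. It assumes the proxy sees a destination host and port for each request, as it would for an HTTPS tunnel.

```python
import logging

log = logging.getLogger("egress")

# Hypothetical allowlist: exact hostnames only, each paired with its permitted port.
ALLOWLIST = {
    ("api.model-provider.example", 443),
    ("internal-api.example", 443),
    ("registry.example", 443),
}


def is_allowed(host, port, allowlist=ALLOWLIST):
    """Default-deny: True only when the exact (host, port) pair is listed."""
    normalized = host.rstrip(".").lower()
    return (normalized, port) in allowlist


def decide(host, port):
    """Log every decision so that denials become events you can alert on."""
    allowed = is_allowed(host, port)
    level = logging.INFO if allowed else logging.WARNING
    log.log(level, "egress %s %s:%d", "ALLOW" if allowed else "DENY", host, port)
    return allowed
```

Exact matching is a deliberate choice in this sketch: suffix or substring matching would let a lookalike such as `api.model-provider.example.evil.example` through, so wildcards are better added one at a time and only where a destination really needs them.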

Here is why this is the load-bearing layer. Exfiltration needs a destination. If the agent reads your cloud key but the only outbound hosts it can reach are your model provider and your own services, there is no endpoint to ship the key to. The attacker's instruction succeeds at reading the secret and then dies at the network boundary. You have not stopped the agent from being tricked, which is genuinely hard. You have made being tricked not pay off, which is achievable today with the config you already understand.

The proxy log is a second benefit worth naming. Default-deny means every blocked request is a recorded event, so an agent suddenly trying to POST to an address nobody allowlisted is a signal you can alert on rather than a thing you discover in a breach report later.
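As a rough sketch of turning those recorded events into a signal, assume the proxy's decisions are available as `(verdict, host)` pairs. That event shape and the function names are assumptions made for illustration; a real proxy will have its own log format to parse first.

```python
from collections import Counter


def denied_counts(events):
    """Count denied requests per destination host."""
    return Counter(host for verdict, host in events if verdict == "DENY")


def hosts_to_alert(events, threshold=1):
    """Return (host, count) pairs at or above the threshold, most frequent first."""
    counts = denied_counts(events)
    return [(host, n) for host, n in counts.most_common() if n >= threshold]
```

With a threshold of 1, every denied destination surfaces. Raising it trades noise for the risk of missing a single successful-looking probe, so it is worth starting low.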

Layer two: dangerous-command approval

Network control is necessary but not sufficient. The agent still executes commands locally, and some of those commands are dangerous, whether or not a single packet leaves the box. Writing to a credentials path. Installing an unvetted package. Running curl | sh on something a tool suggested. Deleting state you can't recover. The egress firewall says nothing about these because they happen inside the sandbox.

The second dial is a dangerous-command-approval gate: a checkpoint that pauses the agent before it runs a sensitive command and requires a sign-off, human or policy, before the command proceeds. Routine work runs untouched. The narrow set of actions that can cause real damage stops and asks. This is the same approval-gate thinking that governs any well-run agent fleet, applied to local execution instead of to publishing or spending. If you have set up gates before, the model is familiar; here, the gated actions are the commands that can compromise the host.
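A minimal sketch of such a gate follows. The rule names and regular expressions are hypothetical and deliberately narrow, and `approve` and `execute` are callables you would supply (a human prompt or a policy check, and your real command runner). String matching like this is easy to evade with aliases or wrapper scripts, so treat it as an illustration of the control flow rather than a complete policy.

```python
import re

# Hypothetical rules. The credential-path rule fires on any mention of the path,
# which is broader than "writes" but simpler to match from a command string.
RULES = [
    ("fetch piped to shell",
     re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b")),
    ("package install",
     re.compile(r"\b(pip3?|npm|yarn|apt|apt-get|brew|cargo|gem)\s+(install|add)\b")),
    ("recursive or forced delete",
     re.compile(r"\brm\s+-\w*[rRf]\w*\b")),
    ("credential path",
     re.compile(r"(\.aws/credentials|\.ssh/|(^|[\s/])\.env\b|\.netrc)")),
]


def matched_rules(command):
    """Return the names of every rule the command string triggers."""
    return [name for name, pattern in RULES if pattern.search(command)]


def run_gated(command, approve, execute):
    """Run routine commands directly; pause flagged ones until approve() signs off."""
    reasons = matched_rules(command)
    if reasons and not approve(command, reasons):
        return "blocked"
    execute(command)
    return "ran"
```

Routine commands such as `ls -la` match no rule and run untouched, while `curl ... | sh` stops and waits for `approve` to return true.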

What makes this worth the friction is that operators are converging on it independently. The egress-firewall conversation did not stay on networking. It pulled in a second cluster of operators arguing for a local execution gate, because they had each worked out on their own that controlling the network is only half the perimeter. When people who have built fleets separately arrive at the same second control, that is a strong signal that the single-layer version leaves a real gap. Reliability and security turn out to be the same infrastructure conversation: the gate that stops an agent from running a destructive command is the same class of mechanism that keeps a misread failure from cascading through your fleet.

The config checklist

If you want to harden an existing agent stack this week, this is the order I would work in:

Inventory the secrets the agent can read. Environment variables, dotfiles, mounted credential files, anything in the working tree. Assume every one of them is readable by the agent process, because it is.
Turn on default-deny egress. Block all outbound traffic, then allow back only the destinations the job needs. If you can't list them, you don't yet know what your agent is talking to, and that is the first finding.
Route outbound through an egress proxy. Centralise the allowlist in a proxy the agent must go through, so enforcement and logging live in one place instead of being scattered across the agent's own code.
Alert on blocked egress. A denied outbound request is a security event. Wire it to wherever you actually look, not to a log nobody reads.
Add a dangerous-command-approval gate. Define the narrow set of local commands that require sign-off (credential-path writes, package installs, network fetches piped to a shell, irreversible deletes) and gate exactly those.
Scope credentials down to what each agent needs. The fewer secrets an agent holds and the narrower their permissions, the less a successful read is worth. Defence in depth behind the first two layers, not a replacement for them.

Work top to bottom, and the credential leak that was one misconfigured variable away becomes a sequence of controls an attacker has to beat in order, with a logged alarm on the way through.
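The last item on the checklist, scoping credentials per agent, can be as simple as not passing the parent environment through. A minimal sketch, assuming the agent is launched as a subprocess; `scoped_env` and `run_agent` are hypothetical helper names, and the variable names in the usage note are placeholders.

```python
import os
import subprocess


def scoped_env(parent_env, allowed_names):
    """Copy only the explicitly allowed variables; everything else is dropped."""
    return {name: parent_env[name] for name in allowed_names if name in parent_env}


def run_agent(argv, allowed_names, parent_env=None):
    """Start the agent with a minimal environment instead of inheriting all of it."""
    source = os.environ if parent_env is None else parent_env
    env = scoped_env(source, allowed_names)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
```

Launching with something like `run_agent(argv, ["PATH", "MODEL_API_KEY"])` means cloud keys that happen to sit in the parent shell never reach the agent process at all. Credential files on disk need the same treatment through mounts and file permissions, which this sketch does not cover.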

The egress firewall is infrastructure you stand up around the runtime. The governance half is a separate layer: the dangerous-command approval gates, per-agent credential scoping, spend caps, and escalation paths that Paperclip gives an agent company out of the box, instead of you wiring each one by hand for every agent you run.
