
Frameworks for tool-using agents with safe code execution (Docker sandboxing, restricted network/egress)
Quick Answer: The safest way to let AI agents run tools and code is to combine an agent framework that understands tools with a hardened execution sandbox (typically Docker) and strict network/egress controls. AutoGen gives you this stack out of the box via autogen-core (runtime), AgentChat agents, and the DockerCommandLineCodeExecutor, so you can iterate quickly without exposing your host system or internal network.
Why This Matters
As soon as you let an AI agent write and execute code, call shell commands, or hit external APIs, your “toy” demo turns into a security surface. Without isolation, a misaligned prompt, an LLM hallucination, or a clever user can:
- Read local files you didn’t intend to expose
- Exfiltrate secrets over the network
- Fork-bomb or DoS your host machine
Frameworks that treat safe execution as a first-class runtime concern—rather than an afterthought—let you prototype powerful tool-using agents while preserving security boundaries, auditability, and predictable blast radius.
Key Benefits:
- Safer experimentation: Use containers and runtime isolation so agents can fail, hallucinate, or explore without compromising your host or tenant data.
- Operational control: Enforce CPU/memory limits, no-root policies, and restricted network/egress on every code execution, not just “in theory.”
- Composability & reuse: Build agents and tools once, then run them locally with a SingleThreadedAgentRuntime or in a distributed runtime, without rewriting your agent logic.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Runtime Environment | In AutoGen Core, the runtime that manages agent identities, message routing, and security/privacy boundaries (standalone or distributed). | This is where you actually enforce isolation, topics/subscriptions, and lifecycle—not in your prompt. |
| Tool / Code Executor | A callable that an agent can invoke to run code or shell commands, e.g., DockerCommandLineCodeExecutor. | Determines how “dangerous” a tool call can be; Docker sandboxes let you isolate workloads and restrict access. |
| Network / Egress Control | OS- and container-level rules that determine what the executed code can talk to over the network. | Prevents data exfiltration, lateral movement, and unintended calls to internal or external services. |
How It Works (Step-by-Step)
At a practical level, frameworks for tool-using agents with safe code execution combine three layers:
- Agent layer (Planner/Worker + tools)
  - The agent decides what needs to be executed (e.g., “run this Python snippet”) and calls a registered tool.
  - In AutoGen AgentChat, this is often an AssistantAgent that has tools like a code executor attached.
- Execution sandbox (Docker / similar)
  - The tool implementation (e.g., DockerCommandLineCodeExecutor) receives a command, runs it inside a container, and returns stdout/stderr.
  - You configure: image, resource limits, read-only file system, mounted volumes, and network/egress.
- Runtime enforcement (Core runtime)
  - autogen-core’s runtime manages who can call which tools, where the results go, and how long agents run.
  - You can use standalone (SingleThreadedAgentRuntime) for local workflows or a distributed runtime topology for multi-tenant, scalable execution.
Below is how to do this with AutoGen step-by-step.
Installation
Python 3.10 or later is required.
Install the AutoGen layers you’ll need:
```shell
pip install -U "autogen-agentchat" "autogen-core" "autogen-ext[docker]" "autogen-ext[openai]"
```
Notes:
- autogen-agentchat – high-level agent API (recommended entry point for most workflows).
- autogen-core – event-driven runtime and topics/subscriptions.
- autogen-ext[docker] – includes DockerCommandLineCodeExecutor for sandboxed code execution.
- autogen-ext[openai] – sample model client; replace with your provider as needed.

Set an OpenAI-style model key (or equivalent for your chosen backend):
```shell
export OPENAI_API_KEY="sk-..."
```
Minimal Safe Code Execution with AutoGen
1. Create a standalone runtime
Start with the standalone runtime for local workflows. You can later switch to distributed without changing your agent implementation.
```python
from autogen_core import SingleThreadedAgentRuntime

runtime = SingleThreadedAgentRuntime()
runtime.start()  # begin processing messages in the background
```
2. Set up a Docker-based code executor
Use DockerCommandLineCodeExecutor from autogen-ext to run code in a container, not on your host:
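```python
from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor

# NOTE: image and timeout are standard constructor arguments. The resource,
# network, and read-only keywords are shown as in this walkthrough; confirm that
# your installed autogen-ext version accepts them before relying on them.
code_executor = DockerCommandLineCodeExecutor(
    image="python:3.11-slim",
    timeout=15,             # seconds
    mem_limit="512m",       # memory cap
    cpu_period=100000,
    cpu_quota=50000,        # ~0.5 CPU
    network_disabled=True,  # no outbound network from the container
    read_only=True,         # no write access to container filesystem
)
```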
Key safety controls:
- network_disabled=True – disables network inside the container (no egress by default).
- read_only=True – prevents writes; pair with a temporary volume if you need ephemeral storage.
- Resource limits – protect your host from runaway CPU/memory usage.
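The controls listed above correspond to standard docker run flags, which is a useful cross-check when an executor wrapper does not expose a particular setting. The following is a minimal sketch, assuming you launch the container through the Docker CLI yourself rather than through AutoGen. SandboxLimits and docker_run_argv are hypothetical names introduced here, and the specific limits (64 PIDs, a 64 MB tmpfs, the unprivileged 65534 user) are illustrative choices, not recommendations from the original setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxLimits:
    """Hypothetical container limits mirroring the settings discussed above."""
    image: str = "python:3.11-slim"
    memory: str = "512m"
    cpus: float = 0.5
    pids_limit: int = 64  # caps process count, which blunts fork bombs


def docker_run_argv(limits: SandboxLimits, command: list[str]) -> list[str]:
    """Build a `docker run` argument list with a deny-by-default posture."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no egress at all
        "--read-only",                          # root filesystem is read-only
        "--tmpfs", "/tmp:rw,size=64m",          # small ephemeral scratch space
        "--memory", limits.memory,
        "--cpus", str(limits.cpus),
        "--pids-limit", str(limits.pids_limit),
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--user", "65534:65534",                # run as an unprivileged user
        limits.image,
        *command,
    ]


argv = docker_run_argv(SandboxLimits(), ["python", "-c", "print('hello')"])
print(" ".join(argv))
```

Passing the list to subprocess.run together with a timeout gives roughly the same envelope as the executor configuration, and printing it makes the effective policy easy to review in logs.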
3. Build an AgentChat AssistantAgent with the tool
Now expose this executor as a tool to an agent.
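```python
from autogen_agentchat.agents import AssistantAgent
from autogen_core import CancellationToken
from autogen_core.code_executor import CodeBlock
from autogen_core.tools import FunctionTool
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")


# Wrap the code executor as a tool. The executor must already be started
# (await code_executor.start()) before the first call.
async def run_python_in_docker(code: str) -> str:
    """Run a Python snippet in the Docker executor and return its output."""
    result = await code_executor.execute_code_blocks(
        [CodeBlock(code=code, language="python")],
        cancellation_token=CancellationToken(),
    )
    return result.output


docker_tool = FunctionTool(
    run_python_in_docker,
    name="run_python_in_docker",
    description="Execute Python code in a sandboxed Docker container with no network access.",
)

assistant = AssistantAgent(
    name="sandboxed_coder",
    model_client=model_client,
    tools=[docker_tool],
)
```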
4. Run a simple task
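```python
import asyncio

from autogen_agentchat.teams import RoundRobinGroupChat


async def main() -> None:
    await code_executor.start()  # start the container before any tool call
    # SelectorGroupChat expects at least two participants, so a single-agent
    # team uses RoundRobinGroupChat; add more agents to the list later.
    team = RoundRobinGroupChat([assistant], max_turns=1)
    try:
        result = await team.run(
            task="Write Python code to compute the first 10 Fibonacci numbers and run it."
        )
        print(result.messages[-1].content)
        print("Stop reason:", result.stop_reason)
    finally:
        await code_executor.stop()  # always tear the container down


asyncio.run(main())
```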
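Pay attention to TaskResult.stop_reason to understand why the run ended, for example a turn or message limit being reached versus an external stop, and log it alongside the tool output so timeouts and tool errors are easy to trace.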
Common Mistakes to Avoid
- Letting agents execute code on the host OS directly:
  How to avoid it: Always use a sandboxed executor like DockerCommandLineCodeExecutor. Do not register tools that call subprocess directly unless they themselves enforce isolation.
- Forgetting to lock down network/egress inside containers:
  How to avoid it: Start with network_disabled=True. If you must allow network, use a custom Docker network and firewall rules to limit accessible hosts. Never mount cloud SDK credentials or unrestricted kubeconfig into the container.
- Mounting sensitive host directories into containers:
  How to avoid it: Use ephemeral volumes for scratch space. Avoid mounting $HOME, /var, or any directory that may contain secrets or internal configs.
- Skipping log monitoring and human-in-the-loop checks in early stages:
  How to avoid it: Capture and review logs for every tool invocation. For higher-risk tools (like shell), require explicit human approval for execution in early iterations.
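Two of the mitigations above, refusing sensitive mounts and gating risky tool calls behind a reviewer, can be enforced with small guards in front of the executor. This is a minimal sketch using only the standard library. The names is_safe_mount and run_with_approval, the default denylist, and the rejection message are assumptions made for illustration, and the approve callback stands in for whatever review step (CLI prompt, ticket, chat approval) you actually use.

```python
import logging
import os
from pathlib import Path
from typing import Callable, Iterable

log = logging.getLogger("tool_audit")

# Illustrative denylist; extend it for your environment.
DEFAULT_SENSITIVE_ROOTS = (
    Path(os.path.expanduser("~")),
    Path("/var"),
    Path("/etc"),
    Path("/root"),
)


def is_safe_mount(
    host_path: str,
    sensitive_roots: Iterable[Path] = DEFAULT_SENSITIVE_ROOTS,
) -> bool:
    """Return False if host_path is, is inside, or contains a sensitive root."""
    candidate = Path(host_path).resolve()
    for root in sensitive_roots:
        root = root.resolve()
        if candidate == root or root in candidate.parents or candidate in root.parents:
            return False
    return True


def run_with_approval(
    tool: Callable[[str], str],
    code: str,
    approve: Callable[[str], bool],
) -> str:
    """Log every invocation and call the tool only when approve(code) is true."""
    log.info("tool call requested: %r", code)
    if not approve(code):
        log.warning("tool call rejected by reviewer")
        return "Execution rejected by reviewer."
    output = tool(code)
    log.info("tool call finished: %d chars of output", len(output))
    return output
```

Resolving paths before comparing them means a symlink pointing into a sensitive directory is rejected as well. Returning a plain rejection string, rather than raising, lets the agent see that the call was refused and continue the conversation.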
Real-World Example
In our regulated environment, we started with a “code assistant” that could both reason about Python and execute small code snippets against synthetic data. The first prototypes used direct Python execution, which quickly became unmanageable from a risk perspective.
We migrated to:
- A standalone runtime (SingleThreadedAgentRuntime) running AgentChat agents.
- A DockerCommandLineCodeExecutor configured with:
  - network_disabled=True and a tight CPU/memory limit.
  - A read-only image with only python and a few whitelisted libraries.
- A message filter layer to strip out any attempts to request host paths, credentials, or network access before passing prompts to the agent.
The same agent definitions now run unchanged in a distributed runtime topology (host servicer + workers + gateways) for multi-tenant workloads; we just moved the execution to worker nodes with more CPU and attached per-tenant Docker hosts. Because the security boundaries lived at the runtime layer and in the executor configuration, we didn’t have to rewrite the agent logic or tool contracts to scale up.
Pro Tip: Treat your Docker executor configuration as code (version-controlled) and test it like you would test a firewall rule: start with “deny all” (no network, no mounts) and only add explicit capabilities when a use case justifies it.
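One way to apply this tip is to keep the policy in a small, version-controlled data structure and assert on it in CI, the same way you would test a firewall rule set. The sketch below is an assumption-laden illustration: ExecutorPolicy, its fields, and policy_violations are hypothetical names, not part of AutoGen or Docker, and the two checks shown are only examples of rules you might enforce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorPolicy:
    """Hypothetical, version-controlled description of what the sandbox may do."""
    network_enabled: bool = False        # deny egress unless a use case justifies it
    allowed_hosts: tuple[str, ...] = ()  # explicit allowlist when network is on
    mounts: tuple[str, ...] = ()         # "host:container[:mode]" strings
    memory: str = "512m"


def policy_violations(policy: ExecutorPolicy) -> list[str]:
    """Return human-readable reasons why a policy is broader than intended."""
    problems: list[str] = []
    if policy.network_enabled and not policy.allowed_hosts:
        problems.append("network enabled without an explicit host allowlist")
    for mount in policy.mounts:
        if not mount.endswith(":ro"):
            problems.append(f"mount is not read-only: {mount}")
    return problems


# Firewall-style checks: the default must deny, and each exception must be explicit.
assert policy_violations(ExecutorPolicy()) == []
assert policy_violations(ExecutorPolicy(network_enabled=True)) != []
assert policy_violations(ExecutorPolicy(mounts=("/srv/data:/data",))) != []
```

Because the dataclass is frozen and defaults to no network and no mounts, any widening of the sandbox shows up as an explicit diff in code review.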
Summary
Tool-using agents only become safe when you pair them with a runtime that enforces isolation and strict execution policies. AutoGen’s layered stack—Core runtime, AgentChat agents, and DockerCommandLineCodeExecutor from autogen-ext—gives you a concrete, code-first way to:
- Run agent-authored code in containers instead of on your host
- Apply resource limits and network/egress controls consistently
- Migrate from local to distributed runtimes without rewriting your agents
Focus first on the runtime and sandbox choices (topics, runtimes, and executors), then on prompts and models.