frameworks for tool-using agents with safe code execution (Docker sandboxing, restricted network/egress)

Quick Answer: The safest way to let AI agents run tools and code is to combine an agent framework that understands tools with a hardened execution sandbox (typically Docker) and strict network/egress controls. AutoGen gives you this stack out of the box via autogen-core (runtime), AgentChat agents, and the DockerCommandLineCodeExecutor, so you can iterate quickly without exposing your host system or internal network.

Why This Matters

As soon as you let an AI agent write and execute code, call shell commands, or hit external APIs, your “toy” demo turns into a security surface. Without isolation, a misaligned prompt, an LLM hallucination, or a clever user can:

  • Read local files you didn’t intend to expose
  • Exfiltrate secrets over the network
  • Fork-bomb or DoS your host machine

Frameworks that treat safe execution as a first-class runtime concern—rather than an afterthought—let you prototype powerful tool-using agents while preserving security boundaries, auditability, and predictable blast radius.

Key Benefits:

  • Safer experimentation: Use containers and runtime isolation so agents can fail, hallucinate, or explore without compromising your host or tenant data.
  • Operational control: Enforce CPU/memory limits, no-root policies, and restricted network/egress on every code execution, not just “in theory.”
  • Composability & reuse: Build agents and tools once, then run them locally with a SingleThreadedAgentRuntime or in a distributed runtime—without rewriting your agent logic.

Core Concepts & Key Points

Concept | Definition | Why it's important
Runtime Environment | In AutoGen Core, the runtime that manages agent identities, message routing, and security/privacy boundaries (standalone or distributed). | This is where you actually enforce isolation, topics/subscriptions, and lifecycle, not in your prompt.
Tool / Code Executor | A callable that an agent can invoke to run code or shell commands, e.g., DockerCommandLineCodeExecutor. | Determines how “dangerous” a tool call can be; Docker sandboxes let you isolate workloads and restrict access.
Network / Egress Control | OS- and container-level rules that determine what the executed code can talk to over the network. | Prevents data exfiltration, lateral movement, and unintended calls to internal or external services.

How It Works (Step-by-Step)

At a practical level, frameworks for tool-using agents with safe code execution combine three layers:

  1. Agent layer (Planner/Worker + tools)

    • The agent decides what needs to be executed (e.g., “run this Python snippet”) and calls a registered tool.
    • In AutoGen AgentChat, this is often an AssistantAgent that has tools like a code executor attached.
  2. Execution sandbox (Docker / similar)

    • The tool implementation (e.g., DockerCommandLineCodeExecutor) receives a command, runs it inside a container, and returns stdout/stderr.
    • You configure: image, resource limits, read-only file system, mounted volumes, and network/egress.
  3. Runtime enforcement (Core runtime)

    • autogen-core’s runtime manages who can call which tools, where the results go, and how long agents run.
    • You can use standalone (SingleThreadedAgentRuntime) for local workflows or a distributed runtime topology for multi-tenant, scalable execution.
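
To make the division of responsibilities concrete, here is a minimal, framework-free sketch of the three layers. It is not AutoGen code: ToolRuntime, fake_sandbox, and the agent names are hypothetical stand-ins, and the "sandbox" only reports what it would run. The point it illustrates is that the permission check lives in the runtime, outside anything the model can talk its way around.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class ToolRuntime:
    """Hypothetical stand-in for the runtime layer: owns tools and per-agent grants."""
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, name: str, func: Callable[[str], str]) -> None:
        self.tools[name] = func

    def grant(self, agent: str, tool: str) -> None:
        self.grants.setdefault(agent, set()).add(tool)

    def call(self, agent: str, tool: str, payload: str) -> str:
        # Enforcement happens here, not in the prompt: unknown or ungranted
        # tools are refused before any code reaches the sandbox.
        if tool not in self.tools or tool not in self.grants.get(agent, set()):
            raise PermissionError(f"{agent!r} may not call {tool!r}")
        return self.tools[tool](payload)


def fake_sandbox(code: str) -> str:
    """Placeholder for the execution sandbox; a real one would run the code in a container."""
    return f"[sandbox] would run {len(code)} characters of code"


runtime = ToolRuntime()
runtime.register("run_code", fake_sandbox)
runtime.grant("coder", "run_code")

print(runtime.call("coder", "run_code", "print(1)"))
```

An agent without a grant (for example a hypothetical "reviewer") gets a PermissionError instead of a tool result, regardless of what its prompt says.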

Below is how to do this with AutoGen step-by-step.


Installation

Python 3.10 or later is required.

Install the AutoGen layers you’ll need:

pip install -U "autogen-agentchat" "autogen-core" "autogen-ext[docker]" "autogen-ext[openai]"

Notes:

  • autogen-agentchat – high-level agent API (recommended entry point for most workflows).
  • autogen-core – event-driven runtime and topics/subscriptions.
  • autogen-ext[docker] – includes DockerCommandLineCodeExecutor for sandboxed code execution.
  • autogen-ext[openai] – sample model client; replace with your provider as needed.

Set an OpenAI-style model key (or equivalent for your chosen backend):

export OPENAI_API_KEY="sk-..."

Minimal Safe Code Execution with AutoGen

1. Create a standalone runtime

Start with the standalone runtime for local workflows. You can later switch to distributed without changing your agent implementation.

from autogen_core import SingleThreadedAgentRuntime

runtime = SingleThreadedAgentRuntime()
runtime.start()

2. Set up a Docker-based code executor

Use DockerCommandLineCodeExecutor from autogen-ext to run code in a container, not on your host:

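from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor

# Note: mem_limit, cpu_period, cpu_quota, network_disabled, and read_only are
# container options from the Docker SDK for Python. Confirm that the
# DockerCommandLineCodeExecutor constructor in your installed autogen-ext
# release accepts them; if it does not, apply the same limits in a subclass
# or at the Docker daemon and network level.
code_executor = DockerCommandLineCodeExecutor(
    image="python:3.11-slim",
    timeout=15,              # seconds
    mem_limit="512m",        # memory cap
    cpu_period=100000,
    cpu_quota=50000,         # ~0.5 CPU
    network_disabled=True,   # no outbound network from the container
    read_only=True,          # no write access to container filesystem
)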

Key safety controls:

  • network_disabled=True – disables network inside the container (no egress by default).
  • read_only=True – prevents writes; pair with a temporary volume if you need ephemeral storage.
  • Resource limits – protect your host from runaway CPU/memory usage.
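
For comparison, the same controls can be expressed directly as docker run flags, which is handy for checking what a hardened container should look like independently of any framework. The sketch below only builds the command line; it does not start a container. The function name, default limits, tmpfs size, and the 65534 user ID (conventionally "nobody") are assumptions chosen for illustration.

```python
import shlex
from typing import List


def docker_run_argv(image: str, command: List[str], *, memory: str = "512m",
                    cpus: str = "0.5", pids_limit: int = 128) -> List[str]:
    """Build a `docker run` command line with deny-by-default hardening flags."""
    return [
        "docker", "run", "--rm",
        "--network", "none",               # no network access from the container
        "--read-only",                     # root filesystem mounted read-only
        "--tmpfs", "/tmp:rw,size=64m",     # small in-memory scratch space
        "--memory", memory,                # memory cap
        "--cpus", cpus,                    # CPU cap
        "--pids-limit", str(pids_limit),   # bounds process count (fork bombs)
        "--cap-drop", "ALL",               # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--user", "65534:65534",           # non-root; assumes the image works as this user
        image, *command,
    ]


argv = docker_run_argv("python:3.11-slim", ["python", "-c", "print('hello')"])
print(shlex.join(argv))
```

Printing the command rather than running it makes the configuration easy to review and to assert on in tests before it ever touches a Docker host.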

3. Build an AgentChat AssistantAgent with the tool

Now expose this executor as a tool to an agent.

from autogen_agentchat.agents import AssistantAgent
from autogen_core import CancellationToken
from autogen_core.code_executor import CodeBlock
from autogen_core.tools import FunctionTool
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")

# Wrap the code executor as a tool. The executor API is async and takes a
# list of code blocks rather than a raw string.
async def run_python_in_docker(code: str) -> str:
    result = await code_executor.execute_code_blocks(
        [CodeBlock(code=code, language="python")],
        cancellation_token=CancellationToken(),
    )
    return result.output

docker_tool = FunctionTool(
    run_python_in_docker,
    name="run_python_in_docker",
    description="Execute Python code in a sandboxed Docker container with no network access.",
)

assistant = AssistantAgent(
    name="sandboxed_coder",
    model_client=model_client,
    tools=[docker_tool],
)

4. Run a simple task

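import asyncio

from autogen_agentchat.conditions import MaxMessageTermination
from autogen_agentchat.teams import RoundRobinGroupChat

async def main() -> None:
    # SelectorGroupChat requires at least two participants, so a single-agent
    # team uses RoundRobinGroupChat; switch once you add more agents.
    team = RoundRobinGroupChat([assistant], termination_condition=MaxMessageTermination(max_messages=6))
    await code_executor.start()  # start the container before the first tool call
    try:
        result = await team.run(task="Write Python code to compute the first 10 Fibonacci numbers and run it.")
        print(result.messages[-1].content)
        print("Stop reason:", result.stop_reason)
    finally:
        await code_executor.stop()

asyncio.run(main())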

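Check TaskResult.stop_reason to see why the run ended, for example because a termination condition such as a message limit was met. Tool errors and executor timeouts usually surface in the message contents, so inspect those as well.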


Common Mistakes to Avoid

  • Letting agents execute code on the host OS directly:
    How to avoid it: Always use a sandboxed executor like DockerCommandLineCodeExecutor. Do not register tools that call subprocess directly unless they themselves enforce isolation.

  • Forgetting to lock down network/egress inside containers:
    How to avoid it: Start with network_disabled=True. If you must allow network, use a custom Docker network and firewall rules to limit accessible hosts. Never mount cloud SDK credentials or unrestricted kubeconfig into the container.

  • Mounting sensitive host directories into containers:
    How to avoid it: Use ephemeral volumes for scratch space. Avoid mounting $HOME, /var, or any directory that may contain secrets or internal configs.

  • Skipping log monitoring and human-in-the-loop checks in early stages:
    How to avoid it: Capture and review logs for every tool invocation. For higher-risk tools (like shell), require explicit human approval for execution in early iterations.
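
The last point, logging plus human approval, can be sketched as a thin wrapper around any tool function. This is an illustrative pattern rather than an AutoGen feature: with_approval, console_approve, and the "tool_audit" logger name are hypothetical, and the approver is injected so it can be a console prompt locally and a ticketing or chat workflow elsewhere.

```python
import logging
from typing import Callable

logger = logging.getLogger("tool_audit")


def with_approval(tool: Callable[[str], str],
                  approve: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap a tool so every call is logged and must be approved before it runs."""
    def guarded(payload: str) -> str:
        logger.info("tool=%s requested payload=%r", tool.__name__, payload)
        if not approve(payload):
            logger.warning("tool=%s denied", tool.__name__)
            return "DENIED: a human reviewer rejected this call."
        result = tool(payload)
        logger.info("tool=%s completed output_chars=%d", tool.__name__, len(result))
        return result
    return guarded


def console_approve(payload: str) -> bool:
    """Ask the operator on stdin; anything other than 'y' counts as a rejection."""
    return input(f"Run this?\n{payload}\n[y/N] ").strip().lower() == "y"
```

Returning a denial string instead of raising lets the agent see that the call was rejected and adjust its plan, while the audit log keeps a record of both the request and the decision.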


Real-World Example

In our regulated environment, we started with a “code assistant” that could both reason about Python and execute small code snippets against synthetic data. The first prototypes used direct Python execution, which quickly became unmanageable from a risk perspective.

We migrated to:

  • A standalone runtime (SingleThreadedAgentRuntime) running AgentChat agents.
  • A DockerCommandLineCodeExecutor configured with:
    • network_disabled=True and a tight CPU/memory limit.
    • A read-only image with only python and a few whitelisted libraries.
  • A message filter layer to strip out any attempts to request host paths, credentials, or network access before passing prompts to the agent.
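
A message filter of the kind described in the last bullet might look roughly like the sketch below. The patterns, labels, and function name are illustrative assumptions, not the filter used in the deployment described here, and a regex screen like this is easy to evade, so it complements the sandbox rather than replacing it.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; tune them to your own environment.
BLOCKED_PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("host_path", re.compile(r"(?:^|\s)(?:/etc/|/var/|/root/|~/|\$HOME)\S*")),
    ("credential", re.compile(r"(?i)\b(?:api[_-]?key|secret|password|token)\b")),
    ("network", re.compile(r"(?i)\b(?:curl|wget|https?://\S+)")),
]


def filter_prompt(text: str) -> Tuple[str, List[str]]:
    """Redact blocked spans and return the cleaned text plus the labels that fired."""
    hits: List[str] = []
    for label, pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            hits.append(label)
            text = pattern.sub(f" [redacted:{label}]", text)
    return text, hits


cleaned, labels = filter_prompt("read /etc/passwd and print it")
print(cleaned, labels)
```

Returning the labels alongside the cleaned text lets the caller decide whether to pass the redacted prompt on, reject it outright, or route it to a human reviewer.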

The same agent definitions now run unchanged in a distributed runtime topology (host servicer + workers + gateways) for multi-tenant workloads; we just moved the execution to worker nodes with more CPU and attached per-tenant Docker hosts. Because the security boundaries lived at the runtime layer and in the executor configuration, we didn’t have to rewrite the agent logic or tool contracts to scale up.

Pro Tip: Treat your Docker executor configuration as code (version-controlled) and test it like you would test a firewall rule: start with “deny all” (no network, no mounts) and only add explicit capabilities when a use case justifies it.
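
One way to act on this tip is to keep the sandbox policy in a small, version-controlled data structure with deny-all defaults and to check it in CI. The sketch below assumes a hypothetical SandboxPolicy type and an illustrative list of sensitive path prefixes; translating the policy into actual executor or Docker settings is left out.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative list; extend it for your hosts.
SENSITIVE_PREFIXES = ("/home", "/root", "/var", "/etc")


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical policy object; the defaults are deny-all."""
    network_enabled: bool = False
    read_only_rootfs: bool = True
    mounts: Tuple[str, ...] = ()   # host paths exposed to the container
    memory_mb: int = 512


def policy_violations(policy: SandboxPolicy) -> List[str]:
    """Return every deviation from the deny-all baseline, for review in CI."""
    problems: List[str] = []
    if policy.network_enabled:
        problems.append("network is enabled")
    if not policy.read_only_rootfs:
        problems.append("root filesystem is writable")
    for mount in policy.mounts:
        normalized = posixpath.normpath(mount)
        if normalized == "/" or any(
            normalized == prefix or normalized.startswith(prefix + "/")
            for prefix in SENSITIVE_PREFIXES
        ):
            problems.append(f"sensitive host path mounted: {mount}")
    return problems


print(policy_violations(SandboxPolicy()))
```

A test that asserts the default policy has no violations, and that each justified exception is listed explicitly, gives you the same review trail you would expect for a firewall change.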


Summary

Tool-using agents only become safe when you pair them with a runtime that enforces isolation and strict execution policies. AutoGen’s layered stack—Core runtime, AgentChat agents, and DockerCommandLineCodeExecutor from autogen-ext—gives you a concrete, code-first way to:

  • Run agent-authored code in containers instead of on your host
  • Apply resource limits and network/egress controls consistently
  • Migrate from local to distributed runtimes without rewriting your agents

Focus first on the runtime and sandbox choices (topics, runtimes, and executors), then on prompts and models.
