AutoGen distributed runtime: how do I set up a gRPC host + workers (GrpcWorkerAgentRuntime) and scale out safely?

Quick Answer: Use AutoGen Core’s distributed runtime with a gRPC host servicer and GrpcWorkerAgentRuntime workers: the host owns connections and routing, each worker runs agents and advertises them to the host. You scale safely by treating the host as your control plane, using topics/subscriptions instead of hard‑coded IDs, isolating tenants by topic source, and monitoring TaskResult(stop_reason=...) plus runtime logs for correctness.

Why This Matters

Single-process, single-agent prototypes don’t survive contact with real workloads: you need cross-machine agents, tenant isolation, and backpressure without rewriting your agents. AutoGen’s distributed runtime does this by separating “agent logic” from “where it runs,” using a host servicer + workers model so you can add capacity, swap runtimes, or segment tenants without touching business logic. If you set up gRPC workers and routing correctly, you can move from a SingleThreadedAgentRuntime on your laptop to a multi-node system that still uses the same Agent implementations.

Key Benefits:

  • Scale-out without rewriting agents: The same agent code runs on standalone or distributed runtimes; switching is a deployment decision, not a refactor.
  • Stronger safety and isolation: The host enforces communication and lifecycle boundaries so you can segment tenants and tools instead of relying on ad‑hoc agent IDs.
  • Operational observability: A distributed, event-driven runtime surfaces agent events, TaskResult outcomes, and routing behavior so you can debug throttling, failures, and message storms.

Core Concepts & Key Points

Concept | Definition | Why it's important
Distributed Agent Runtime | A runtime topology with a host servicer and one or more workers connected via gateways over gRPC. | Lets agents run across processes/machines/languages while preserving a uniform API.
GrpcWorkerAgentRuntime | A worker-side runtime implementation that connects agents to the host servicer via gRPC gateways. | This is the piece you scale out horizontally; each instance runs agents and advertises them to the host.
Topic / Subscription | Topic = (Topic Type, Topic Source) with string form Topic_Type/Topic_Source; subscriptions like TypeSubscription(topic_type="default", agent_type="triage_agent"). | Routing by topic + type keeps agent logic portable and makes it easier to isolate tenants and control message fan-out.

How It Works (Step-by-Step)

At a high level, the AutoGen distributed runtime splits responsibilities:

  • Host servicer (Core)

    • Maintains connections to workers via gRPC gateways.
    • Tracks which agents live where and which topics/subscriptions they handle.
    • Routes messages between agents using topics and subscriptions.
    • Manages lifecycle events (start, stop, reconnect).
  • Workers (Core + Extensions)

    • Use GrpcWorkerAgentRuntime to connect to the host.
    • Instantiate agents (AgentChat or Core agents) and tools locally.
    • Advertise those agents and their subscriptions back to the host.

Because the runtime API is consistent, you can start with a standalone runtime like SingleThreadedAgentRuntime and later change your bootstrap code to register agents in a GrpcWorkerAgentRuntime instead—your agent implementations don’t change.
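The split of responsibilities above can be pictured with a small toy model. This is only a sketch in plain Python, not AutoGen code: HostRegistry, advertise, and route are hypothetical names invented here to show the bookkeeping a host needs (which worker owns which agent type, and which agent types subscribe to which topic type).

```python
from dataclasses import dataclass, field


@dataclass
class HostRegistry:
    """Toy stand-in for the host's bookkeeping; not an AutoGen class."""

    # agent_type -> worker that advertised it
    agent_locations: dict[str, str] = field(default_factory=dict)
    # topic_type -> agent types subscribed to it
    subscriptions: dict[str, list[str]] = field(default_factory=dict)

    def advertise(self, worker_id: str, agent_type: str, topic_type: str) -> None:
        """Record that a worker hosts agent_type and wants topic_type messages."""
        self.agent_locations[agent_type] = worker_id
        self.subscriptions.setdefault(topic_type, []).append(agent_type)

    def route(self, topic_type: str) -> list[tuple[str, str]]:
        """Return (worker_id, agent_type) pairs that should receive a message."""
        return [
            (self.agent_locations[agent_type], agent_type)
            for agent_type in self.subscriptions.get(topic_type, [])
        ]


registry = HostRegistry()
registry.advertise("worker-1", "triage_agent", "triage")
registry.advertise("worker-2", "resolver_agent", "resolution")
registry.advertise("worker-2", "audit_agent", "triage")

triage_targets = registry.route("triage")    # two subscribers on two workers
unknown_targets = registry.route("billing")  # nobody subscribed
```

The point of the model is that senders only name a topic type; where the receiving agents live is the registry's concern, which is what lets you move agents between workers without touching sender code.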

1. Installation

You need Python 3.10+ and at least Core; AgentChat and Extensions are recommended for most applications:

pip install -U "autogen-core" "autogen-agentchat" "autogen-ext[openai,grpc]"

If you plan to use Azure OpenAI or other providers, add the relevant extras (for example, autogen-ext[azure]).

Note: The distributed runtime uses gRPC under the hood. Workers open the connection, so make sure each worker can reach the host on your chosen port, and secure that channel (mTLS, network policies, or both) in production.

2. Start a Host Servicer

In a minimal setup, the host is just a process that:

  • Creates the host service (GrpcWorkerAgentRuntimeHost in autogen-ext).
  • Opens a gRPC server endpoint for worker gateways.
  • Acts as a control plane only: the host service routes messages, and any agents you want in the same process go into a worker runtime started alongside it.

Conceptually:

# host.py
import asyncio

# The gRPC host service ships in autogen-ext (install the "grpc" extra).
# Names follow the autogen-ext gRPC runtime API; check the current reference.
from autogen_ext.runtimes.grpc import GrpcWorkerAgentRuntimeHost

HOST_ADDRESS = "0.0.0.0:50051"

async def main() -> None:
    # Create the host service that worker runtimes connect to.
    host = GrpcWorkerAgentRuntimeHost(address=HOST_ADDRESS)

    # start() launches the gRPC server in the background and returns immediately.
    host.start()

    # Keep the process alive until SIGINT/SIGTERM, then shut down gracefully.
    await host.stop_when_signal()

if __name__ == "__main__":
    asyncio.run(main())

Run this in your “control plane” environment:

python host.py

Note: Class names and constructor signatures may evolve; check the latest AutoGen docs for the exact host type and parameters. The topology (one host, multiple gRPC workers) remains the same.

3. Bring Up a GrpcWorkerAgentRuntime Worker

Each worker is a separate process (or container) that:

  • Connects to the host’s gRPC endpoint.
  • Starts a GrpcWorkerAgentRuntime.
  • Registers agents and their subscriptions.

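A minimal worker that hosts a single AgentChat AssistantAgent, wrapped in a Core RoutedAgent so the runtime can register and route to it, might look like:

# worker.py
import asyncio
import os
from dataclasses import dataclass

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import MessageContext, RoutedAgent, TypeSubscription, message_handler
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.runtimes.grpc import GrpcWorkerAgentRuntime

HOST_ADDRESS = os.environ.get("AUTOGEN_HOST_ADDRESS", "localhost:50051")
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # required


@dataclass
class SupportRequest:
    # Messages that cross the gRPC boundary must be serializable
    # (a dataclass or a Pydantic model).
    content: str


class SupportAssistant(RoutedAgent):
    """Core agent that delegates each request to an AgentChat AssistantAgent."""

    def __init__(self) -> None:
        super().__init__("Support assistant")
        # Model client from autogen-ext
        model_client = OpenAIChatCompletionClient(model="gpt-4o-mini", api_key=OPENAI_API_KEY)
        self._delegate = AssistantAgent("support_assistant", model_client=model_client)

    @message_handler
    async def handle_request(self, message: SupportRequest, ctx: MessageContext) -> None:
        response = await self._delegate.on_messages(
            [TextMessage(content=message.content, source="user")],
            ctx.cancellation_token,
        )
        print(f"{self.id}: {response.chat_message}")


async def main() -> None:
    # Worker runtime that connects back to the host via gRPC
    worker_runtime = GrpcWorkerAgentRuntime(host_address=HOST_ADDRESS)
    await worker_runtime.start()

    # Register the agent type with a factory (this advertises it to the host)
    await SupportAssistant.register(worker_runtime, "support_assistant", lambda: SupportAssistant())

    # Route topic type "default" (any source) to this agent type
    await worker_runtime.add_subscription(
        TypeSubscription(topic_type="default", agent_type="support_assistant")
    )

    # Process messages until SIGINT/SIGTERM, then disconnect.
    await worker_runtime.stop_when_signal()

if __name__ == "__main__":
    asyncio.run(main())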

Run this worker pointing at your host:

export OPENAI_API_KEY=sk-...
export AUTOGEN_HOST_ADDRESS="host-servicer.my-cluster:50051"
python worker.py

The worker uses GrpcWorkerAgentRuntime to register its agents with the host; the host now knows that messages destined for support_assistant (or its topics) should be routed to this worker.

Note: In a typical production setup, front-end services reach the runtime through an API layer rather than through test agents running on a worker. The above is intentionally minimal for testing.

4. Route Messages by Topic and Subscription

Core’s routing model is topic-based:

  • Topic = (Topic Type, Topic Source)

    • Example: (topic_type="default", topic_source="tenant_a")
    • String form: "default/tenant_a"
  • Subscriptions describe what messages an agent wants to receive. A common pattern is a TypeSubscription binding an agent type (or name) to a topic type.

Conceptually, the subscription looks like this:

from autogen_core import TypeSubscription

# Inside an async function, after the agent type has been registered.
# Messages with topic type "support" go to agents of type "support_assistant".
# The topic source (for example, the tenant) becomes the agent key,
# so each tenant gets its own agent instance.
await worker_runtime.add_subscription(
    TypeSubscription(topic_type="support", agent_type="support_assistant")
)

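On the sending side, you publish a message to a topic:

from dataclasses import dataclass

from autogen_core import TopicId

@dataclass
class SupportRequest:
    # Application-defined message type; it must match what the receiving agent handles.
    content: str

topic = TopicId(type="support", source="tenant_a")  # string form: "support/tenant_a"

# Inside an async function; the message is delivered to every matching subscription.
await worker_runtime.publish_message(
    SupportRequest(content="Help me troubleshoot a login error."),
    topic_id=topic,
)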

The host uses subscriptions to route this message to any worker that has a matching agent. This is why I strongly prefer topics and TypeSubscription over hard-coded agent IDs: it decouples your routing from where agents live and makes scale-out much less brittle.
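To make the decoupling concrete, here is a minimal sketch of how a type-based subscription can resolve a topic to a target agent. It is a plain-Python model, not the Core classes: Topic, TypeRule, and resolve are hypothetical names used only for illustration. It assumes, as I understand Core's type-based subscriptions, that the topic source is reused as the key of the receiving agent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Topic:
    """Illustrative topic: a (type, source) pair."""

    topic_type: str
    topic_source: str

    def __str__(self) -> str:
        return f"{self.topic_type}/{self.topic_source}"


@dataclass(frozen=True)
class TypeRule:
    """Illustrative type-based subscription rule (not the Core class)."""

    topic_type: str
    agent_type: str

    def resolve(self, topic: Topic) -> Optional[str]:
        """Return the target agent as 'agent_type/key', or None if no match."""
        if topic.topic_type != self.topic_type:
            return None
        # The topic source doubles as the agent key, so each tenant
        # is served by its own agent instance.
        return f"{self.agent_type}/{topic.topic_source}"


rule = TypeRule(topic_type="support", agent_type="support_assistant")
tenant_a = rule.resolve(Topic("support", "tenant_a"))
tenant_b = rule.resolve(Topic("support", "tenant_b"))
other = rule.resolve(Topic("billing", "tenant_a"))  # different topic type
```

Note that nothing in the rule names a worker or a machine. Two tenants resolve to two distinct agent instances of the same type, and the sender never needs to know where either one runs.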

5. Observe Outcomes with TaskResult

When you build higher-level workflows (AgentChat Teams, GraphFlow, or your own orchestrators), you’ll typically await a TaskResult:

from autogen_agentchat.base import TaskResult

# Inside an async function. `assistant` is an AgentChat agent (or team);
# run() executes the task and returns a TaskResult.
result: TaskResult = await assistant.run(task="Summarize this incident log.")

print(result.stop_reason)
for m in result.messages:
    print(m.source, ":", m.content)

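TaskResult(stop_reason=...) is your safety valve: it records why a run ended, which for teams is the termination condition that fired (for example a max-message limit, a timeout, or an external stop). A single agent's run may leave it unset, so treat a missing value as its own case. In a distributed runtime, always log stop_reason alongside the topic and tenant so you can correlate failures back to specific workers and workloads.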
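One way to follow that advice is to emit a single structured log line per finished task. The sketch below uses only the standard library; outcome_record and its field names are hypothetical, and the stop reason string "max_messages" is a made-up placeholder rather than a value AutoGen is known to produce.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("agent_outcomes")


def outcome_record(
    topic_type: str,
    topic_source: str,
    agent_type: str,
    stop_reason: Optional[str],
    message_count: int,
) -> str:
    """Build one JSON log line describing how a task ended."""
    return json.dumps(
        {
            "topic": f"{topic_type}/{topic_source}",
            "tenant": topic_source,
            "agent_type": agent_type,
            # A missing stop reason is logged explicitly rather than dropped.
            "stop_reason": stop_reason or "unspecified",
            "message_count": message_count,
        },
        sort_keys=True,
    )


# Placeholder values; in practice these come from the TaskResult and the topic.
line = outcome_record("support", "tenant_a", "support_assistant", "max_messages", 12)
logger.info(line)
```

Keeping the tenant as its own field (instead of only inside the topic string) makes it cheap to group failures per tenant in whatever log backend you use.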


Common Mistakes to Avoid

  • Treating the host like “just another worker”:
    The host is your control plane and routing authority; running heavy models or tools directly on the host complicates scaling and failover. Keep the host lightweight and stateless where possible and put CPU/GPU work on workers.

  • Hard-coding agent IDs instead of using topics/subscriptions:
    Directly addressing “agent X on worker Y” breaks as soon as you add workers or split tenants. Use topics (a topic type plus a topic source, expressed as TopicId in Core) together with TypeSubscription and let the host route; this makes multi-tenant isolation and resharding possible without changing code.


Real-World Example

In our regulated environment, we started with a SingleThreadedAgentRuntime hosting a simple triage + resolver pair of agents behind an internal API. That worked until we needed to:

  • Serve multiple business units with different compliance rules.
  • Run heavy retrieval and code-execution tools for only some tenants.
  • Isolate noisy tenants so they couldn’t starve critical workloads.

We introduced the distributed runtime as follows:

  1. Host servicer: Deployed on a small but highly available cluster with strict network ACLs. It doesn’t run any models; it only manages connections, topics, and lifecycle.
  2. Tenant-partitioned workers:
    • A “sensitive” worker pool with GrpcWorkerAgentRuntime connected to private data and a locked-down DockerCommandLineCodeExecutor.
    • A “standard” worker pool for general workflows with more aggressive concurrency.
  3. Topic-based routing:
    • Topic_Type expresses the workflow: "triage", "resolution", "reporting".
    • Topic_Source expresses the tenant: "tenant_a", "tenant_b".
    • We assign subscriptions per tenant and pool, e.g., sensitive triage agents subscribe to "triage/tenant_a" only on the sensitive worker pool.
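The partitioning scheme above can be written down as a small piece of deployment configuration. The following is a sketch only: TENANT_POOLS, topics_for_pool, and pool_for_topic are hypothetical names, and the assignment of tenant_b to the standard pool is an assumption made for the example.

```python
# Hypothetical deployment config: which worker pool serves which tenant.
TENANT_POOLS = {"tenant_a": "sensitive", "tenant_b": "standard"}
WORKFLOWS = ["triage", "resolution", "reporting"]


def topics_for_pool(pool: str) -> list[str]:
    """List the "topic_type/topic_source" strings a pool should subscribe to."""
    return [
        f"{workflow}/{tenant}"
        for tenant, assigned in sorted(TENANT_POOLS.items())
        for workflow in WORKFLOWS
        if assigned == pool
    ]


def pool_for_topic(topic: str) -> str:
    """Resolve a topic string to its pool; unknown tenants are rejected."""
    _, tenant = topic.split("/", 1)
    if tenant not in TENANT_POOLS:
        raise KeyError(f"no pool assigned for tenant {tenant!r}")
    return TENANT_POOLS[tenant]


sensitive_topics = topics_for_pool("sensitive")
standard_pool = pool_for_topic("triage/tenant_b")
```

Rejecting unknown tenants instead of falling back to a default pool is a deliberate choice in this sketch: in a regulated setting, a message for an unassigned tenant should fail loudly rather than land on whichever pool happens to be available.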

As usage grew, scaling was as simple as adding more GrpcWorkerAgentRuntime pods to the standard pool and letting them register the same agent types; the host automatically fanned out load using its routing logic.

Pro Tip: Before adding more workers, add structured logging on the host for (topic_type, topic_source, agent_type, stop_reason) and sample message sizes. You’ll catch two real problems early: oversized contexts causing latency and certain tenants generating pathologically long conversations.
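Sampling message sizes does not need much machinery. Here is a minimal sketch using only the standard library; SizeSampler, its methods, and the 100-byte limit are all hypothetical and chosen just to keep the example small.

```python
from collections import defaultdict
from statistics import median


class SizeSampler:
    """Collects message sizes per tenant and flags oversized traffic."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self._sizes: dict[str, list[int]] = defaultdict(list)

    def record(self, tenant: str, payload: str) -> None:
        """Store the UTF-8 size of one message payload."""
        self._sizes[tenant].append(len(payload.encode("utf-8")))

    def report(self) -> dict[str, dict[str, float]]:
        """Summarize count, median, and max size for each tenant."""
        return {
            tenant: {"count": len(sizes), "median": median(sizes), "max": max(sizes)}
            for tenant, sizes in self._sizes.items()
        }

    def oversized_tenants(self) -> list[str]:
        """Tenants whose largest message exceeded the limit."""
        return sorted(
            tenant for tenant, sizes in self._sizes.items() if max(sizes) > self.limit_bytes
        )


sampler = SizeSampler(limit_bytes=100)
sampler.record("tenant_a", "short question")
sampler.record("tenant_b", "x" * 500)
sampler.record("tenant_b", "y" * 20)
flagged = sampler.oversized_tenants()
```

In a real deployment you would record a sample of messages rather than every one, and export the report to your metrics system instead of keeping it in memory.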


Summary

AutoGen’s distributed runtime gives you a clean separation between agent logic and execution topology: a gRPC host servicer manages routing and lifecycle, and GrpcWorkerAgentRuntime workers host agents wherever you have capacity. To scale out safely:

  • Keep the host lightweight and treat it as your control plane.
  • Use topics and TypeSubscription for routing instead of hard-coded IDs.
  • Partition workers by tenant/risk, and rely on topic source as a first-class isolation boundary.
  • Monitor TaskResult(stop_reason=...) and runtime logs to catch routing and capacity issues early.

You can start locally with a SingleThreadedAgentRuntime, then upgrade to a distributed topology without rewriting agents—the migration is in your bootstrap code and deployment manifests, not in your business logic.
