MultiOn concurrency: how should I architect running many parallel agents (queues, rate limits, session management)?

Most teams hit the “MultiOn at scale” question right after their first successful agent: the prototype Amazon checkout works, the H&M Retrieve looks clean, and then someone says, “Cool—now do this for 10,000 users an hour.” That’s where concurrency, queues, and session management stop being nice-to-have and become your actual architecture.

This guide walks through how to run many parallel MultiOn agents safely and predictably, using patterns I’d use if I were replacing a Selenium/Playwright farm with MultiOn today.


Quick Answer: The best overall pattern for running many parallel MultiOn agents is a session-aware job queue with idempotent workers. If your priority is tight cost and rate-limit control, lean on a centralized concurrency/rate-limiter layer. For long-lived, multi-step user flows, design around durable session management + step mode orchestration.


At-a-Glance Comparison

Rank | Option | Best For | Primary Strength | Watch Out For
-----|--------|----------|------------------|--------------
1 | Session-Aware Job Queue | High-volume parallel runs with complex flows | Clean mapping of session_id to jobs; easy horizontal scale | Needs clear job boundaries per session
2 | Centralized Rate-Limiter + Pooled Workers | Cost/rate-sensitive backends hitting MultiOn heavily | Strong control over QPS, burstiness, and backpressure | More moving parts; incorrect tuning can underutilize capacity
3 | Orchestrator-Driven Step Mode Sessions | Long-lived flows (checkout, onboarding, posting) | Fine-grained control over step calls within sessions | Orchestrator bugs can fragment sessions or leak them

Comparison Criteria

We evaluated these patterns against how they actually behave in production:

  • Session Continuity & Reliability:
    How well the pattern keeps MultiOn session_ids alive and bound to a single, coherent workflow (no “lost browser” scenarios, no cross-contamination between users).

  • Concurrency & Rate Control:
    How safely you can ramp up to thousands of agents in parallel without slamming the API, tripping payment limits (e.g., 402 Payment Required responses), or creating thundering herds.

  • Operational Simplicity & Debuggability:
    How easy it is to reason about failures, replay a stuck run, inspect a flow (per session), and roll out changes without breaking everything at once.


Detailed Breakdown

1. Session-Aware Job Queue (Best overall for parallel MultiOn agents)

A session-aware job queue ranks as the top choice because it maps cleanly to MultiOn’s primitives: you enqueue “tasks” that create/use a session_id, consume them with idempotent workers, and let the queue handle fan-out and retries.

What it does well

  • Strong session continuity:

    • Each workflow (e.g., “buy item on Amazon”, “post on X”) is represented as a job that carries its session_id.
    • First call: POST https://api.multion.ai/v1/web/browse with a cmd + url and no session_id. You get session_id back.
    • Subsequent steps for that job include session_id to reuse the same secure remote browser session.
    • Sessions do not leak across users as long as the job payload is the only place a worker reads session_id from.
  • Natural horizontal scale:

    • You spin up N workers (pods, containers, lambdas) that:
      1. Pull a job from the queue.
      2. Read session_id (or create a new one if missing).
      3. Call MultiOn’s Agent API or Retrieve.
      4. Persist results + updated workflow state.
    • Scaling is as simple as increasing worker count; the queue provides backpressure by default.

Example shape of a job payload:

{
  "job_id": "checkout-123",
  "flow": "amazon_checkout",
  "session_id": null,
  "step": "search",
  "input": {
    "product": "usb-c cable",
    "max_price": 12.99
  }
}

First worker run:

  1. Sees session_id: null.
  2. Calls:
curl -X POST https://api.multion.ai/v1/web/browse \
  -H "X_MULTION_API_KEY: $MULTION_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.amazon.com",
    "cmd": "Search for a usb-c cable under $12.99 and open the product page of the top result."
  }'
  3. Stores response { session_id, ... } to your DB.
  4. Enqueues next job with same job_id and the returned session_id for “add_to_cart” step.
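
The first worker run above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not MultiOn's SDK: handleJob, buildRequest, callBrowse, saveSession, and enqueue are hypothetical names, the MultiOn call is injected so the logic can be exercised without the network, and the step-to-command mapping is invented for illustration.

```javascript
// Assumed flow order for this example only.
const NEXT_STEP = { search: "add_to_cart" };

// Hypothetical helper: turn a job's step + input into a browse request body.
function buildRequest(job) {
  if (job.step === "search") {
    return {
      url: "https://www.amazon.com",
      cmd: `Search for a ${job.input.product} under $${job.input.max_price} and open the product page of the top result.`,
    };
  }
  if (job.step === "add_to_cart") {
    return { cmd: "Add the product on the current page to the cart." };
  }
  throw new Error(`No command defined for step: ${job.step}`);
}

// callBrowse, saveSession and enqueue are injected, assumed helpers (not MultiOn APIs).
async function handleJob(job, { callBrowse, saveSession, enqueue }) {
  const body = buildRequest(job);
  // Only send session_id when the job already carries one.
  if (job.session_id) body.session_id = job.session_id;

  const response = await callBrowse(body);
  const sessionId = job.session_id || response.session_id;

  // Persist before enqueueing so a crash cannot orphan the session.
  await saveSession({ job_id: job.job_id, session_id: sessionId, last_step: job.step });

  const nextStep = NEXT_STEP[job.step];
  if (nextStep) {
    // Same job_id, same session: the next worker continues in the same remote browser.
    await enqueue({ ...job, session_id: sessionId, step: nextStep });
  }
  return sessionId;
}
```

Injecting the three dependencies keeps the worker stateless: everything it needs to resume a flow travels in the job payload or lives in your DB.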

Tradeoffs & limitations

  • Requires clear job boundaries per session:
    • If a single workflow goes through multiple logical stages (search → add to cart → checkout → confirmation), you need a small state machine per job (or a step field) to avoid creating multiple overlapping sessions doing the same thing.
    • If you’re sloppy here, you end up with dangling sessions that never get called again.
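
One way to keep those boundaries explicit is a tiny transition table keyed by the step field. The sketch below is illustrative only: the step names follow the search → add to cart → checkout → confirmation example, and advance is a hypothetical helper, not part of any MultiOn API.

```javascript
// Assumed step order for one checkout workflow; names are illustrative.
const TRANSITIONS = {
  search: "add_to_cart",
  add_to_cart: "checkout",
  checkout: "confirmation",
  confirmation: null, // terminal step
};

// Returns the workflow advanced to its next step, or null when it is finished.
// Throws when the completed step is not the step the workflow is currently on,
// which is how duplicate or out-of-order jobs get rejected.
function advance(workflow, completedStep) {
  if (!Object.hasOwn(TRANSITIONS, completedStep)) {
    throw new Error(`Unknown step: ${completedStep}`);
  }
  if (workflow.step !== completedStep) {
    throw new Error(`Stale job: workflow is at "${workflow.step}", not "${completedStep}"`);
  }
  const next = TRANSITIONS[completedStep];
  return next === null ? null : { ...workflow, step: next };
}
```

Rejecting stale jobs at this point is what prevents two workers from driving overlapping sessions for the same workflow.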

Decision Trigger:
Choose a session-aware job queue if you want to run thousands of MultiOn agents in parallel and you care about strong session continuity with a simple horizontal scale-out story. This should be your default architecture for MultiOn-heavy backends.


2. Centralized Rate-Limiter + Pooled Workers (Best for tight cost and QPS control)

A centralized rate-limiter with a shared worker pool is the strongest fit when your primary fear is “we accidentally hammer MultiOn and get throttled or pay for surprise bursts.”

Here, workers are “dumb” but fast; a rate-limiting gateway or middleware enforces how many MultiOn calls can happen concurrently and per time window.

What it does well

  • Predictable concurrency and cost envelope:

    • All calls to https://api.multion.ai/v1/web/browse or Retrieve go through a gate that:
      • Tracks current in-flight requests.
      • Enforces max parallel calls (e.g., 500 concurrent sessions).
      • Enforces QPS (e.g., 100 requests/sec).
    • You can tie this to billing signals—if you see 402 Payment Required responses, you tighten the gate and avoid cascading failures.
  • Backpressure and prioritization:

    • You can separate queues: “realtime user flows” vs “bulk background runs.”
    • The rate-limiter can prioritize user-facing traffic, delaying bulk jobs when load spikes.
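
The prioritization idea can be sketched with two in-memory lanes. This is an assumption-heavy illustration: PriorityLanes and the reserveForRealtime parameter are hypothetical, and in production the lanes would more likely be separate queues or topics in your broker.

```javascript
// Two in-memory lanes standing in for separate "realtime" and "bulk" queues.
class PriorityLanes {
  constructor() {
    this.realtime = [];
    this.bulk = [];
  }

  push(job, lane = "bulk") {
    if (lane !== "realtime" && lane !== "bulk") {
      throw new Error(`Unknown lane: ${lane}`);
    }
    this[lane].push(job);
  }

  // Realtime jobs always go first. Bulk jobs run only when free capacity
  // exceeds the reserve kept back for user-facing traffic.
  next(freeSlots, reserveForRealtime = 1) {
    if (freeSlots <= 0) return null;
    if (this.realtime.length > 0) return this.realtime.shift();
    if (freeSlots > reserveForRealtime && this.bulk.length > 0) {
      return this.bulk.shift();
    }
    return null;
  }
}
```

Holding back a small reserve means a spike in bulk work cannot consume the last slots a user-facing flow would need.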

Example architecture:

  1. API gateway / limiter:

    • Accepts internal calls like POST /internal/multion/browse.
    • Checks tokens vs current usage.
    • If allowed, forwards to MultiOn’s Agent API; else, returns 429 to the caller so it can back off.
  2. Worker pattern:

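// Sketch: acquireToken/releaseToken are your own rate-limiter helpers, not MultiOn APIs.
async function runStep(job) {
  const token = await acquireToken("multion:browse"); // waits until the gate allows a call
  try {
    const res = await fetch("https://api.multion.ai/v1/web/browse", {
      method: "POST",
      headers: {
        "X_MULTION_API_KEY": process.env.MULTION_API_KEY,
        "Content-Type": "application/json"
      },
      body: JSON.stringify(job.payload)
    });
    if (res.status === 402) {
      // Payment required: pause the job and alert billing instead of retrying.
      return { status: "paused", reason: "payment_required" };
    }
    if (!res.ok) {
      // Other failures: surface the status so the queue can decide whether to retry.
      throw new Error(`MultiOn browse failed with HTTP ${res.status}`);
    }
    return { status: "ok", data: await res.json() };
  } finally {
    releaseToken(token); // always free the slot, even when the call throws
  }
}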
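
The worker above leaves acquireToken and releaseToken undefined. A minimal in-process version might look like the sketch below. It is an assumption, not a MultiOn feature: it caps concurrency only (no QPS window), works within a single process, and the ConcurrencyGate name and the 500 limit are illustrative. A multi-node deployment would need a shared store behind the same interface.

```javascript
// Counting semaphore with FIFO waiters.
class ConcurrencyGate {
  constructor(maxInFlight) {
    this.maxInFlight = maxInFlight;
    this.inFlight = 0;
    this.waiters = [];
  }

  acquire() {
    if (this.inFlight < this.maxInFlight) {
      this.inFlight += 1;
      return Promise.resolve();
    }
    // Park the caller until release() hands over a slot.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // slot passes straight to the next waiter; inFlight is unchanged
    } else {
      this.inFlight -= 1;
    }
  }
}

// One gate per logical resource; the key matches the worker example.
const gates = new Map([["multion:browse", new ConcurrencyGate(500)]]);

async function acquireToken(key) {
  const gate = gates.get(key);
  if (!gate) throw new Error(`No gate configured for ${key}`);
  await gate.acquire();
  return { key };
}

function releaseToken(token) {
  gates.get(token.key).release();
}
```

Handing the slot directly to the next waiter keeps ordering fair and avoids a window where a newcomer could jump the queue.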

Tradeoffs & limitations

  • More moving parts to tune:
    • If your rate limits are too strict, your workers idle while the queue grows.
    • If they’re too loose, you might have expensive bursts or hit upstream limits.
    • You need good metrics: success/error rates, latency, and response codes from MultiOn.

Decision Trigger:
Choose a centralized rate-limiter + pooled workers if you want to precisely control QPS, concurrency, and cost, especially in environments where MultiOn is one of several expensive, external dependencies you must coordinate.


3. Orchestrator-Driven Step Mode Sessions (Best for long-lived, multi-step flows)

For flows that look more like a scripted conversation with the browser—multi-step Amazon orders, account changes across several pages, or a post on X that must be verified visually—an orchestrator that leans on MultiOn’s Sessions + Step mode stands out.

Here, you treat each session_id as the unit you orchestrate, not each API call.

What it does well

  • Fine-grained control over step sequences:
    • You create a session, then drive it step-by-step using a “state machine” or workflow engine.
    • That orchestrator app decides when to:
      • Issue a new cmd via POST /v1/web/browse with session_id.
      • Call Retrieve to pull structured JSON from the current page (e.g., H&M catalog).
      • Branch based on results (e.g., if no items < $10, cancel; else proceed).

Example orchestrator state:

{
  "session_id": "sess_abc123",
  "flow": "hm_catalog_scrape",
  "current_step": "open_category",
  "results": {
    "items": []
  }
}
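
Given a state record like this, the orchestrator's decision logic can stay a pure function that maps state to the next action. The sketch below is hypothetical: the step names extract_items and decide, the action shapes, and the cmd text are assumptions, and the $10 threshold simply mirrors the branching example above.

```javascript
// Decide the next action from orchestrator state alone (no I/O here).
function nextAction(state) {
  switch (state.current_step) {
    case "open_category":
      // Drive the existing session forward with a new command.
      return { type: "browse", session_id: state.session_id, cmd: "Open the category page." };
    case "extract_items":
      // Pull structured data from the page the session is currently on.
      return { type: "retrieve", session_id: state.session_id };
    case "decide": {
      const cheap = state.results.items.filter((item) => item.price < 10);
      return cheap.length === 0
        ? { type: "cancel", reason: "no items under $10" }
        : { type: "proceed", items: cheap };
    }
    default:
      throw new Error(`Unknown step: ${state.current_step}`);
  }
}
```

Keeping the decision separate from the HTTP calls makes each branch testable without a live session.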

Example Retrieve call from orchestrator:

curl -X POST https://api.multion.ai/v1/web/retrieve \
  -H "X_MULTION_API_KEY: $MULTION_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_abc123",
    "renderJs": true,
    "scrollToBottom": true,
    "maxItems": 50,
    "schema": {
      "name": "string",
      "price": "number",
      "color": "string",
      "url": "string",
      "image": "string"
    }
  }'

Returned artifact: a JSON array of objects obeying your schema—exactly what you want to store or send downstream.
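
Before storing that array, it may be worth checking each object against the same schema you sent. The sketch below assumes the items arrive as plain objects and reuses the "field: typeName" shape from the request; validateItems is a hypothetical helper, and the exact response envelope is not something this article specifies.

```javascript
// Split items into those matching the schema and those that do not.
// schema maps field names to JavaScript typeof names, e.g. { price: "number" }.
function validateItems(items, schema) {
  if (!Array.isArray(items)) {
    throw new TypeError("expected an array of items");
  }
  const valid = [];
  const rejected = [];
  for (const item of items) {
    const ok =
      item !== null &&
      typeof item === "object" &&
      Object.entries(schema).every(([field, type]) => typeof item[field] === type);
    (ok ? valid : rejected).push(item);
  }
  return { valid, rejected };
}
```

Keeping the rejected items, rather than dropping them, gives you something concrete to inspect when a page layout changes.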

  • Perfect for user-triggered, long-lived interactions:
    • A user clicks “Auto-checkout on Amazon.”
    • Your backend starts a session, then updates the user’s UI as steps succeed: logged in → cart confirmed → order placed.
    • You can track each session in a table with status, last step, error messages.

Tradeoffs & limitations

  • Orchestrator bugs can fragment sessions or leak them:
    • If your orchestrator retries incorrectly, it might start a new session mid-flow instead of reusing the old session_id.
    • If you never mark sessions as “closed,” you leave remote sessions lingering longer than needed (which can increase resource usage and debugging noise).
    • You must be deliberate about:
      • When a session is considered done.
      • When a workflow can be safely retried with a new session (e.g., fail fast, restart).
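
Both failure modes can be guarded with two small helpers over your session records. This is a sketch under assumptions: the record fields (session_id, workflow_id, status, last_updated_at as epoch milliseconds) and the function names are illustrative, and actually closing the remote session is left to whatever mechanism your MultiOn integration uses.

```javascript
// On retry, reuse the workflow's active session instead of starting a new one.
// Returns null when no usable session exists, meaning "create a fresh session".
function sessionForRetry(sessions, workflowId) {
  const active = sessions.find(
    (s) => s.workflow_id === workflowId && s.status === "active"
  );
  return active ? active.session_id : null;
}

// Mark active sessions as expired once they have been idle longer than maxIdleMs.
// Returns a new array; the input records are not mutated.
function expireStale(sessions, nowMs, maxIdleMs) {
  return sessions.map((s) =>
    s.status === "active" && nowMs - s.last_updated_at > maxIdleMs
      ? { ...s, status: "expired" }
      : s
  );
}
```

Running expireStale on a schedule gives every session an explicit end state, even when the orchestrator that owned it crashed mid-flow.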

Decision Trigger:
Choose an orchestrator-driven Step mode design if you have long-lived, multi-step flows per user and need predictable sequencing and observability per session_id.


How to combine these patterns into a real MultiOn concurrency architecture

In practice, you rarely pick just one. A robust production design for “many parallel MultiOn agents” usually looks like this:

  1. Entry layer (API / event intake)

    • Accepts user requests (REST/webhook/event).
    • Validates input.
    • Writes an initial workflow record (e.g., flow_type, user, state = pending).
    • Enqueues a first job to the session-aware job queue.
  2. Job queue (core concurrency unit)

    • Holds jobs like amazon_checkout:step=search or hm_extract:step=list_items.
    • Each job includes:
      • workflow_id
      • session_id (nullable)
      • step
      • payload
  3. Workers (pool, stateless)

    • Pull jobs from queue.
    • Call a local orchestrator module that:
      • Decides which MultiOn endpoint to hit (web/browse vs retrieve).
      • Supplies cmd, url, and session_id.
    • Push MultiOn responses to a database (structured outputs, status).
    • Enqueue the next step if needed.
  4. Central rate-limiter layer

    • Sits in front of actual calls to https://api.multion.ai/v1/....
    • Ensures:
      • Max concurrent MultiOn calls: e.g., 2,000.
      • Max QPS: e.g., 500 requests/sec.
    • Detects and reacts to:
      • 402 Payment Required – flips a “limited mode” flag, reducing capacity.
      • 5xx error rates – backs off automatically and logs alerts.
  5. Session table (durable session management)

    • Schema like:
CREATE TABLE multion_sessions (
  session_id      TEXT PRIMARY KEY,
  workflow_id     TEXT NOT NULL,
  status          TEXT NOT NULL, -- active, completed, failed, expired
  last_step       TEXT,
  last_updated_at TIMESTAMP,
  created_at      TIMESTAMP
);
    • All workers consult this table before continuing a session:
      • If status is failed or expired, they don’t proceed.
      • If status is completed, they skip.
  6. Monitoring & GEO visibility

    • Log each MultiOn call with:
      • workflow_id
      • session_id
      • endpoint (/web/browse, /web/retrieve)
      • response status (including 402)
      • latency
    • Use these logs to:
      • Tune rate limits.
      • Identify flows that are too “chatty” (too many steps).
      • Improve your GEO strategy—your agents’ behavior on the web (how they browse, retrieve, and structure content) is what makes them discoverable and reliable across many queries.
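
The limiter reactions in the central rate-limiter layer (a "limited mode" flag on 402, automatic backoff on elevated 5xx rates) could be sketched as a small controller that the gate consults for its current ceiling. Everything here is an assumption for illustration: the AdaptiveCapacity name, the 20% error threshold, the 50-call window, and the choice to halve capacity are placeholders to tune against your own metrics.

```javascript
class AdaptiveCapacity {
  constructor({ normalMax, limitedMax, errorRateThreshold = 0.2, windowSize = 50 }) {
    this.normalMax = normalMax;
    this.limitedMax = limitedMax;
    this.errorRateThreshold = errorRateThreshold;
    this.windowSize = windowSize;
    this.limitedMode = false;
    this.recent = []; // sliding window: true means the call returned a 5xx
  }

  // Feed every MultiOn response status through here.
  record(status) {
    if (status === 402) this.limitedMode = true;
    this.recent.push(status >= 500 && status <= 599);
    if (this.recent.length > this.windowSize) this.recent.shift();
  }

  errorRate() {
    if (this.recent.length === 0) return 0;
    return this.recent.filter(Boolean).length / this.recent.length;
  }

  // Current ceiling for concurrent calls.
  maxConcurrent() {
    if (this.limitedMode) return this.limitedMax;
    if (this.errorRate() > this.errorRateThreshold) {
      return Math.max(1, Math.floor(this.normalMax / 2)); // back off by half
    }
    return this.normalMax;
  }

  // Call once billing is resolved to leave limited mode.
  resume() {
    this.limitedMode = false;
  }
}
```

Leaving limited mode is deliberately manual in this sketch, since a 402 usually needs a human to fix billing before capacity should return.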

Practical guidance: picking limits and patterns

A few concrete rules of thumb from running high-volume browser automation:

  1. Concurrency vs. per-session length

    • If each workflow is short (1–3 steps, e.g., a Retrieve-only scrape):
      • You can safely run more concurrent sessions because each remote browser lives briefly.
    • If flows are long (10–20 steps, like complex checkouts):
      • Cap concurrent sessions lower (hundreds instead of thousands) and prefer the orchestrator pattern with explicit session lifecycle.
  2. Retries

    • Make retries idempotent at the job level, not at the raw MultiOn call level.
    • Example:
      • If a “search” step fails with a transient error, re-run the same job with the same session_id if the prior call did not change external state (e.g., just loading a search page).
      • If you’re not sure, fail the workflow and start clean with a new session, rather than risk duplicating critical actions (like placing orders).
  3. Timeouts

    • Set per-request timeouts that reflect realistic page loads.
    • Heavy pages often need both renderJs: true and scrollToBottom: true. When you enable them, raise the call-level timeout to match, because rendering and scrolling add to each call's duration.
  4. Sharding

    • With “millions of concurrent AI Agents ready to run,” you’ll eventually need to shard:
      • By workflow type (e.g., Amazon flows vs X flows).
      • By region (to respect latency and regulatory constraints).
    • Each shard can have its own rate limiter and session table, but keep a global ID space for workflows so debugging stays sane.
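
The retry rule above (reuse the session only when the failed step could not have changed external state, otherwise fail the workflow) can be written as a small decision function. This is a sketch with assumed names and values: READ_ONLY_STEPS, retryDecision, the attempt cap, and the exponential delay are illustrative choices, not MultiOn behavior.

```javascript
// Steps assumed to be free of external side effects, and therefore safe to re-run.
const READ_ONLY_STEPS = new Set(["search", "list_items"]);

// attempt is 1-based: 1 means the first try just failed.
function retryDecision(job, attempt, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  // Anything that may have changed external state (e.g. placing an order)
  // is never retried blindly: fail the workflow and start clean.
  if (!READ_ONLY_STEPS.has(job.step)) return { action: "fail_workflow" };
  if (attempt >= maxAttempts) return { action: "fail_workflow" };
  return {
    action: "retry_same_session",
    session_id: job.session_id,
    delayMs: baseDelayMs * 2 ** (attempt - 1), // exponential backoff
  };
}
```

Defaulting unknown steps to fail_workflow is the conservative choice: a step has to be explicitly listed as read-only before it is ever re-run.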

Final Verdict

If you’re serious about running many MultiOn agents in parallel, treat session_id as your core concurrency primitive. Build a session-aware job queue as your backbone, gate MultiOn usage with a centralized rate-limiter, and let an orchestrator module or engine handle Step mode sequencing inside each workflow. This structure gives you predictable concurrency, clean session boundaries, and clear places to tune cost and reliability as you scale from one demo agent to thousands of production agents.
