We built a demo agent that works locally—what usually breaks when you roll it out to real users?

Most agent projects look great in a local demo and then fall apart the moment you hand them to real users. The model still reasons fine—but auth flows, permissions, and tools start to buckle as soon as you leave the happy path.

Quick Answer: Most “it worked in the demo” agent failures come from auth and identity (service accounts vs real users), brittle tool schemas, missing permission gates, and zero governance once the agent starts taking actions in Gmail, Calendar, Slack, GitHub, or Salesforce. The model isn’t the problem—the runtime between AI and action is.

Frequently Asked Questions

What usually breaks when we move an agent from a local demo to real users?

Short Answer: Auth, permissions, and tools—not the model—are what usually break when you scale a local agent demo to real users.

Expanded Explanation:
In a local demo, you can get away with a single API key, a handful of hand-crafted prompts, and a service account that has god-mode access to everything. Once you onboard real users, that setup falls apart. You suddenly need to handle OAuth flows, per-user permissions, token refresh, rate limits, and wildly different edge cases in each tool call.

What actually fails in production isn’t the LLM’s reasoning—it’s the infrastructure around it: brittle service-account bots, tokens that silently expire, tools with vague schemas the model can’t reliably use, and no way for security or IT to see what the agent did on behalf of which user. That’s the gap an MCP runtime like Arcade is designed to fill: the runtime between AI and action, with user-specific auth, agent-optimized tools, and governance built in.

Key Takeaways:

  • The model is rarely the bottleneck; auth, permissions, and tool reliability are.
  • Service-account bots, ad-hoc OAuth, and thin API wrappers don’t survive real multi-user usage.

What are the main failure modes when we try to roll an agent out to real users?

Short Answer: The big failure modes are broken auth/identity, permission mismatches, flaky tool calls, and zero observability or governance once agents start acting in real systems.

Expanded Explanation:
In a local environment, you’re typically using your own credentials, a single tenant, and known data. You control the inputs, so everything looks clean. In the real world, you need the agent to act for hundreds or thousands of users, each with different scopes, org structures, and security policies.

Common failure modes:

  • Service accounts pretending to be users. Your demo uses a single Google service account or Slack bot token with broad scopes. In production, that’s a non-starter: it can read mailboxes and channels it shouldn’t, it bypasses data loss prevention (DLP) controls, and it doesn’t match any actual user’s permissions.
  • Broken OAuth and refresh token handling. Tokens expire, consent is revoked, tenants get reconfigured. If your agent stack owns OAuth itself, you end up debugging 401s and refresh logic instead of building features.
  • Vague, underspecified tools. “Do everything” tools with fuzzy schemas confuse the model. You get unpredictable calls, wrong parameters, and timeouts against APIs like Gmail, Calendar, or Salesforce.
  • No permission gates or audit trail. Security teams see a black box that can send email and change records without clear authorization or logs. That’s where rollouts get blocked.

Arcade’s MCP runtime addresses these directly: agents act with user-specific permissions (not service accounts), tools are designed and evaluated for agent reliability, and every action runs through a governed runtime with audit logs and RBAC.
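
To make the "vague, underspecified tools" failure mode concrete, here is a minimal sketch of a narrowly scoped tool schema plus a validator that rejects malformed calls before they reach an API. The schema layout, the `send_email` tool name, and the `validate_args` helper are all hypothetical, chosen for illustration; they are not Arcade's or any toolkit's actual format.

```python
from typing import Any

# Hypothetical schema for a narrowly scoped "send email" tool.
# Field names and layout are illustrative, not taken from a real toolkit.
SEND_EMAIL_SCHEMA = {
    "name": "send_email",
    "description": "Send one plain-text email from the signed-in user's mailbox.",
    "parameters": {
        "recipient": {"type": str, "required": True},
        "subject": {"type": str, "required": True},
        "body": {"type": str, "required": True},
        "cc": {"type": list, "required": False},
    },
}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    params = schema["parameters"]
    errors = []
    # Check every declared parameter for presence and type.
    for name, spec in params.items():
        if name not in args:
            if spec["required"]:
                errors.append(f"missing required parameter: {name}")
            continue
        if not isinstance(args[name], spec["type"]):
            errors.append(f"{name} must be of type {spec['type'].__name__}")
    # Reject parameters the model invented.
    for name in args:
        if name not in params:
            errors.append(f"unknown parameter: {name}")
    return errors
```

Returning the error list to the model, rather than forwarding a bad call to Gmail or Salesforce, gives it a specific message to correct against instead of an opaque API failure or timeout.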

Steps:

  1. Map identity and auth flows: Decide how the agent will authenticate users (existing OAuth and IdP flows, SSO/SAML) and how user accounts are provisioned and deprovisioned (SCIM).
  2. Replace service accounts with per-user auth: Use scoped OAuth and just-in-time authorization so the agent acts as the real user, not a shared bot.
  3. Harden tools for agent use: Move from thin API wrappers to agent-optimized tools with clear schemas, well-defined side effects, and consistent error handling—then execute them through a runtime like Arcade.
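
Step 2 can be sketched as a per-user token store that refreshes expiring tokens and signals when a user has to re-consent. This is an in-memory illustration under stated assumptions: `PerUserTokenStore`, `UserToken`, `ReauthorizationRequired`, and the `refresh_fn` callback are hypothetical names, and a real deployment would use an encrypted, persistent store behind the runtime.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class ReauthorizationRequired(Exception):
    """Raised when the user has to go through the consent flow again."""


@dataclass
class UserToken:
    access_token: str
    refresh_token: Optional[str]
    expires_at: float  # Unix timestamp in seconds


class PerUserTokenStore:
    """In-memory sketch; a real store would be encrypted and persistent."""

    def __init__(self, refresh_fn: Callable[[str], UserToken], skew_seconds: float = 60.0):
        self._tokens: dict[str, UserToken] = {}
        # Hypothetical callback: exchanges a refresh token for a new UserToken.
        self._refresh_fn = refresh_fn
        # Refresh slightly early so a token does not expire mid-request.
        self._skew = skew_seconds

    def save(self, user_id: str, token: UserToken) -> None:
        self._tokens[user_id] = token

    def get_access_token(self, user_id: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = self._tokens.get(user_id)
        if token is None:
            raise ReauthorizationRequired(f"{user_id} has not connected an account")
        if token.expires_at - self._skew > now:
            return token.access_token
        if token.refresh_token is None:
            raise ReauthorizationRequired(f"{user_id} must re-consent: no refresh token")
        refreshed = self._refresh_fn(token.refresh_token)
        self._tokens[user_id] = refreshed
        return refreshed.access_token
```

Keying every lookup by `user_id` is the point of the design: there is no shared credential to fall back on, so a missing or revoked grant surfaces as an explicit re-authorization prompt rather than a silent 401.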

What’s the difference between a local demo agent and a production-ready multi-user agent?

Short Answer: A local demo is single-user, service-account, and prompt-driven; a production-ready agent is multi-user, permissioned, and runs through a dedicated MCP runtime with real auth and governance.

Expanded Explanation:
A local demo is optimized for “wow” in a controlled environment:

  • Single set of credentials hard-coded or stored in .env.
  • One-off API wrappers where the model “figures it out.”
  • No separation between experimentation and execution.

A production agent has to survive chaos:

  • Hundreds of users, each with different domains, groups, and scopes.
  • Multiple business systems (Gmail, Google Calendar, Slack, GitHub, HubSpot, Salesforce, Linear) each with their own rate limits and quirks.
  • Security and compliance teams who need tenant isolation, SSO/SAML, RBAC, and audit logs.

That’s where an MCP runtime like Arcade matters. Instead of baking all this into your app, you let the runtime handle OAuth, token storage/refresh, permissioning, and tool execution. Your agent “asks” to run tools like Google.SendEmail or Google.CreateEvent, and Arcade handles the safe execution with user-specific permissions and zero token exposure to the LLM.
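
The "zero token exposure" idea can be illustrated with a small sketch in which the model proposes only a tool name and arguments, and the runtime resolves the user's credential at execution time. Everything here is a hypothetical stand-in: `USER_TOKENS`, `TOKEN_OWNERS`, `send_email`, and `execute_tool_call` are not Arcade APIs, and the fake `send_email` only mimics an API that derives the sender from the token.

```python
from typing import Any, Callable

# Hypothetical per-user credentials; in practice these live in the runtime's
# token store, never in application code or in the model's context.
USER_TOKENS = {"alice": "token-for-alice", "bob": "token-for-bob"}

# Stand-in for the remote API's view of which account each token belongs to.
TOKEN_OWNERS = {
    "token-for-alice": "alice@example.com",
    "token-for-bob": "bob@example.com",
}


def send_email(token: str, recipient: str, subject: str) -> dict[str, Any]:
    """Fake API call: the sender identity is derived from the token."""
    return {"from": TOKEN_OWNERS[token], "to": recipient, "subject": subject}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {"send_email": send_email}


def execute_tool_call(user_id: str, tool_call: dict[str, Any]) -> dict[str, Any]:
    """Run a model-proposed tool call, injecting the user's token at execution time."""
    tool = TOOLS[tool_call["name"]]
    # user_id comes from the authenticated session, not from model output.
    token = USER_TOKENS[user_id]
    return tool(token, **tool_call["arguments"])


# What the model produces: a tool name and arguments, no credentials.
proposed_call = {
    "name": "send_email",
    "arguments": {"recipient": "carol@example.com", "subject": "Q3 plan"},
}
```

The same `proposed_call` executed for `"alice"` and for `"bob"` sends from two different mailboxes, which is the behavior a shared service account cannot give you.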

Comparison Snapshot:

  • Option A: Local demo agent
    • Hard-coded tokens, service account, thin API wrappers, no governance.
  • Option B: Production-ready agent on an MCP runtime
    • Per-user auth, scoped OAuth and IDP flows, agent-optimized tools, observability and control.
  • Best for: Option B fits real-world agents acting across Gmail, Calendar, Slack, GitHub, Salesforce, and more, where security and reliability matter as much as model quality. Option A fits single-user prototypes only.

How do we actually harden our demo agent so it doesn’t break at rollout?

Short Answer: You harden the agent by externalizing auth into a runtime, upgrading tools from API wrappers to agent-optimized MCP tools, and adding permission gates and observability before you scale users.

Expanded Explanation:
The path from “cool demo” to “production agent” is mostly about replacing implicit assumptions with explicit infrastructure. Rather than letting the LLM “decide” everything via prompts, you enforce permissions and tool boundaries in code.

With Arcade, the process looks like:

  • You keep your agent logic where you like (Cursor, Claude, LangGraph, custom stack).
  • You integrate the Arcade SDK to handle auth.start(...) and wait_for_completion flows so real users connect their Google, Slack, or Salesforce accounts.
  • Your agent uses MCP tools like Gmail.ListEmails, Google.CreateEvent, or Slack.PostMessage exposed by Arcade, not raw API calls. Credentials and tokens never pass through the model—Arcade injects them at execution time.
  • Every tool call goes through permission gates and is logged for audit.

This lets you ship multi-user agents that send email, schedule meetings, and update CRMs with the right identity model and governance from day one.
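
A permission gate with an audit trail, as described in the last bullet above, might look roughly like the following. This is a sketch under assumptions, not Arcade's implementation: the `GovernedRuntime` class, the `ROLE_PERMISSIONS` table, and the tool names are hypothetical, and a production audit log would go to durable storage and redact sensitive arguments.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

# Hypothetical role-to-tool mapping (a very small RBAC table).
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "sales": {"crm.update_record", "email.send"},
    "viewer": set(),
}


class PermissionDenied(Exception):
    """Raised when a user's role does not allow the requested tool."""


@dataclass
class GovernedRuntime:
    tools: dict[str, Callable[..., Any]]
    user_roles: dict[str, str]
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def execute(self, user_id: str, tool_name: str, arguments: dict[str, Any]) -> Any:
        role = self.user_roles.get(user_id)
        allowed = tool_name in ROLE_PERMISSIONS.get(role, set())
        # Log every attempt, allowed or not, before anything runs.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "tool": tool_name,
            "arguments": arguments,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionDenied(f"{user_id} ({role}) may not call {tool_name}")
        return self.tools[tool_name](**arguments)
```

Because the check lives in `execute` rather than in the system prompt, a model that is persuaded to attempt a disallowed action still cannot perform it, and the denied attempt is recorded for security review.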

What You Need:

  • A runtime layer (like Arcade) between the model and your business systems to handle OAuth, token management, and scoped execution.
  • Agent-optimized MCP tools with clear schemas instead of ad-hoc API wrappers, plus permission gates and logging on every action.

Strategically, what should we fix first so our agents don’t stall after the demo?

Short Answer: Fix authorization and tools first: move to user-specific permissions and an MCP runtime, then invest in clear, reliable tool schemas before you tune prompts or swap models.

Expanded Explanation:
If you’ve already proven the model can reason, the next step isn’t to try a bigger model; it’s to fix the missing runtime layer. Most enterprise agent projects stall because they try to scale on a foundation of service accounts, bespoke OAuth handling, and thin wrappers around APIs. That’s like building a Ferrari that can only drive in your driveway.

Strategically, the highest-leverage shifts are:

  • Enforce authorization in code, not prompts. Use permission gates, scoped OAuth, RBAC, and audit trails so you don’t rely on “please don’t do X” in the system prompt.
  • Invest in tools, not just prompts. Our own benchmarks across models show the same pattern: vague tool schemas make every model fail; descriptive, agent-optimized tools make every model pass. Your agents are only as good as your tools.
  • Standardize on an MCP runtime. This gives you a consistent way to plug in tools, manage auth, and deploy across environments (cloud, VPC, on-prem, air-gapped) without reinventing the stack per agent.

This is exactly the layer Arcade focuses on: Secure Agent Authorization plus Agent-Optimized Tools, wrapped in an MCP runtime your security team can live with and your developers won’t have to babysit at 2 a.m.

Why It Matters:

  • Impact 1: You turn one-off demos into production agents that can safely take real actions—sending email, scheduling meetings, updating CRMs—at scale.
  • Impact 2: You avoid a rewrite later by getting auth, tools, and governance right up front, so new agents and use cases can reuse the same runtime instead of rebuilding OAuth and permissions.

Quick Recap

Local demo agents usually “break” not because the LLM stops working, but because the infrastructure around it is missing: user-specific auth, scoped permissions, reliable tools, and governance. Service accounts, ad-hoc OAuth, and thin API wrappers are fine for a one-user demo, but they don’t survive real multi-user rollouts. An MCP runtime like Arcade sits between AI and action so your agents can take safe, auditable actions across Gmail, Calendar, Slack, GitHub, Salesforce, and more—using real user permissions and agent-optimized tools.
