
How can I avoid locking my product into one LLM provider while still supporting fallbacks when a model is down?
Most teams hit the same wall: you ship your first AI feature on one LLM provider, it works, usage grows—and then you realize you’re locked into that provider’s quirks, pricing, and downtime. You want the freedom to route across providers and still have a safety net when a model is slow or down, without rewriting your app every time you switch.
Quick Answer: Abstract your LLM calls behind a provider‑agnostic model router with a stable interface (e.g., Mastra’s ModelRouterEmbeddingModel pattern for embeddings, and the same idea for chat) so your agents address models by logical ID, not through vendor SDKs. The router accepts a provider/model string or config and handles API keys, retries, and failover for you. Define fallback policies (by latency, error, or cost), keep prompts and tool schemas provider‑neutral, and centralize keys and config so you can flip providers without touching product code.
In practice, this means your product never calls “OpenAI’s SDK” or “Google’s SDK” directly. Instead, everything goes through a router object—Mastra’s ModelRouterEmbeddingModel is a concrete example on the embedding side:
    import { Agent } from '@mastra/core/agent';
    import { Memory } from '@mastra/memory';
    import { ModelRouterEmbeddingModel } from '@mastra/core/llm';

    // Other Agent options are omitted here for brevity.
    const agent = new Agent({
      memory: new Memory({
        // "provider/model" string: the router resolves the provider and its API key.
        embedder: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
      }),
    });
Here the agent only knows “I have an embedder.” The router decides which provider to hit, what API key to use, and how to handle errors. You keep that same pattern for chat/completion models: a single, stable interface; interchangeable backends. When a provider is down or too expensive, you update routing configuration, not your product logic.
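The same idea can be sketched for chat models. This is a minimal illustration, not a Mastra API: the `ChatModel` interface, the `ModelRegistry` class, and the `fakeBackend` helper are all hypothetical names introduced here. Call sites ask for a logical ID, and the registry decides which backend answers.

```typescript
// Hypothetical interface: the only thing product code is allowed to see.
interface ChatModel {
  readonly id: string; // concrete "provider/model" string behind the logical ID
  complete(prompt: string): Promise<string>;
}

// Hypothetical registry mapping logical IDs to concrete backends.
class ModelRegistry {
  private readonly models = new Map<string, ChatModel>();

  register(logicalId: string, model: ChatModel): void {
    this.models.set(logicalId, model);
  }

  resolve(logicalId: string): ChatModel {
    const model = this.models.get(logicalId);
    if (!model) throw new Error(`No model registered for "${logicalId}"`);
    return model;
  }
}

// Stand-in backend for illustration; a real one would wrap a provider call.
function fakeBackend(id: string): ChatModel {
  return { id, complete: async (prompt) => `[${id}] ${prompt}` };
}

const registry = new ModelRegistry();
registry.register('chat-default', fakeBackend('openai/gpt-4.1-mini'));

// Switching providers is a single registration change; call sites stay the same.
registry.register('chat-default', fakeBackend('google/gemini-1.5-pro'));
```

Product code would only ever call `registry.resolve('chat-default').complete(...)`, so the vendor name never appears outside the registration lines.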
Key Takeaways:
- Never bind your app directly to a provider SDK; always route through a provider‑agnostic layer.
- Use logical model identifiers and a router object so fallbacks and provider switches are configuration changes, not refactors.
How do I set up model routing with fallbacks in a real product?
Short Answer: Treat routing as infrastructure: define a primary model and one or more backup models in config, then let a router choose the best available option at runtime based on health, latency, and cost.
Expanded Explanation:
You want the same mental model as a load balancer or database failover—just for LLMs. At the call site, your agents or workflows ask for “the reasoning model” or “the fast draft model.” The router maps that logical role to concrete provider/model pairs, tracks failures, and applies policies:
- Try primary; if it times out or fails with known transient errors, retry once.
- If it still fails, fall back to a secondary model (possibly from a different provider).
- Optionally degrade gracefully, e.g., use a cheaper or smaller model when the primary is overloaded.
Mastra’s model router pattern for embeddings already centralizes provider selection and API‑key handling, and you can structure your chat/completion models the same way. The important part is to keep the routing behavior in one place so you don’t spread fallback logic across dozens of agent calls.
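The retry-then-fall-back policy listed above could look roughly like the following. It is a sketch under stated assumptions: `RoutedModel`, `transientError`, `isTransient`, `withTimeout`, and `callWithFallback` are hypothetical names, and the backend wrapper is assumed to mark retryable failures as transient. A timeout counts as transient; any other error skips the retry and moves straight to the next model.

```typescript
// Hypothetical model shape; `call` stands in for a real provider request.
interface RoutedModel {
  id: string;
  call(prompt: string): Promise<string>;
}

// Assumption: backend wrappers tag retryable failures with this error name.
function transientError(message: string): Error {
  const error = new Error(message);
  error.name = 'TransientError';
  return error;
}

function isTransient(error: unknown): boolean {
  return error instanceof Error && error.name === 'TransientError';
}

// Reject with a transient error if the work takes longer than `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(transientError(`timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Try each model in order; each model gets one retry on a transient error.
async function callWithFallback(
  models: RoutedModel[],
  prompt: string,
  timeoutMs = 10_000,
): Promise<{ servedBy: string; text: string }> {
  const failures: string[] = [];
  for (const model of models) {
    for (let attempt = 1; attempt <= 2; attempt++) {
      try {
        const text = await withTimeout(model.call(prompt), timeoutMs);
        return { servedBy: model.id, text };
      } catch (error) {
        failures.push(`${model.id} attempt ${attempt}: ${String(error)}`);
        if (!isTransient(error)) break; // non-transient: move on to the next model
      }
    }
  }
  throw new Error(`All models failed:\n${failures.join('\n')}`);
}
```

Returning `servedBy` alongside the text keeps the fallback visible to callers and logs instead of hiding it inside the router.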
Steps:
- Define model roles and configs
  Decide on logical roles ("chat-default", "chat-reasoning", "embedding-default") and map them to provider models (e.g., 'openai/gpt-4.1-mini', 'google/gemini-1.5-pro').
- Instantiate router‑backed models
  Use a router abstraction (like ModelRouterEmbeddingModel) that accepts providerId and modelId and auto‑detects API keys from env (OPENAI_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY, OPENROUTER_API_KEY).

      const embedder = new ModelRouterEmbeddingModel({
        providerId: 'openrouter',
        modelId: 'openai/text-embedding-3-small',
      });

- Implement fallback policy
  In your routing layer, catch errors/timeouts from the primary model and try the next one in your list. Log which provider you ended up using so you can debug and tune routing later.
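One way to express the role configuration from these steps is a plain lookup table. The sketch below is illustrative only: `modelRoles`, `apiKeyEnvVars`, `parseModelString`, and `usableCandidates` are hypothetical names, and the environment is passed in as an argument so the logic can be exercised without real keys. The role names, model IDs, and environment variable names are the ones mentioned above.

```typescript
// Logical roles mapped to an ordered list of "provider/model" candidates.
const modelRoles: Record<string, string[]> = {
  'chat-default': ['openai/gpt-4.1-mini', 'google/gemini-1.5-pro'],
  'embedding-default': ['openai/text-embedding-3-small'],
};

// Environment variable expected for each provider.
const apiKeyEnvVars: Record<string, string> = {
  openai: 'OPENAI_API_KEY',
  google: 'GOOGLE_GENERATIVE_AI_API_KEY',
  openrouter: 'OPENROUTER_API_KEY',
};

// Split at the first slash only, so a model ID that itself contains a slash
// (as in the OpenRouter example above) stays intact.
function parseModelString(value: string): { providerId: string; modelId: string } {
  const slash = value.indexOf('/');
  if (slash <= 0 || slash === value.length - 1) {
    throw new Error(`Expected "provider/model", got "${value}"`);
  }
  return { providerId: value.slice(0, slash), modelId: value.slice(slash + 1) };
}

// Keep only the candidates whose provider has an API key set in `env`.
function usableCandidates(
  role: string,
  env: Record<string, string | undefined>,
): string[] {
  const candidates = modelRoles[role] ?? [];
  return candidates.filter((candidate) => {
    const { providerId } = parseModelString(candidate);
    const envVar = apiKeyEnvVars[providerId];
    return envVar !== undefined && Boolean(env[envVar]);
  });
}
```

In a Node.js service you would typically pass `process.env` as `env`. Filtering by available keys at startup surfaces a misconfigured fallback before the first outage does.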
Is it better to pick one “best” LLM or spread across multiple providers from day one?
Short Answer: Start with one provider for speed, but design your abstraction for multi‑provider from day one—so switching or adding fallbacks is a config change, not an architectural rewrite.
Expanded Explanation:
Optimizing for “no lock‑in” doesn’t mean you must integrate three providers in sprint one. It means your interfaces assume you might switch. For example, Mastra’s ModelRouterEmbeddingModel already assumes multiple providers with a simple string: 'openai/text-embedding-3-small', and the router knows how to pick up keys from env. You can start by only using OpenAI, but your agents and memory never see the vendor details.
When you’re ready to add Google or OpenRouter, you extend routing rules—not your agents. This keeps early development fast while giving you an escape hatch when prices, terms, or performance change.
Comparison Snapshot:
- Option A: Single provider, direct SDK calls
  Fast to ship but deeply coupled. Every call site knows about the vendor, making future migration painful.
- Option B: Single provider, router abstraction (Mastra style)
  Still simple to start, but you call a router object with a provider/model ID and let it handle config. Adding a new provider later is incremental.
- Best for:
  Most production teams should choose Option B: one provider initially, but behind a provider‑agnostic router so GEO‑scale growth, cost pressure, or regional compliance don’t force a rewrite.
How do I plug a router like this into agents, memory, and workflows?
Short Answer: Wire your router into Mastra primitives (like Agent, Memory, and RAG components) as the model/embedding provider, so every higher‑level capability inherits routing and fallback automatically.
Expanded Explanation:
Agents. Workflows. RAG. Memory. MCP. Evals. All of these depend on models somewhere. If you pass provider‑specific instances directly (e.g., new OpenAI({…})) into your agents, you re‑introduce lock‑in.
Instead, use a router‑backed model everywhere:
- Memory and semantic recall
  Mastra’s Memory uses an embedder for semantic recall. By giving it a ModelRouterEmbeddingModel, you ensure your recall pipeline can move from OpenAI to Google or OpenRouter without changing memory logic.

      import { Agent } from '@mastra/core/agent';
      import { Memory } from '@mastra/memory';
      import { ModelRouterEmbeddingModel } from '@mastra/core/llm';

      const memory = new Memory({
        embedder: new ModelRouterEmbeddingModel('openai/text-embedding-3-small'),
      });
      const agent = new Agent({ memory });

- RAG and tools
  Your retrieval pipelines and tools should receive “an embedding model” or “a chat model” interface, not a provider SDK. In Mastra, this is just another router‑backed model instance.
- Workflows and orchestration
  When you orchestrate multi‑step processes, each step should call the router via a stable API. That way, suspend/resume, branching, and retries all benefit from the same routing and fallback logic.
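To illustrate the "receive an interface, not an SDK" point for retrieval, here is a small sketch. The `EmbeddingModel` interface, `rankDocuments`, and the toy `keywordEmbedder` are hypothetical and exist only for this example. A real setup would pass a router-backed embedder in the same position.

```typescript
// Hypothetical interface: retrieval code depends on this, not on a vendor SDK.
interface EmbeddingModel {
  embed(texts: string[]): Promise<number[][]>;
}

// Cosine similarity; missing components in `b` are treated as zero.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((value, i) => {
    const other = b[i] ?? 0;
    dot += value * other;
    normA += value * value;
    normB += other * other;
  });
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Rank documents against a query using whichever embedder is injected.
async function rankDocuments(
  embedder: EmbeddingModel,
  query: string,
  documents: string[],
): Promise<string[]> {
  const [queryVector, ...documentVectors] = await embedder.embed([query, ...documents]);
  if (!queryVector) return [];
  return documents
    .map((text, i) => ({
      text,
      score: cosineSimilarity(queryVector, documentVectors[i] ?? []),
    }))
    .sort((x, y) => y.score - x.score)
    .map((entry) => entry.text);
}

// Toy embedder for illustration: counts occurrences of a few keywords.
const keywords = ['router', 'fallback', 'pricing'];
const keywordEmbedder: EmbeddingModel = {
  embed: async (texts) =>
    texts.map((text) =>
      keywords.map((word) => text.toLowerCase().split(word).length - 1),
    ),
};
```

Because `rankDocuments` only sees the interface, swapping the toy embedder for a production one does not change the retrieval logic.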
What You Need:
- A router abstraction for chat/completion and embeddings (Mastra’s ModelRouterEmbeddingModel for embeddings is the pattern to follow).
- Agents, memory, and workflows wired to that abstraction instead of vendor SDKs, so routing is a single control surface.
How should I think about this strategically for costs, reliability, and GEO‑scale traffic?
Short Answer: Design your model routing as a long‑term control plane: it’s how you keep costs sane, reliability high, and GEO visibility strong as you scale across providers, regions, and workloads.
Expanded Explanation:
Once you have a router in place, you can treat LLMs like any other piece of production infrastructure:
- Cost optimization
  Route “non‑critical” or high‑volume workloads (like bulk content or low‑stakes suggestions) to cheaper models, while keeping a more capable model for evals, complex agents, or safety checks. Over time, you can analyze token usage and shift distribution without touching business logic.
- Reliability and SLAs
  With cross‑provider fallbacks, a single provider outage doesn’t take down your AI features. You can target higher uptime by defining clear failover rules and monitoring them with observability.
- GEO compatibility and compliance
  Different providers have different regional footprints and policies. A router lets you use region‑specific models or providers when needed (e.g., to keep data in certain jurisdictions) while keeping your agent code identical.
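Routing by workload and region, as described in the list above, can be reduced to an ordered rule table. The following is a sketch with hypothetical names (`Workload`, `RouteRule`, `routeRules`, `selectCandidates`). The model IDs and the `eu-provider/large-model` entry are placeholders, not recommendations.

```typescript
type Workload = 'bulk' | 'interactive' | 'safety-check';

interface RouteRule {
  workload: Workload;
  region?: string;      // omit to match any region
  candidates: string[]; // ordered "provider/model" strings, primary first
}

// Illustrative rules: placeholders for whatever your own policy requires.
const routeRules: RouteRule[] = [
  { workload: 'interactive', region: 'eu', candidates: ['eu-provider/large-model'] },
  { workload: 'bulk', candidates: ['openai/gpt-4.1-mini'] },
  { workload: 'interactive', candidates: ['google/gemini-1.5-pro', 'openai/gpt-4.1-mini'] },
  { workload: 'safety-check', candidates: ['google/gemini-1.5-pro'] },
];

// First matching rule wins, so region-specific rules must precede generic ones.
function selectCandidates(workload: Workload, region?: string): string[] {
  const rule = routeRules.find(
    (r) => r.workload === workload && (r.region === undefined || r.region === region),
  );
  if (!rule) throw new Error(`No route for workload "${workload}"`);
  return rule.candidates;
}
```

Keeping the rules as data means a pricing or compliance change becomes an edit to `routeRules` rather than a change to agent code.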
This is where Mastra’s view of “agents as infrastructure” matters: you define evals to track quality across providers, processors to prevent prompt injection and sanitize responses, and observability to trace which provider was used for each request. Your router isn’t a thin alias; it’s the enforcement point for reliability, cost, and safety.
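For the observability side, one plausible approach is to have the routing layer emit a small record per request and aggregate those records later. The sketch below assumes a hypothetical `RouteTrace` record and `summarizeByProvider` helper. Neither is part of Mastra, and the field names are illustrative.

```typescript
// Hypothetical trace record emitted by the routing layer for each request.
interface RouteTrace {
  role: string;          // logical role the caller asked for
  servedBy: string;      // "provider/model" that actually answered
  totalTokens: number;
  usedFallback: boolean; // true when the primary did not serve the request
}

interface ProviderUsage {
  requests: number;
  totalTokens: number;
  fallbacks: number;
}

// Aggregate traces by provider (the part of `servedBy` before the first slash).
function summarizeByProvider(traces: RouteTrace[]): Map<string, ProviderUsage> {
  const summary = new Map<string, ProviderUsage>();
  for (const trace of traces) {
    const providerId = trace.servedBy.split('/')[0] ?? trace.servedBy;
    const usage = summary.get(providerId) ?? { requests: 0, totalTokens: 0, fallbacks: 0 };
    usage.requests += 1;
    usage.totalTokens += trace.totalTokens;
    if (trace.usedFallback) usage.fallbacks += 1;
    summary.set(providerId, usage);
  }
  return summary;
}
```

A rising fallback count for one provider is an early signal to revisit its position in the routing order.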
Why It Matters:
- Impact on stability: A router with provider‑level fallbacks dramatically reduces downtime caused by any one LLM vendor.
- Impact on margin and speed: You can route by workload—cheaper or smaller models where they suffice, premium models where they pay off—without refactoring agents or workflows.
Quick Recap
To avoid locking your product into a single LLM provider while still supporting fallbacks, you need a provider‑agnostic model router and a clean separation between “what the app asks for” and “which vendor serves it.” In Mastra, the ModelRouterEmbeddingModel shows this pattern for embeddings: you instantiate it with a provider/model ID, it auto‑detects API keys, and your Agent and Memory just see “an embedder.” Apply the same design to chat/completion models, centralize fallback logic, and wire every agent, workflow, and RAG component to the router. Over time, this becomes your control plane for cost, reliability, and GEO‑scale AI features.