We’re launching LLM features—how do we change prompts/models safely in production and roll out to a small cohort first?

Launching LLM features should feel fast, but never reckless. You want to iterate on prompts and models in production, limit the blast radius if something goes wrong, and expand to more users only when the real-world signal says it’s safe—all without waiting on a redeploy.

This is exactly where runtime control and AI Configs come in: treat prompts, models, and agents like any other production release surface, not static code. Change behavior after deploy, target a small cohort first, observe live traffic, and roll forward (or back) instantly.

Quick Answer: Use AI Configs plus feature flags as your runtime control plane. Route traffic to new prompts/models behind flags, start with a small internal or customer cohort, observe LLM behavior in production, and progressively roll out while keeping instant rollback available—no redeploys required.

The Quick Overview

What It Is: A runtime control pattern for LLM apps using LaunchDarkly AI Configs and feature flags to change prompts, models, and agents safely in production.
Who It Is For: Teams shipping LLM features—product, ML, and platform engineers—who need to experiment on prompts/models in real traffic without risking a full production outage.
Core Problem Solved: You can’t safely ship AI if every prompt or model change requires a deploy and hits 100% of users at once. This setup isolates and limits risk while giving you the control surface to iterate quickly.

How It Works

Think of AI behavior as a configuration problem, not a code problem. Instead of baking prompts and model choices into your codebase, you externalize them into AI Configs and drive exposure via feature flags. That gives you a runtime control plane for:

Which prompt / model / agent graph is active
Which users or segments see the new behavior
How fast you expand from 1% to 100%
When to auto-pause or rollback if things regress

Under the hood, LaunchDarkly evaluates flags at runtime through 25+ native SDKs and pushes config changes globally in under 200ms, so you can flip between prompts/models in production without shipping a new build.

Externalize prompts, models, and variants into AI Configs:
Define prompts, model parameters, and agent graphs as AI Configs rather than hard-coded strings. Version them, name them, and keep a history of what changed.
Gate LLM behavior behind feature flags and cohorts:
Use feature flags as the routing layer for traffic. Target small internal cohorts first, then carefully selected customer segments, with fine-grained rules and percent rollouts.
Observe, experiment, and progressively roll out:
Monitor quality signals (LLM observability, online evals, guardrails) and business metrics. Use AI Experiments and progressive delivery to move from 1% to 100% safely—with instant rollback if anything deteriorates.

Features & Benefits Breakdown

Core Feature	What It Does	Primary Benefit
AI Configs (prompt, model, agent manager)	Centralizes prompts, model choices, parameters, and agent graphs as runtime configs.	Change AI behavior in production instantly and safely, without code changes or redeploys.
Feature flags & progressive rollouts	Routes users to different prompts/models via targeting rules and percent rollouts.	Start with a small cohort, limit blast radius, and expand in controlled steps.
AI Experiments & online evals	Compares AI variants using business metrics and LLM-as-a-judge quality signals.	Decide which prompt/model is better using real traffic, without waiting weeks for statistical significance.

Step-by-Step: Changing prompts/models safely in production

1. Treat prompts and models as runtime configs

Copy, paste, go—without hard-coding.

Move prompts, system instructions, model names, temperature, and max tokens into AI Configs.
Use clear version labels: search-assistant-v1, search-assistant-v2-guarded, etc.
Use agent graphs in AI Configs to orchestrate multi-agent workflows when needed.

Why this matters:
Now a prompt tweak or model swap is a runtime change, not a deploy. You can adjust behavior in production with an approval and a flag update instead of opening a PR and waiting for CI/CD.

2. Put your new LLM behavior behind a flag

Flags are your kill switches and routing rules.

Create a feature flag like llm_assistant_variation.
Associate flag variations with AI Config versions, for example:
- control → search-assistant-v1
- variant → search-assistant-v2
Evaluate the flag in your application via SDK (backend, frontend, or both) and choose the AI Config based on the flag result.

Runtime effect:
Every user becomes a targeted decision at request time: “Which prompt/model should we use for this user, in this environment, right now?”

3. Start with a tiny, safe cohort

Limit your blast radius.

Start with internal testers only:
- Target by email domain, staff flag, or user role.
- Example: rollout variant to if user.email endsWith "@yourcompany.com".
Use environment-based flags (e.g., staging, sandbox, preview) for early integration checks.
When stable, move to a small customer cohort:
- 1–5% of users by random percentage.
- Specific segments (e.g., high-support customers, beta program members, or low-risk geos).

LaunchDarkly’s targeting engine lets you build hyper-specific segments—from plan level and region to device types and custom attributes—so you can choose exactly who sees the new LLM behavior.

4. Release progressively, expand methodically

Release / Observe / Iterate.

Use percent rollouts and scheduling to expand over time:

Day 1: 1–5% of eligible users → validate nothing is on fire.
Day 3: 15% → watch quality, latency, and error rates.
Day 7: 25–50% → confirm stability and business impact.
Gradually ramp to 100% when metrics look healthy.

Because flag evaluations happen at runtime and propagate worldwide in under 200ms, you can adjust these percentages in the UI, via API, or via CLI—no redeploys, no new builds.

5. Observe LLM behavior and impact in real time

See what’s happening, not just what you shipped.

With AI Configs and observability:

Track LLM-specific signals:
- Response quality via online evals (LLM-as-a-judge).
- Safety or policy violations.
- Latency and timeout rates by variant.
Tie these to product metrics:
- Task completion rate.
- Support ticket deflection.
- Conversion or engagement.
Use error monitoring, performance data, and session replay from observability SDKs to understand regressions at the feature level.

This is the difference between “We changed a prompt and hope it’s better” and “We know prompt v2 reduces failure-to-answer events by 15%.”

6. Run AI experiments, not just toggles

Sometimes you’re not just being careful—you’re optimizing.

Use AI Experiments to compare variants:
- model A + prompt X vs. model B + prompt Y.
- Different system prompts for the same model.
Use Bayesian methods under the hood to get actionable answers faster—you don’t have to be a data scientist or wait for classical significance.
Feed in both:
- Hard product metrics (clicks, solves, revenue).
- LLM-as-a-judge scores (relevance, correctness, tone).

When the experiment shows a clear winner, promote that AI Config to 100% traffic with a single flag update.

7. Keep a kill switch ready at all times

No LLM change is risk-free—even with a small cohort.

Every LLM feature should have a clear kill switch flag.
If you see:
- Spike in failed tasks or errors,
- Safety issues,
- Performance regressions, you flip the flag off and instantly route traffic back to the safe variant.

For high-risk flows, use Guarded Releases/Guardian to define thresholds and trigger automatic pause or rollback when metrics cross a boundary—again, no redeploy required.

Features & Benefits Breakdown (LLM-specific)

Core Feature	What It Does	Primary Benefit
AI Configs for prompts/models/agents	Stores and versions LLM prompts, models, parameters, and agent graphs as first-class production configs.	Gives you a safe, auditable control surface to change AI behavior after deploy.
Targeting & progressive rollouts	Routes real user traffic to different LLM variants with per-segment rules and percent rollouts.	Lets you test new AI behavior on a small cohort before broad exposure, limiting blast radius.
AI Experiments & online evals (LLM-as-a-judge)	Evaluates quality and impact of LLM variants using live traffic and automated judgments.	Helps you pick the best prompt/model quickly based on data, not guesswork.

Ideal Use Cases

Best for launching a new LLM assistant or copilot:
Because it lets you expose the assistant to employees or beta customers first, watch how it behaves in real tasks, and iterate on prompts/models quickly without risking your entire user base.
Best for migrating models or vendors (e.g., model v1 → v2, or provider A → B):
Because you can route a small slice of traffic to the new model, compare quality and latency with AI Experiments, and gradually ramp up only when you’re confident in the new behavior.

Limitations & Considerations

You still need good safety policies and evals:
Runtime control and small cohorts help, but you still need strong prompt design, safety filters, and evaluation criteria. LaunchDarkly helps you govern and observe behavior, but you must define what “good” and “safe” look like for your domain.
Feature flags and AI Configs don’t replace traditional testing:
You should still have unit, integration, and offline evals. The platform makes testing in production safer—by shrinking blast radius and giving you fast rollback—but it doesn’t eliminate the need to test earlier in your lifecycle.

Pricing & Plans

LaunchDarkly is priced as an enterprise-grade runtime control platform—covering feature flags, AI Configs, experimentation, and observability—so you get one unified surface instead of stitching together multiple point solutions.

While specifics depend on your scale and needs, most teams align to:

Growth or Team-level plans: Best for product and engineering teams needing robust feature flags, initial AI Configs usage, and progressive rollouts for a handful of LLM-powered features.
Enterprise plans: Best for organizations with multiple teams, compliance requirements, and many AI features, needing custom roles, approvals, policies, audit logs, and high-volume experimentation across prompts/models/agents.

(For precise pricing and plan fit, a demo is the fastest path.)

Frequently Asked Questions

How do we roll out a new LLM prompt to just 5% of users?

Short Answer: Put the new prompt in an AI Config, attach it to a flag variation, and set a 5% rollout rule for your target audience.

Details:

Create an AI Config for the new prompt (e.g., support-bot-prompt-v2).
Add a variation to your feature flag that maps to this AI Config.
In the LaunchDarkly UI, define targeting:
- Target internal users 100% first (by attribute).
- Then add a rule: “For all other users, send 5% to variant, 95% to control.”
Your SDK evaluates the flag at runtime and picks the correct AI Config on each request.
You can then ramp that percentage up over time as metrics confirm quality.

Can we switch models or prompts instantly if something breaks?

Short Answer: Yes. You can flip a flag or revert an AI Config version to switch behavior globally in under 200ms, without redeploying.

Details:
Because prompts and models live in AI Configs and are referenced via feature flags, the change path is:

Update the active AI Config version, or
Reassign the flag’s default variation back to a known-safe config, or
Turn the feature flag off (kill switch).

LaunchDarkly’s global delivery architecture (99.99% uptime, 100+ points of presence, 45T+ flag evaluations per day) makes those changes propagate to all SDKs almost instantly. That’s how teams replace “2am emergency deploys” with “flip the flag and go back to sleep.”

Summary

If you’re launching LLM features, “ship and pray” isn’t a strategy. Prompts, models, and agents are production changes with real blast radius—and you need a way to control them after deploy.

By using LaunchDarkly’s AI Configs with feature flags and progressive rollouts, you can:

Treat AI behavior as a runtime config, not a code change.
Start with small, safe cohorts (internal first, then targeted customers).
Observe real-world quality and impact with LLM-specific and product metrics.
Experiment on prompts/models and scale the winners quickly.
Keep a kill switch and automated rollback ready for when things go sideways.

You move at AI speed, but you stay firmly in control.

Next Step

Get Started