
LMNT vs OpenAI TTS/Realtime: which is easier to run full-duplex (stream text in while audio streams out) and support barge-in?
Most teams only discover how hard full-duplex voice really is once they try to run text in and audio out on the same connection and let users barge in mid-utterance. At that point, details like streaming protocol, latency budgets, and server-side session control matter more than model quality. This is exactly the gap between LMNT and OpenAI’s TTS/Realtime stacks: both can generate great speech, but they’re not equally straightforward when you want true conversational turn-taking with barge-in.
Quick Answer: LMNT is generally easier to run in a full-duplex pattern with barge-in because it’s built as a low-latency, streaming TTS service that slots cleanly into your existing audio pipeline. You control when text enters, when audio stops, and how your ASR/LLM pair handles interruptions, without being locked into a single vendor’s end-to-end “realtime” orchestration. OpenAI Realtime APIs can support full-duplex, but they’re opinionated, less flexible to integrate into custom audio stacks, and more complex when you want fine-grained control over barge-in behavior.
Why This Matters
If you’re building a conversational app, agent, or game, you don’t just need “good TTS”—you need TTS that behaves like a person in a live conversation. That means:
- Audio starts within ~200 ms of text generation.
- You can keep feeding text while audio streams out.
- You can cut off speech instantly when the user interrupts.
- You can resume output seamlessly after handling the interruption.
If your stack can’t do this reliably, users will talk over your agent, miss key information, and conclude the system is “laggy” or “dumb” even if your LLM is excellent. Full-duplex and barge-in aren’t nice-to-haves; they’re what separate demos from products.
Key Benefits:
- Predictable latency for turn-taking: LMNT’s 150–200 ms low-latency streaming keeps TTS off the critical path, so your turn-taking budget is driven by your LLM and ASR, not by voice.
- Clean control over barge-in logic: With LMNT you own the state machine—stop, resume, or cross-fade audio at will—without fighting an opinionated “realtime” abstraction.
- Scales with your concurrency model: LMNT has no concurrency or rate limits, so you can spin up a WebSocket per session and not worry about hitting provider caps when traffic spikes.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Full-duplex streaming | Sending text to TTS while simultaneously receiving audio back over a live connection. | Enables your agent to “think and speak” continuously instead of waiting for full responses, reducing perceived latency. |
| Barge-in | User interrupts while the system is speaking, and the system stops or adjusts output in real time. | Critical for natural conversation—users don’t wait politely for bots to finish long monologues. |
| Latency budget | The max time you can spend on ASR + LLM + TTS + network before the experience feels laggy. | Determines whether your agent feels human-like; TTS must stay well under ~300 ms to leave room for ASR/LLM. |
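To make the latency budget concrete, here is a back-of-the-envelope check in Python. The component numbers are illustrative assumptions, not measured figures for any provider:

```python
# Rough turn-taking budget: time from end of user speech to first audio
# frame back to the user. All numbers below are illustrative assumptions.

def latency_budget(target_ms, components):
    """Return the slack (ms) left after summing per-component latencies."""
    return target_ms - sum(components.values())

# Hypothetical budget for a ~700 ms "feels responsive" target.
slack = latency_budget(700, {
    "asr_finalize": 150,     # endpointing + final transcript
    "llm_first_token": 250,  # time to first streamed token
    "tts_first_audio": 200,  # upper end of LMNT's stated 150-200 ms range
    "network": 60,           # round trips and transport overhead
})
print(slack)  # positive slack means the budget holds
```

If the slack goes negative, something upstream of TTS has to give, which is why keeping TTS pinned near 200 ms matters so much.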
How It Works (Step-by-Step)
Here’s how a typical full-duplex, barge-in capable flow looks, and how LMNT vs OpenAI fit into it.
1. Capture and transcribe user audio
You’re already running a bidirectional audio channel—usually via WebRTC (e.g., LiveKit) or WebSockets.
- User speaks; their audio is streamed up from the client.
- Your ASR transcribes partial hypotheses as they speak.
- Once you hit a confidence threshold or pause, you pass text to your LLM.
This part is mostly vendor-agnostic. Where things diverge is how you handle TTS and interruptions.
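For illustration, the pause/threshold hand-off in step 1 can be sketched as a minimal endpointer. The class, method names, and the 600 ms threshold are assumptions for the sketch, not any vendor's API:

```python
# Minimal pause-based endpointer: accumulate partial ASR hypotheses and
# flush the utterance to the LLM once the user has been silent long
# enough. The pause threshold is an illustrative assumption.

class Endpointer:
    def __init__(self, pause_ms=600):
        self.pause_ms = pause_ms
        self.partial = ""
        self.last_speech_ms = 0

    def on_partial(self, text, now_ms):
        """Called each time the ASR emits an updated partial hypothesis."""
        self.partial = text
        self.last_speech_ms = now_ms

    def poll(self, now_ms):
        """Return the finished utterance once the pause threshold elapses."""
        if self.partial and now_ms - self.last_speech_ms >= self.pause_ms:
            done, self.partial = self.partial, ""
            return done
        return None
```

A confidence-based gate would replace the time check with a score threshold from the ASR, but the shape of the loop is the same.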
2. Stream LLM output into TTS
For full-duplex, you don’t wait for the LLM to finish. You stream tokens out as soon as they’re available.
With LMNT:
- Open a streaming TTS connection (e.g., WebSocket) to LMNT for each session.
- As soon as the LLM yields a phrase or sentence, push partial text into LMNT.
- LMNT begins sending back audio frames in 150–200 ms, which you pipe directly into your existing audio transport (WebRTC, WebSocket, or native).
- Keep feeding text chunks while audio flows out; LMNT simply keeps generating speech in order.
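One piece of this flow you own is deciding when a text chunk is "phrase-sized" enough to push to the TTS socket. A minimal sketch, assuming simple punctuation boundaries (the boundary set and minimum length are illustrative, not part of any API):

```python
# Chunk an incremental LLM token stream into phrase-sized pieces before
# pushing them to a streaming TTS connection. Boundary characters and
# the minimum chunk length are illustrative assumptions.

PHRASE_ENDS = (".", "!", "?", ",", ";", ":")

def phrases(tokens, min_chars=24):
    """Yield phrase-sized text chunks from an incremental token stream."""
    buf = ""
    for tok in tokens:
        buf += tok
        if len(buf) >= min_chars and buf.rstrip().endswith(PHRASE_ENDS):
            yield buf
            buf = ""
    if buf:  # flush whatever the LLM left unterminated
        yield buf
```

Each yielded chunk would be written to the open TTS connection as soon as it is available, so audio generation stays just behind the LLM instead of waiting for the full response.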
With OpenAI TTS / Realtime:
- Classic TTS API is request/response; you send a full prompt and get back an audio file or stream. This is fine for one-shot responses but awkward for incremental, full-duplex generation—each request is a new “chunk,” which complicates timing and smooth playback.
- Realtime API introduces a unified audio + text + tool-calling session. You can in principle stream user audio in and receive TTS out. But:
- You must conform to OpenAI’s realtime protocol and event schema.
- You’re coupling ASR, LLM reasoning, and TTS into a single vendor pipeline.
- Integrating that into an existing stack (e.g., you already have ASR/LLM infra, or WebRTC routing) means working around their orchestration rather than just pulling in TTS.
If you want a drop-in voice output component, LMNT is simpler. If you’re okay handing more of your conversational loop to one provider, OpenAI Realtime can work—but you have less surgical control.
3. Implement barge-in: detect, stop, and resume
Barge-in is mostly about control:
- Detect that the user has started speaking.
- Immediately stop (or duck) TTS output.
- Decide what to do with the partially spoken content.
- Resume once you’ve processed the new user input.
With LMNT:
You implement barge-in in your own state machine:
- Detect interrupt: your audio pipeline sees upstream energy (or ASR starts emitting new tokens) while TTS is playing.
- Stop TTS playback: stop sending LMNT's audio to the user, close the current LMNT stream and reopen when you're ready, or both. Because LMNT is just a streaming TTS endpoint, it doesn't fight you; there's no global session logic you can't override.
- Handle the new user utterance: send text to your LLM and generate a new answer.
- Resume TTS with new content: push the new answer into LMNT over the same or a new streaming connection. Audio starts again within 150–200 ms.
In practice, you can even implement soft barge-in (duck instead of hard stop) by reducing playback gain while the user speaks, then resuming.
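All of that control logic lives in your app, which is the point. A minimal barge-in state machine covering both hard stop and ducking might look like this; the state names and gain values are illustrative assumptions, not anything LMNT prescribes:

```python
# Minimal barge-in state machine for the loop above. State names and
# gain values are illustrative; LMNT just streams audio, so all of this
# control logic lives in the application.

class BargeIn:
    def __init__(self, duck_gain=0.2):
        self.state = "SPEAKING"
        self.gain = 1.0          # playback gain applied to TTS audio
        self.duck_gain = duck_gain

    def on_user_speech(self, hard=True):
        """User audio detected while TTS is playing."""
        if hard:
            self.state = "INTERRUPTED"  # stop forwarding TTS audio
            self.gain = 0.0
        else:
            self.state = "DUCKED"       # soft barge-in: lower the gain
            self.gain = self.duck_gain

    def on_new_answer(self):
        """Fresh LLM answer ready: resume on a (possibly new) TTS stream."""
        self.state = "SPEAKING"
        self.gain = 1.0
```

Your audio layer consults `state` and `gain` on every frame it forwards, so cutting off speech is just a state transition rather than a vendor negotiation.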
With OpenAI Realtime:
Barge-in is entangled with the provider’s session semantics:
- The same session handles incoming user audio, live transcription, LLM reasoning, and TTS back out.
- To implement barge-in, you need to:
  - Parse and respond to OpenAI's realtime events.
  - Potentially cancel or modify an in-progress assistant response.
  - Let the model decide what to do with the interruption, or inject your own control messages.
This can work but is less straightforward if you want deterministic behavior (e.g., always cut off speech as soon as user audio spikes). You’re working inside OpenAI’s orchestration layer instead of treating TTS as a simple, controllable component.
4. Scale out: many sessions, no throttling
For production agents and games, you’ll be running lots of concurrent sessions.
LMNT:
- Built for “conversational apps, agents, and games.”
- No concurrency or rate limits, so one WebSocket per user/session is the natural architecture.
- Pricing is character-based and gets better with volume—your cost model scales as sessions grow.
- SOC-2 Type II and enterprise plans are available “when you’re ready,” so crossing from prototype to production doesn’t require a vendor reset.
OpenAI:
- You’re subject to per-model rate limits and concurrency caps that may change by plan or account status.
- Realtime endpoints share quota with other usage; spikes in one part of your stack can impact your voice concurrency.
- You may need batching and backpressure logic to avoid hitting caps when many users talk at once.
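If you do have to live within provider caps, the batching-and-backpressure logic mentioned above often starts as a simple token bucket in front of the rate-limited endpoint. A sketch, with capacity and refill rate as illustrative assumptions:

```python
# Simple token-bucket limiter you might put in front of a rate-capped
# TTS endpoint. Capacity and refill rate are illustrative assumptions.

class TokenBucket:
    def __init__(self, capacity, refill_per_s):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_s = refill_per_s
        self.last_ms = 0

    def allow(self, now_ms):
        """Refill based on elapsed time, then try to spend one token."""
        elapsed = (now_ms - self.last_ms) / 1000.0
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_ms = now_ms
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller must queue or shed this request
```

Every `False` here is a request you are queueing or dropping, which is exactly the kind of nondeterminism that makes barge-in timing hard to reason about.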
For full-duplex, each extra layer of throttling or backoff directly impacts perceived responsiveness—the moment you start queueing TTS or having to reuse sessions creatively to dodge limits, barge-in behavior gets harder to reason about.
Common Mistakes to Avoid
- Treating TTS as a blocking step: don't wait for your LLM to produce a full paragraph before calling TTS. Stream partial segments into LMNT as they're generated so audio starts within ~200 ms and continues smoothly.
- Hardwiring barge-in to the TTS provider's session logic: if you rely on vendor-specific semantics (e.g., an assistant event you don't fully control), your barge-in behavior will be brittle. Keep barge-in logic in your own state machine: detect interruptions in your audio layer and explicitly stop/resume TTS output.
Real-World Example
Imagine you’re building a car dealership assistant like “Big Tony’s Auto Emporium”, using LiveKit for audio and LMNT for voice:
- The user taps “Talk,” audio streams to your backend over WebRTC.
- Your ASR emits partial text; you buffer until a pause, then hit your LLM.
- As soon as you get the first sentence of Tony’s answer, you:
- Send it to LMNT over a streaming TTS connection.
- Pipe LMNT’s audio packets back through LiveKit to the user.
- While Tony is speaking, the user says “Wait, what was the warranty again?”
- LiveKit detects upstream voice; your backend marks an INTERRUPT state.
- You drop LMNT audio to that user and cancel that LMNT stream.
- You feed the new question to the LLM and start a fresh LMNT stream with the new answer.
The user experiences this like talking to a real salesperson: they can cut Tony off mid-sales pitch, and he immediately pivots.
You can build the same flow with OpenAI Realtime, but you’ll be wiring LiveKit → OpenAI and letting their session decide how to juggle ASR, LLM, and TTS, or you’ll be fighting the abstraction to integrate your own components.
Pro Tip: Start by proving out your audio pipeline + barge-in state machine with LMNT’s free Playground and API before you commit to any end-to-end “realtime” orchestration. Once you know your latency and interruption behavior are solid, it’s much easier to swap LLMs than to unwind a tightly coupled voice stack.
Summary
Running full-duplex text-in / audio-out with robust barge-in is more about control and latency than about which model “sounds better.”
- LMNT gives you a focused, low-latency streaming TTS service (150–200 ms) with no concurrency limits and simple WebSocket-style integration. You stay in charge of the audio graph and barge-in logic, which makes it easier to build conversational apps, agents, and games that behave like real people.
- OpenAI TTS/Realtime can handle full-duplex, but you’re working inside a more opinionated orchestration layer, with quota considerations and less granular control over how speech starts, stops, and is interrupted.
If your goal is a production-ready, full-duplex voice experience with clean barge-in, LMNT tends to be the easier building block to wire into your existing stack.