LMNT vs OpenAI TTS/Realtime: which is easier to run full-duplex (stream text in while audio streams out) and support barge-in?

Most teams don’t discover how hard “full-duplex voice” really is until they try to make an agent talk and listen at the same time: streaming text in while audio streams out, letting users barge in mid-sentence, and keeping latency low enough that it still feels like a conversation. That pattern is what the LMNT vs OpenAI TTS/Realtime comparison comes down to for full-duplex and barge-in.

Quick Answer: If your goal is to run true full‑duplex voice (stream text in while audio streams out) with reliable barge‑in, LMNT’s streaming TTS is generally easier to wire into a real-time stack because it behaves like a focused, low-latency audio output service. OpenAI’s Realtime APIs are powerful but more opinionated and stateful, so you’ll do more orchestration work to get predictable barge‑in and turn‑taking under production load.

Why This Matters

If you’re building conversational agents, games, or tutors, “voice that talks back” isn’t enough. You need:

  • Full‑duplex so text can stream to TTS while audio is still playing.
  • Barge‑in so users can interrupt and the system responds instantly.
  • Latency budgets that keep total round-trip (user → ASR → LLM → TTS → user) in the sub‑second range.

The wrong TTS/Realtime choice can force you into brittle hacks: clipping audio buffers, re-transcribing output, or tearing down WebSockets on every turn. The right approach slots into your pipeline so you can just stream text, stream audio back, and handle interruptions at the session layer.

Key Benefits:

  • Faster integration for full-duplex voice: LMNT’s low-latency streaming TTS can be dropped into an existing ASR + LLM stack with simple WebSocket logic and no extra concurrency constraints.
  • More predictable barge-in behavior: Separating “audio out” (LMNT) from “logic and ASR” (your app) gives you explicit control over when to pause, flush, or resume speech.
  • Better fit for production scaling: LMNT is designed for conversational apps, agents, and games with no concurrency or rate limits, plus predictable character-based pricing that improves with volume.

Core Concepts & Key Points

Concept | Definition | Why it's important
Full-duplex streaming | Sending text into TTS while receiving audio back over the same time window, typically via WebSockets. | Enables overlapping compute and playback, reducing perceived latency and making agents feel responsive instead of turn-based.
Barge-in | Letting the user interrupt speech playback (by talking or acting), and cutting or reprioritizing TTS output instantly. | Critical for natural conversations; without barge-in, voice agents feel like IVR systems from the 90s.
Turn-taking latency budget | The end-to-end time from user speech start → system response start, often targeted at < 700–1000ms. | Dictates whether your experience feels conversational; LMNT’s 150–200ms TTS streaming keeps the TTS part of that budget small.
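
To make the latency budget concrete, here is a minimal sketch that adds up per-stage estimates and checks them against a target. Only the TTS figure (150–200ms) comes from this article; the ASR, LLM, and network numbers are illustrative assumptions, and `TurnBudget` is a hypothetical helper, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnBudget:
    """Per-stage latency estimates for one conversational turn, in milliseconds."""
    asr_final_ms: int        # end of user speech -> usable transcript (assumed)
    llm_first_token_ms: int  # transcript -> first LLM tokens (assumed)
    tts_first_audio_ms: int  # first text -> first audio (150-200 ms per the article)
    network_ms: int          # client <-> server transport overhead (assumed)

    def total_ms(self) -> int:
        return (self.asr_final_ms + self.llm_first_token_ms
                + self.tts_first_audio_ms + self.network_ms)

    def fits(self, target_ms: int) -> bool:
        return self.total_ms() <= target_ms


budget = TurnBudget(asr_final_ms=200, llm_first_token_ms=300,
                    tts_first_audio_ms=200, network_ms=80)
print(budget.total_ms(), budget.fits(1000), budget.fits(700))
```

With these assumed numbers the turn fits a 1000ms target but misses a 700ms one, which shows how little room is left once ASR and the LLM have taken their share.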

How It Works (Step-by-Step)

Here’s the high-level flow if you’re building a full‑duplex, barge‑in capable system with LMNT vs with OpenAI’s Realtime stack.

1. LMNT as the streaming TTS layer

LMNT is “just” streaming TTS—on purpose. It ships the piece of the pipeline that must be fast, lifelike, and scalable:

  • Low latency streaming: 150–200ms to first audio, so TTS is not your bottleneck.
  • 24 languages with mid-sentence switching: Useful when your LLM code-switches or you serve global users.
  • Studio quality voice clones from 5 seconds of audio: You can get a production-quality voice per character or per user with minimal capture.
  • No concurrency or rate limits + SOC‑2 Type II: You don’t have to put in artificial throttles just to avoid hitting vendor caps.

Typical full‑duplex flow with LMNT:

  1. User speaks → Your app (or a service like Deepgram / Whisper / etc.) handles ASR via a streaming API.
  2. LLM generates text → As tokens arrive, you stream them to LMNT’s TTS via WebSocket.
  3. LMNT streams audio back, with the first audio typically arriving within 150–200ms, and you forward each chunk to the client for immediate playback.
  4. User barges in → You detect speech onset on the client or server and:
    • Stop or fade out the current LMNT stream.
    • Optionally close the TTS WebSocket or just stop feeding it text.
    • Spin up a new turn once you have new LLM output.
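
The flow above can be sketched with two concurrent tasks: one feeding text in, one reading audio out. This is a shape-only illustration. `FakeStreamingTTS` is a hypothetical in-memory stand-in, not LMNT's SDK or wire protocol, and the "audio" is just the encoded text.

```python
import asyncio


class FakeStreamingTTS:
    """Hypothetical in-memory stand-in for a streaming TTS socket.

    This is NOT LMNT's API: send_text() plays the role of "write text to the
    socket" and audio() plays the role of "read audio frames from the socket".
    """

    def __init__(self):
        self._text = asyncio.Queue()  # text chunks; None marks end of turn

    async def send_text(self, text):
        await self._text.put(text)

    async def finish(self):
        await self._text.put(None)

    async def audio(self):
        while (text := await self._text.get()) is not None:
            await asyncio.sleep(0)      # pretend synthesis takes a moment
            yield text.encode("utf-8")  # placeholder bytes, not real audio


async def run_turn(tokens):
    tts = FakeStreamingTTS()
    events = []  # ordered log of what happened, to make the overlap visible

    async def feed_text():
        for token in tokens:
            await tts.send_text(token)
            events.append(("text_in", token))
            await asyncio.sleep(0)  # let the audio reader run between tokens
        await tts.finish()

    async def play_audio():
        async for frame in tts.audio():
            events.append(("audio_out", frame))  # stand-in for playback

    # Both directions run concurrently: this is the full-duplex part.
    await asyncio.gather(feed_text(), play_audio())
    return events


events = asyncio.run(run_turn(["Hello", " there", "!"]))
for kind, payload in events:
    print(kind, payload)
```

The printed log interleaves `text_in` and `audio_out` entries: audio for early tokens is already being "played" while later tokens are still being sent. A real integration would swap the fake class for the vendor's streaming client and keep the same two-task structure.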

Because LMNT is not trying to own ASR, LLM, or session logic, your full‑duplex stack is:

  • Composable: Any ASR, any LLM, any signaling layer (WebRTC, LiveKit, WebSockets).
  • Explicit: You own the state machine for turn-taking and barge-in.
  • Predictable: No hidden “agent” behavior that conflicts with your own logic.

The “Big Tony’s Auto Emporium” demo (realtime speech-to-speech using LiveKit) is essentially this pattern: speech in, LLM reasoning, LMNT streaming speech out, all over realtime transport.

2. OpenAI TTS/Realtime as an integrated voice agent stack

OpenAI’s Realtime APIs aim to bundle ASR + LLM + TTS + session state into one place. You typically:

  1. Open a Realtime session (over WebSocket, or WebRTC from a browser client).
  2. Stream audio in.
  3. Let the server-side agent transcribe, reason, and generate TTS.
  4. Receive audio out, often on the same session.

You can absolutely implement full‑duplex behavior here (audio in while audio out), but:

  • State is more implicit: You’re working with server-managed “sessions” and “responses,” not just a pure TTS stream.
  • Barge-in semantics are baked into API features: You often have to align your UX to OpenAI’s idea of turn-taking, interruption, and tool calls.
  • Orchestration is less modular: Swapping ASR, TTS, or LLM components independently is harder once you commit to an all-in-one stack.

This is powerful if you want a managed agent and don’t care about fine‑grained control. But if you’re already running your own LLM, ASR, or game loop—and you just need low-latency voice out—this integrated approach can feel constraining for full‑duplex and barge‑in.
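
For comparison, here is a rough sketch of what barge-in tends to look like when you work inside a session-based API: you send control messages about server-side state rather than simply stopping a stream. The event names and fields below (`input_audio_buffer.append`, `response.cancel`, `conversation.item.truncate`) reflect my understanding of OpenAI's published Realtime client events and may have changed; treat them as assumptions and check the current API reference. The helper functions themselves are hypothetical.

```python
import base64
import json


def append_audio_event(pcm_chunk: bytes) -> str:
    """Client message carrying one chunk of microphone audio, base64-encoded."""
    return json.dumps({
        "type": "input_audio_buffer.append",  # assumed event name
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def cancel_response_event() -> str:
    """Client message asking the server to stop the in-flight response."""
    return json.dumps({"type": "response.cancel"})  # assumed event name


def truncate_item_event(item_id: str, played_ms: int) -> str:
    """Client message telling the server how much audio the user actually heard."""
    return json.dumps({
        "type": "conversation.item.truncate",  # assumed event name
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": played_ms,
    })


def barge_in_messages(item_id: str, played_ms: int) -> list:
    """Messages to send, in order, when the user interrupts playback."""
    return [cancel_response_event(), truncate_item_event(item_id, played_ms)]


for message in barge_in_messages("item_123", played_ms=1200):
    print(message)
```

The point of the sketch is the extra bookkeeping: besides stopping local playback, your client has to track which conversation item is playing and how many milliseconds were heard, so the server's view of the conversation stays in sync with what the user experienced.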

3. Putting it together: which is easier for full‑duplex and barge-in?

  • With LMNT, full‑duplex is: “open WebSocket → stream text tokens → stream audio frames back.” Barge‑in is: “detect user speech → stop sending text / close stream → open new stream for the new turn.” You compose it with your own logic and infrastructure.
  • With OpenAI Realtime, full‑duplex is available, but you’re working inside the semantics of a managed agent. Barge‑in is mediated by their session model and your ability to interrupt or cancel responses in-flight.

If you already have—or want—your own ASR/LLM orchestration, LMNT is typically easier to wire into a clean full‑duplex, barge‑in capable pipeline.

Common Mistakes to Avoid

  • Treating TTS as a blocking step:
    Don’t wait for the full LLM response before sending text to TTS. With LMNT, stream tokens as soon as you get them to minimize latency and enable overlap between thinking and speaking.

  • Embedding barge-in logic inside TTS:
    Barge‑in should live at the app/session layer, not in TTS. Use LMNT as a stateless, controllable audio-out stream. Your app should decide when to cut, fade, or resume—this keeps behavior predictable and vendor-agnostic.
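
One way to keep barge-in at the session layer is to run each spoken turn as a cancellable task that your app owns. The sketch below assumes an asyncio-based server. `speak_turn` is a hypothetical function in which a list append stands in for "write text to the TTS socket" and the `"<flushed>"` marker stands in for "stop playback and close or flush the stream".

```python
import asyncio


async def speak_turn(text_chunks, sent):
    """Feed text to a (hypothetical) TTS sink until finished or cancelled."""
    try:
        for chunk in text_chunks:
            sent.append(chunk)         # stand-in for sending text to TTS
            await asyncio.sleep(0.02)  # pacing between LLM chunks (assumed)
    except asyncio.CancelledError:
        sent.append("<flushed>")       # stand-in for stop playback / close stream
        raise                          # let the caller see the cancellation


async def demo():
    sent = []
    chunks = [f"chunk-{i}" for i in range(50)]
    turn = asyncio.create_task(speak_turn(chunks, sent))

    await asyncio.sleep(0.05)  # the user starts talking mid-sentence
    turn.cancel()              # the app, not the TTS layer, decides to cut
    try:
        await turn
    except asyncio.CancelledError:
        pass                   # expected: the turn was interrupted on purpose
    return sent


sent = asyncio.run(demo())
print(len(sent) - 1, "chunks sent before barge-in; last entry:", sent[-1])
```

Because the interruption is just a task cancellation in your own code, the same logic works regardless of which TTS vendor sits behind `speak_turn`.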

Real-World Example

Imagine you’re shipping a voiced NPC for a multiplayer game:

  • You run your own game loop plus an LLM for dialog.
  • Players talk over proximity voice; you run ASR in the background.
  • You want the NPC to start speaking within ~300–400ms after the LLM has enough tokens, and to stop instantly if a player interrupts.

With LMNT:

  1. Each NPC has a WebSocket to LMNT’s streaming TTS.
  2. As your LLM produces tokens (e.g., every 5–10 tokens), you stream them to LMNT.
  3. Your game client starts playing audio as soon as it receives the first chunk (typically 150–200ms after the first text is sent).
  4. When any player addresses the NPC mid‑speech, you:
    • Detect the new turn (via ASR or VOIP events).
    • Immediately stop audio playback and close or flush the LMNT stream.
    • Run a new LLM query and start streaming the fresh response to LMNT.
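
Step 2 above (sending text every few tokens) can be sketched as a small batching generator. The batch size and the choice to flush early at sentence-ending punctuation are assumptions made for illustration; `batch_tokens` is a hypothetical helper, and the right chunking policy depends on your LLM and TTS.

```python
from typing import Iterable, Iterator


def batch_tokens(tokens: Iterable[str], max_tokens: int = 8,
                 flush_on: str = ".!?") -> Iterator[str]:
    """Group LLM tokens into text chunks for TTS.

    A chunk is emitted when it reaches max_tokens, or earlier when a token
    ends with sentence punctuation, so speech can start on a natural boundary.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        ends_sentence = token.rstrip().endswith(tuple(flush_on))
        if len(buffer) >= max_tokens or ends_sentence:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the LLM stream ends
        yield "".join(buffer)


llm_tokens = ["Well", ",", " hello", " there", ".", " What", " brings", " you",
              " to", " the", " shop", " today", "?"]
chunks = list(batch_tokens(llm_tokens, max_tokens=5))
print(chunks)
```

Smaller batches reduce time to first audio but give the TTS less context per chunk, so the batch size is worth tuning against how the output actually sounds.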

No TTS-side magic. Just clear session logic and a low-latency, unconstrained audio-out service.

Pro Tip: Build your barge-in logic as a pure state machine—“Idle → Speaking → Interrupted → Listening → Speaking”—and treat LMNT as an output device you can pause, flush, or re-open at will. This keeps your design portable, whether you’re targeting web, mobile, or a game engine like Unity.
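
The state machine from the tip can be written as a plain transition table with no I/O in it. The state names follow the tip; the event names (`RESPONSE_READY`, `USER_SPEECH_STARTED`, `OUTPUT_FLUSHED`, `PLAYBACK_FINISHED`) are assumptions introduced for this sketch.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()
    LISTENING = auto()


class Event(Enum):
    RESPONSE_READY = auto()       # LLM text is available to speak
    USER_SPEECH_STARTED = auto()  # voice activity detected from the user
    OUTPUT_FLUSHED = auto()       # playback stopped and TTS stream closed
    PLAYBACK_FINISHED = auto()    # agent finished speaking uninterrupted


TRANSITIONS = {
    (State.IDLE, Event.RESPONSE_READY): State.SPEAKING,
    (State.IDLE, Event.USER_SPEECH_STARTED): State.LISTENING,
    (State.SPEAKING, Event.USER_SPEECH_STARTED): State.INTERRUPTED,
    (State.SPEAKING, Event.PLAYBACK_FINISHED): State.IDLE,
    (State.INTERRUPTED, Event.OUTPUT_FLUSHED): State.LISTENING,
    (State.LISTENING, Event.RESPONSE_READY): State.SPEAKING,
}


def step(state: State, event: Event) -> State:
    """Pure transition function: unknown (state, event) pairs are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.IDLE):
    """Apply events in order and return every state visited."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace


trace = run([Event.RESPONSE_READY, Event.USER_SPEECH_STARTED,
             Event.OUTPUT_FLUSHED, Event.RESPONSE_READY])
print(" -> ".join(s.name for s in trace))
```

Keeping the transitions pure makes them easy to unit test, and the side effects (stopping playback, closing the TTS stream, starting ASR) can be attached to state changes in whatever runtime you target.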

Summary

For full‑duplex streaming and robust barge‑in, the question isn’t just “LMNT vs OpenAI TTS/Realtime,” it’s “modular TTS vs managed agent.” LMNT gives you a fast, lifelike, affordable streaming TTS layer (150–200ms, 24 languages, 5‑second voice clones, no concurrency limits) that you can compose with any ASR, LLM, and signaling stack. That makes it easier to implement real full‑duplex (text in while audio goes out) and to own your barge‑in behavior at the application level.

OpenAI’s Realtime stack is powerful if you want a bundled agent, but you’ll trade off some control and face more complexity when aligning their session semantics with your game loop or app state machine. If you already think in terms of latency budgets and turn‑taking, LMNT’s focused streaming TTS is usually the simpler path to a production‑ready, full‑duplex voice experience.
