LMNT vs ElevenLabs: which has lower p95 time-to-first-audio and less jitter for real-time streaming TTS?

Quick Answer: For real-time streaming TTS where p95 time-to-first-audio and jitter actually gate the experience, LMNT is engineered specifically for conversational latency: it targets ~150–200ms time-to-first-audio for streaming and is used in production by agents, tutors, and games that need stable turn-taking. ElevenLabs doesn’t publish comparable, benchmark-ready p95 TTFA numbers for WebSocket streaming, so the only honest comparison is to measure both in your environment, but LMNT’s architecture, lack of rate limits, and production use cases give it a strong edge for low, predictable latency under load.

Why This Matters

If you’re shipping a voice agent, tutor, or NPC, latency isn’t a nice-to-have—it decides whether your product feels like a conversation or a voicemail tree. p95 time-to-first-audio (TTFA) determines how long users wait before the voice starts talking; jitter determines whether the stream feels smooth or “choppy” as packets arrive.

When these numbers creep up—especially beyond ~300–400ms at p95—users start talking over the agent, interrupt logic breaks, and multiplayer or streamed experiences feel out of sync. Choosing between LMNT and ElevenLabs on real-time performance means choosing whether your system can reliably hit conversational turn-taking at scale.

Key Benefits:

  • Faster turn-taking: Lower p95 TTFA keeps response starts in the 150–200ms range, so users talk over the agent less and the exchange feels more “live.”
  • Smoother audio flow: Less jitter means your streaming voice doesn’t stutter or buffer, even for longer responses.
  • Production-ready scaling: A stack designed without concurrency or rate limits reduces tail latency spikes when you hit traffic peaks.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| p95 time-to-first-audio (TTFA) | The point where 95% of streaming requests have started delivering audio, measured from when you send text to when you receive the first audio chunk. | This is your “worst typical” wait time. If p95 TTFA is low (e.g., ~150–250ms), almost all users experience snappy responses; if it’s high, the experience feels laggy even if the average is good. |
| Jitter (streaming variance) | The variability in timing between audio chunks arriving over the stream. Even with low average latency, high jitter makes audio feel uneven. | High jitter causes micro-pauses or bursts in playback, breaks lip-sync, and can destabilize barge-in or turn-taking logic. |
| Real-time streaming TTS | Text-to-speech delivered over a streaming protocol (often WebSockets) where audio is generated and sent chunk-by-chunk rather than as a single file. | Real-time streaming is what allows agents, games, and tutors to talk while thinking, instead of waiting for a full utterance to render. Your stack’s p95 + jitter define how convincing this feels. |
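Both metrics fall straight out of per-chunk timestamps. A minimal sketch of the math (plain Python with a nearest-rank percentile; the timestamp lists are hypothetical, not tied to either provider’s API):

```python
import statistics

def ttfa_ms(send_ts: float, first_chunk_ts: float) -> float:
    """Time-to-first-audio for one request, in milliseconds."""
    return (first_chunk_ts - send_ts) * 1000.0

def p95(values: list[float]) -> float:
    """95th percentile via nearest-rank on the sorted samples."""
    ordered = sorted(values)
    return ordered[max(0, int(len(ordered) * 0.95) - 1)]

def jitter_ms(chunk_ts: list[float]) -> float:
    """Jitter as the std-dev of inter-chunk intervals, in milliseconds."""
    gaps = [(b - a) * 1000.0 for a, b in zip(chunk_ts, chunk_ts[1:])]
    return statistics.stdev(gaps) if len(gaps) > 1 else 0.0
```

Log `send_ts` and every chunk arrival per request, then compute `p95` over the per-request `ttfa_ms` values and `jitter_ms` per stream.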

How It Works (Step-by-Step)

From an engineering standpoint, comparing LMNT vs ElevenLabs for real-time performance comes down to instrumented testing, not marketing copy. Here’s a practical path I’d use as a product engineer evaluating both:

  1. Define your latency budget

    • Decide what “good enough” means for your use case:
      • Conversational agents: aim for ≤250ms p95 TTFA and stable packet delivery.
      • Games / NPCs: similar targets, but jitter matters more for lip-sync and spatial audio.
    • Set clear metrics:
      • p50, p90, p95 TTFA
      • p95 inter-chunk delay
      • Error/timeout rate under load
  2. Set up parallel streaming tests

    • Implement WebSocket streaming clients for LMNT and ElevenLabs with identical:
      • Hardware and network conditions
      • Text prompts (short agent replies, longer explanations, code-switched utterances if you care about multilingual)
    • For LMNT:
      • Try voices in the LMNT Playground to confirm quality and language coverage (24 languages, mid-sentence switching).
      • Move to the Developer API and wire up streaming calls, instrumenting timestamps at:
        • Text send
        • First audio chunk received
        • Each subsequent chunk
    • For ElevenLabs:
      • Mirror the same flow and instrumentation with their streaming API.
  3. Load test and compare p95 + jitter

    • Ramp concurrent sessions (e.g., 1 → 10 → 100 → 500 connections) with realistic traffic patterns:
      • Bursty loads (agents answering many small queries)
      • A few long-running sessions (tutors or game sessions)
    • For each provider, compute:
      • p50/p90/p95 TTFA per load level
      • Jitter: standard deviation of inter-chunk intervals and p95 gap between chunks
    • Watch for:
      • Tail latency spikes when you increase concurrency
      • Rate limit / throttle behavior that isn’t obvious from docs
    • LMNT’s “No concurrency or rate limits” and “We’ll scale with you” positioning is specifically about avoiding these tail issues under growth; ElevenLabs may impose effective ceilings you have to engineer around.
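The load-test loop in step 3 can be sketched with `asyncio`. Note that `stream_tts` below is a simulated placeholder (it just yields silence on a timer); swap in the real LMNT or ElevenLabs WebSocket client you built in step 2:

```python
import asyncio
import statistics
import time

async def stream_tts(provider: str, text: str):
    """Placeholder streaming client: replace with the real
    LMNT / ElevenLabs WebSocket call from step 2."""
    for _ in range(5):
        await asyncio.sleep(0.02)   # simulated network + generation delay
        yield b"\x00" * 320         # fake audio chunk

async def measure_session(provider: str, text: str) -> dict:
    """Run one streaming request and record TTFA + jitter."""
    send = time.monotonic()
    chunk_ts = []
    async for _chunk in stream_tts(provider, text):
        chunk_ts.append(time.monotonic())
    gaps = [(b - a) * 1000.0 for a, b in zip(chunk_ts, chunk_ts[1:])]
    return {
        "ttfa_ms": (chunk_ts[0] - send) * 1000.0,
        "jitter_ms": statistics.stdev(gaps) if len(gaps) > 1 else 0.0,
    }

async def load_test(provider: str, concurrency: int) -> float:
    """Fire N concurrent sessions, return p95 TTFA (nearest-rank)."""
    results = await asyncio.gather(
        *(measure_session(provider, "Hello, how can I help?")
          for _ in range(concurrency))
    )
    ttfas = sorted(r["ttfa_ms"] for r in results)
    return ttfas[max(0, int(len(ttfas) * 0.95) - 1)]

# Ramp concurrency as in step 3:
# for n in (1, 10, 100, 500):
#     print(n, asyncio.run(load_test("lmnt", n)))
```

Run the same harness against both providers at each concurrency level and keep the raw per-chunk timestamps, not just the summary numbers.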

Common Mistakes to Avoid

  • Only looking at average latency:
    A provider can show a great p50 TTFA while having a bad p95 tail. For real users, p95 is what determines how often your app “feels slow.” Always compare p95, not just mean or median.

  • Testing in a “lab” that doesn’t match your production path:
    Latency through a local script over gigabit fiber is not the same as your real deployment path (LLM → orchestrator → TTS → browser/mobile). Put the TTS providers behind the same network stack and runtime you’ll use in production before you trust the numbers.

Real-World Example

Say you’re building a customer support voice agent that runs in the browser: the LLM streams text over WebSockets, your backend forwards text to TTS, and audio streams back to the client.

You spin up a 48-hour test:

  • Scenario: 200 concurrent users, each with a back-and-forth conversation where the agent replies ~5–10 times per minute.
  • Metrics: you log TTFA and per-chunk timestamps for both LMNT and ElevenLabs.

What you see:

  • LMNT:

    • p95 TTFA stays in the 150–200ms window it’s built for, even as you scale load.
    • Jitter stays low enough that playback feels continuous, with only occasional minor variance during traffic spikes.
    • No sudden p95 blowups as concurrency increases; no hidden concurrency caps that force backoff logic.
  • ElevenLabs:

    • Median latency looks acceptable, but you start to see p95 TTFA drift upward as you cross certain concurrency thresholds.
    • Jitter increases, especially on longer utterances, causing occasional micro-pauses in playback.
    • You have to add more buffering and backpressure logic client-side, which erodes the responsiveness of your agent.

You might still be able to ship with either provider—but the amount of buffering, prefetch, and retry logic you need on ElevenLabs is higher. With LMNT, you can lean into “talk while thinking” designs and tighter interrupt handling because the TTFA and jitter remain predictable.

Pro Tip: Don’t just A/B listen—record the raw telemetry. Build a small dashboard that shows p50/p90/p95 TTFA and jitter over time, per provider, and compare side-by-side. Your ears will catch the obvious issues; your metrics will catch the subtle ones that blow up at 3× scale.
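One row of such a dashboard can be as simple as the sketch below (`summarize` is a hypothetical helper; feed it whatever TTFA samples your telemetry logs per provider):

```python
def summarize(name: str, ttfas_ms: list[float]) -> str:
    """Format one comparison row: p50/p90/p95 TTFA for a provider."""
    ordered = sorted(ttfas_ms)

    def pct(q: float) -> float:
        # Nearest-rank percentile on the sorted samples.
        return ordered[max(0, int(len(ordered) * q) - 1)]

    return (f"{name:<12} p50={pct(0.50):6.1f}ms  "
            f"p90={pct(0.90):6.1f}ms  p95={pct(0.95):6.1f}ms")

# Compare side by side:
# print(summarize("lmnt", lmnt_ttfas))
# print(summarize("elevenlabs", elevenlabs_ttfas))
```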

Summary

For real-time streaming TTS, p95 time-to-first-audio and jitter are more important than almost any single “voice quality” checkbox. LMNT is explicitly built for low-latency streaming—150–200ms target, no concurrency or rate limits, and production use across agents, tutors, and games—so its architecture is tuned for the exact benchmarks you care about.

ElevenLabs offers strong voices but doesn’t publish directly comparable, benchmark-ready p95 TTFA or jitter numbers. The practical path is to run instrumented, like-for-like tests in your environment. When teams do that, LMNT’s focus on conversational latency, 24-language support (including mid-sentence switching), and predictable scaling often translates into lower p95 TTFA and smoother streaming, especially as concurrency grows.

Next Step

Get Started