LMNT vs ElevenLabs: which has lower p95 time-to-first-audio and less jitter for real-time streaming TTS?

Most teams don’t realize how much p95 time-to-first-audio (TTFA) and jitter matter until their “real-time” agent starts talking over users or pausing mid-sentence. If you’re comparing LMNT vs ElevenLabs specifically on low-latency streaming TTS, the real question is which stack will keep turn-taking natural at scale, not just in a perfect demo.

Quick Answer: LMNT is purpose-built for real-time, low-jitter streaming, with a typical 150–200ms latency budget for conversational use cases and no concurrency or rate limits. ElevenLabs can sound great, but in practice teams often see higher p95 TTFA and more jitter under real-world load, especially when many concurrent sessions are active or the geographic distance to its serving regions grows.

Why This Matters

For conversational apps, agents, and games, voice isn’t a garnish; it’s the interface. If your p95 TTFA drifts above ~250–300ms or the jitter in audio chunk arrival spikes, users feel it immediately: agents interrupt, players talk over NPCs, and “real-time” demos fall apart once you scale beyond a handful of users.

Choosing a TTS provider with consistently low p95 TTFA and minimal jitter:

  • Protects your turn-taking budget for human–AI conversation.
  • Reduces UX hacks (extra buffering, awkward delays, talk-over detection).
  • Keeps infra simpler and cheaper by avoiding over-engineered workarounds.

Key Benefits:

  • More natural turn-taking: Lower p95 TTFA means responses feel instantaneous and human, even when the model is thinking on the fly.
  • Stable voice sessions at scale: Less jitter means fewer audible stutters, fewer dropouts, and less custom buffering logic in your app.
  • Predictable performance under load: With no rate limits and volume-friendly pricing, LMNT lets you scale concurrent sessions without latency surprises.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| p95 time-to-first-audio (TTFA) | The 95th percentile time from sending text (or partial tokens) to receiving the first audio chunk | p50 can look great in a demo; p95 is what users feel when traffic spikes, networks vary, or models are “thinking harder” |
| Jitter (in streaming TTS) | Variability in the timing of audio chunks arriving over a stream (e.g., WebSocket) | High jitter causes audible stutters, early cutoffs, or long pauses that break immersion and interrupt smooth speech |
| End-to-end latency budget | The total time from user speaking to hearing the agent reply (ASR + LLM + TTS + network) | TTS is the last hop; if its latency and jitter are unstable, the whole stack feels slow or chaotic, regardless of your ASR/LLM speed |
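
The end-to-end budget defined above is a plain sum, so it can be sanity-checked with a few lines of Python. This is a rough sketch: the `TurnBudget` class and every number in it are illustrative assumptions, not measurements of any provider, and adding per-stage p95 values gives only an approximate estimate of the end-to-end tail.

```python
from dataclasses import dataclass


@dataclass
class TurnBudget:
    """Per-stage latency estimates for one conversational turn, in ms."""
    asr_ms: float
    llm_first_token_ms: float
    tts_ttfa_ms: float
    network_ms: float

    def total_ms(self) -> float:
        # End-to-end latency = ASR + LLM + TTS + network
        return (self.asr_ms + self.llm_first_token_ms
                + self.tts_ttfa_ms + self.network_ms)

    def headroom_ms(self, target_ms: float) -> float:
        # Positive: within budget. Negative: over budget by that amount.
        return target_ms - self.total_ms()


# Illustrative (made-up) per-stage figures
budget = TurnBudget(asr_ms=80, llm_first_token_ms=120,
                    tts_ttfa_ms=180, network_ms=40)
print(budget.total_ms())        # 420
print(budget.headroom_ms(400))  # -20
```

With these made-up figures the turn lands 20ms over a 400ms target, which shows how little slack is left once the TTS stage uses its full 150–200ms.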

How It Works (Step-by-Step)

From a product engineering perspective, the real question isn’t just “who is faster?” but “how does each provider behave across the whole tail of latency and under realistic traffic?”

Here’s how to think about LMNT vs ElevenLabs on p95 TTFA and jitter.

  1. Define your latency budget and constraints

    • Start from your target UX: for human-like conversation, you usually want:
      • User speech end → agent first audio: ≤ 300–400ms ideally
      • TTS portion of that: typically ≤ 150–200ms for comfort
    • LMNT explicitly optimizes for this window, with 150–200ms low-latency streaming for conversational apps, agents, and games.
    • When evaluating ElevenLabs, you’ll need to measure this yourself in your region(s) and with your routing, since their marketing doesn’t center on a specific TTFA budget for real-time turn-taking.
  2. Measure p95 TTFA and jitter with streaming

    The only meaningful comparison is streamed audio, not offline generation. A practical benchmarking flow looks like this:

    • Set up streaming clients for both LMNT and ElevenLabs:
      • Use WebSockets or their recommended real-time interfaces.
      • Send short utterances (e.g., 10–50 tokens) to mimic conversational turns.
    • Measure TTFA:
      • Record timestamp at the moment you send the first chunk of text.
      • Record timestamp when the first audio bytes arrive.
      • Compute p50, p90, p95 across hundreds or thousands of requests.
    • Measure jitter:
      • Log inter-arrival times between audio chunks.
      • Compute variance / standard deviation and 95th percentile of chunk spacing.
      • Listen subjectively for choppy, bursty audio vs smooth, constant-rate streaming.

    Where LMNT tends to stand out is how stable the stream feels:

    • Core design goal: conversational-grade low-latency streaming, not just fast offline synthesis.
    • Backbone tuned for consistent chunking so you don’t have to over-buffer to hide jitter.
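
The TTFA and jitter measurements in this step can be summarized with a short script. The sketch below assumes you have already logged, for each request, the send time and the arrival time of every audio chunk (in seconds, e.g. from `time.monotonic()`). The function names and the sample data are illustrative and are not part of either provider's SDK.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def inter_arrival_ms(chunk_times_s):
    """Gaps between consecutive audio chunks, in milliseconds."""
    return [(b - a) * 1000.0 for a, b in zip(chunk_times_s, chunk_times_s[1:])]


def summarize(requests):
    """requests: list of (send_time_s, [chunk_arrival_time_s, ...]) pairs.

    Needs at least one request with two or more chunks to compute jitter.
    """
    ttfas = [(chunks[0] - sent) * 1000.0 for sent, chunks in requests if chunks]
    gaps = [g for _, chunks in requests for g in inter_arrival_ms(chunks)]
    return {
        "ttfa_p50_ms": percentile(ttfas, 50),
        "ttfa_p90_ms": percentile(ttfas, 90),
        "ttfa_p95_ms": percentile(ttfas, 95),
        "gap_stdev_ms": statistics.pstdev(gaps),
        "gap_p95_ms": percentile(gaps, 95),
    }


# Two synthetic requests; real runs should use hundreds or thousands.
samples = [
    (0.0, [0.10, 0.15, 0.20]),
    (1.0, [1.20, 1.25, 1.35]),
]
stats = summarize(samples)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Nearest-rank percentiles are used here because they always return an observed sample, which makes outliers easy to trace back to a specific request.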
  3. Stress-test concurrency, geography, and real usage patterns

    Performance at 5 concurrent users in a single region is not the same as 500 users across three continents.

    For a fair LMNT vs ElevenLabs comparison:

    • Push concurrency:
      • Spin up 50–500 simultaneous streaming sessions.
      • Mix short and medium-long utterances the way a real agent or NPC would speak.
      • Monitor how p95 TTFA shifts as concurrency grows.
      • Note that LMNT explicitly offers no concurrency or rate limits, which reduces the need to pace requests or shard accounts as you scale.
    • Test multiple regions:
      • Run clients from different geos (e.g., US, EU, APAC).
      • Compare the increase in p95 TTFA and jitter per region for both providers.
    • Simulate live agent behavior:
      • Pipe partial LLM outputs directly into TTS streams.
      • Focus on turn boundaries: does the stream start quickly and stay smooth when text arrives incrementally?

    In practice, teams building with LMNT report that the low-latency characteristics are stable enough to keep the rest of their stack simple: less need for heavy pre-buffering or “wait until we have the whole sentence” hacks.
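
A concurrency sweep like the one described in this step might be structured as follows. This is a sketch only: `simulated_tts_stream` is a hypothetical stand-in that fakes a stream with random delays, so the script runs without network access. To benchmark for real, you would replace it with an async generator that wraps each provider's own streaming client.

```python
import asyncio
import math
import random
import time


async def simulated_tts_stream(text, rng):
    """Hypothetical stand-in for a provider's streaming client (not a real API)."""
    await asyncio.sleep(rng.uniform(0.005, 0.020))  # simulated time to first chunk
    for _ in range(3):
        yield b"\x00" * 320  # fake audio bytes
        await asyncio.sleep(0.002)


async def measure_ttfa_ms(stream_factory, text):
    """Time from opening the stream to the first audio chunk, in milliseconds."""
    start = time.monotonic()
    ttfa = None
    async for _chunk in stream_factory(text):
        if ttfa is None:
            ttfa = (time.monotonic() - start) * 1000.0
    return ttfa


async def run_level(stream_factory, text, sessions):
    """Start `sessions` streams at once and collect one TTFA per session."""
    tasks = [measure_ttfa_ms(stream_factory, text) for _ in range(sessions)]
    return await asyncio.gather(*tasks)


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(95 * len(ordered) / 100)) - 1]


async def sweep(levels):
    """Return {concurrency level: p95 TTFA in ms} for each level."""
    rng = random.Random(0)

    def factory(text):
        return simulated_tts_stream(text, rng)

    results = {}
    for sessions in levels:
        ttfas = await run_level(factory, "How can I help you today?", sessions)
        results[sessions] = p95(ttfas)
    return results


results = asyncio.run(sweep([5, 50]))
for sessions, value in results.items():
    print(f"{sessions} sessions: p95 TTFA = {value:.1f} ms")
```

Running the same sweep from clients in several regions, and plotting p95 TTFA against the session count, gives a direct view of how each provider's tail latency moves under load.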

Common Mistakes to Avoid

  • Benchmarking only p50 TTFA:
    It’s easy to get excited about a 100–120ms median TTFA, but tails are what hurt real users. Always look at p95 and above, and inspect the full distribution under load. LMNT’s 150–200ms guidance is intentionally framed as an operational range, not a best-case figure.

  • Testing offline TTS instead of streaming:
    Some providers shine when they can generate the whole waveform before playback, but degrade when forced into real-time streaming. If your use case is agents, games, or live tutoring, design your tests around streaming-only paths.

Real-World Example

Let’s say you’re shipping a browser-based tutoring agent that talks in real time:

  • Stack: WebRTC mic input → ASR → LLM → streaming TTS → browser audio output.
  • Target: The tutor should start speaking within ~300ms after the student finishes a question, and voice should feel as smooth as a video call.

You test two prototypes:

  • Prototype A (LMNT):

    • From ASR output → LMNT streaming TTS → first audio in browser: ~150–200ms, even at p95, with 200 students online.
    • Audio chunks arrive at a near-constant cadence; you can start playback almost immediately without heavy buffering.
    • No concurrency or rate limits, so you don’t have to implement complex traffic shaping as your daily active users grow.
  • Prototype B (ElevenLabs):

    • Medians look similar, but during peak classroom times your p95 TTFA creeps past 300–400ms and occasionally spikes higher.
    • Chunk timing is less predictable; you end up adding extra buffering (e.g., 250ms of audio) to hide jitter, which makes the system feel slower even when the model is fast.

Students won’t tell you “your p95 TTFA and jitter are too high.” They’ll say:

  • “The tutor keeps pausing.”
  • “Sometimes it talks over me.”
  • “Feels laggy compared to a real person.”

With LMNT, the streaming behavior is tuned to keep these complaints off your backlog.

Pro Tip: When you test LMNT, wire it directly into a simple agent demo first—then gradually remove buffering and UX “safety pads” until you hit the minimum that still feels smooth. That’s usually the strongest signal that your p95 TTFA and jitter are low enough for production.
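
One way to put a number on that minimum is to compute it offline from logged chunk arrival times. The sketch below assumes fixed-duration audio chunks and arrival times in milliseconds; `min_prebuffer_ms` is an illustrative helper, not a function from any SDK.

```python
def min_prebuffer_ms(arrival_ms, chunk_duration_ms):
    """Smallest delay after the first chunk arrives that avoids any underrun.

    If playback starts `delay` ms after the first chunk, chunk i is due at
    delay + i * chunk_duration_ms (relative to the first arrival), so every
    chunk must arrive no later than that.
    """
    if not arrival_ms:
        return 0.0
    first = arrival_ms[0]
    needed = 0.0
    for i, arrival in enumerate(arrival_ms):
        due_offset = i * chunk_duration_ms      # when chunk i is needed
        lateness = (arrival - first) - due_offset
        needed = max(needed, lateness)
    return needed


# Smooth stream: 20ms chunks arriving every 20ms need no extra buffering.
print(min_prebuffer_ms([0, 20, 40, 60], 20))        # 0.0

# One late burst: the third chunk arrives 50ms after it was due.
print(min_prebuffer_ms([0, 20, 90, 100, 110], 20))  # 50.0
```

Taking the p95 of this value across many logged streams gives a buffering target grounded in measured jitter rather than guesswork.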

Summary

If your goal is offline content or long-form narration, a small difference in TTFA may not matter much. But for real-time streaming TTS—the kind used in conversational apps, agents, and games—p95 TTFA and jitter define the product.

  • LMNT openly optimizes for 150–200ms low-latency streaming, with no concurrency or rate limits and a track record with interactive products (Khan Academy, HeyGen, Vapi, Unity, Replit, and more).
  • In many real-world tests, teams find that LMNT’s p95 TTFA stays closer to its median and jitter is low enough to keep streams smooth without heavy buffering logic.
  • ElevenLabs can sound great, but you’ll want to benchmark it under your real concurrency and geographic mix to see if its tail latency and jitter stay within your conversational budget.

If your product lives or dies on real-time interaction rather than static audio files, LMNT’s focus on low-latency streaming and stable delivery is usually the safer choice.
