What’s an acceptable end-to-end latency budget for turn-taking in a real-time voice assistant?

Quick Answer: For natural turn-taking in a real-time voice assistant, you should target an end-to-end latency budget of ~300–700 ms from user end-of-speech to the assistant’s first audible response. Under ~300 ms feels snappy and human-like; beyond ~1 second starts to feel broken or untrustworthy in live use.

Why This Matters

Turn-taking latency is the difference between “feels like talking to a person” and “feels like waiting on hold.” In a real-time voice assistant, the total delay from when a user stops speaking to when they hear the assistant reply determines whether they’ll keep talking, interrupt, or abandon the experience entirely. If you don’t set and enforce a clear latency budget across ASR, LLM, and TTS, your assistant might sound impressive in demos but fail in production conversations.

Key Benefits:

  • Higher conversational trust: A tight latency budget makes the assistant feel responsive enough that users keep talking instead of dropping back to text or tapping buttons.
  • More natural turn-taking: When the system replies within sub-second windows, human habits like barge-in, backchanneling (“yeah”, “uh-huh”), and corrections work as expected.
  • Predictable performance at scale: A concrete budget lets you choose vendors, architectures, and timeouts that maintain responsiveness even under load.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| End-to-end latency budget | The maximum total time you allow from user end-of-speech to the first synthesized audio frame reaching the user. | Gives your engineering team a hard target to design around and test against. |
| Turn-taking latency | The specific slice of latency that affects when one speaker (the assistant) can start talking after the other (the user) stops. | Directly shapes perceived responsiveness and “is this thing listening?” moments. |
| Component breakdown | Latency allocated across capture, ASR, LLM/agent reasoning, TTS, and network. | Enables you to diagnose bottlenecks and choose tools (like low-latency TTS) that keep you inside budget. |

How It Works (Step-by-Step)

Before you pick numbers, it helps to define the full chain. In a typical real-time voice assistant, the turn-taking path looks like this:

  1. User finishes speaking → Voice activity detection (VAD):

    • Detect that the user stopped talking (end-of-speech).
    • Typical target: 50–150 ms after actual speech end, depending on aggressiveness.
  2. Streaming ASR + LLM reasoning:

    • ASR streams text partials while the user is speaking; by turn end, you already have most of the transcript.
    • LLM/agent starts generating tokens as soon as it has enough context (often before the full utterance is complete, if your pipeline can act on partial transcripts).
    • Targets:
      • ASR finalization: 50–200 ms after VAD end.
      • First LLM token: 50–150 ms after ASR has enough text.
  3. Streaming TTS + delivery to user:

    • TTS turns initial tokens into audio and starts streaming them back.
    • This is where a low-latency TTS like LMNT matters: 150–200 ms streaming from text to first audio frame lets you keep end-to-end latency in the conversational range, even with network overhead.
    • Network + playback buffering: account for 50–150 ms in realistic consumer conditions.

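Add these components up and the path takes roughly 350–850 ms when the stages run back to back, so the upper end already overshoots a 700 ms target. That is how quickly “snappy” becomes “laggy” if a single layer spikes.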
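
As a quick check on the arithmetic, the sketch below sums the per-stage ranges listed in the steps above. It assumes the stages run strictly one after another with no overlap, which is a simplification (streaming stages overlap in practice), and the dictionary keys are illustrative names, not part of any API.

```python
# Per-stage latency ranges in milliseconds, copied from the steps above.
# Stage names are illustrative labels only.
STAGE_RANGES_MS = {
    "vad_end_of_speech": (50, 150),
    "asr_finalization": (50, 200),
    "llm_first_token": (50, 150),
    "tts_first_audio": (150, 200),
    "network_playback": (50, 150),
}


def total_range(stages):
    """Return (best_case, worst_case) assuming stages run strictly in sequence."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_range(STAGE_RANGES_MS)
print(f"best case: {best} ms, worst case: {worst} ms")
# prints: best case: 350 ms, worst case: 850 ms
```

The worst case lands above 700 ms even though every stage is inside its own "typical" range, which is why the budget has to be set end to end rather than per component.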

Suggested latency budgets by experience type

Think in tiers, based on how “live” the experience feels:

  • Tier 1 – Conversational agents, games, and live support

    • Target: 300–700 ms end-to-end turn-taking latency.
    • Hard ceiling: ~1,000 ms before people start double-checking if they were heard.
    • Example stack:
      • VAD detection: 75 ms
      • ASR finalization: 125 ms
      • LLM first token: 100 ms
      • TTS first audio: 150–200 ms (LMNT-style low-latency streaming)
      • Network/playback: 75–150 ms
      • Total: 525–650 ms, inside the 300–700 ms target
  • Tier 2 – Tutors, narrators, and semi-interactive tools

    • Target: 700–1,500 ms.
    • Users will tolerate slightly more delay if they expect “thinking time” (e.g., complex explanations), but you still want the first phoneme fast, even if the rest streams in.
  • Tier 3 – Non-interactive or batch voice

    • Target: no fixed turn-taking target; latency matters less than throughput, cost, or quality.
    • Turn-taking budget mostly doesn’t apply here.
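
One way to make the tiers testable is to encode their limits and check each measured turn against them. The following is a minimal sketch: the tier keys (`conversational`, `semi_interactive`, `batch`) and the `check_turn` helper are hypothetical names, and `None` marks a limit the tiers above do not specify.

```python
# Upper end of the target range and hard ceiling (ms) per tier, from the list above.
# None means the tier description gives no number for that limit.
TIERS = {
    "conversational": {"target_max_ms": 700, "ceiling_ms": 1000},
    "semi_interactive": {"target_max_ms": 1500, "ceiling_ms": None},
    "batch": {"target_max_ms": None, "ceiling_ms": None},
}


def check_turn(tier: str, latency_ms: float) -> str:
    """Classify one measured end-to-end turn latency against a tier's limits."""
    limits = TIERS[tier]
    ceiling = limits["ceiling_ms"]
    target = limits["target_max_ms"]
    if ceiling is not None and latency_ms > ceiling:
        return "over ceiling"
    if target is not None and latency_ms > target:
        return "over target"
    return "ok"


print(check_turn("conversational", 650))     # ok
print(check_turn("conversational", 850))     # over target
print(check_turn("conversational", 1200))    # over ceiling
print(check_turn("semi_interactive", 1200))  # ok
```

A check like this can run in a load test or as an alert rule, so a regression in any one layer shows up as a tier violation rather than as a vague "feels slow" report.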

How to carve your actual budget

Start from a top-down number and assign slices:

  • Pick a global budget (e.g., 600 ms).
  • Reserve ~200 ms for TTS time-to-first-audio if you’re using a low-latency engine like LMNT (150–200 ms).
  • Allocate ~200–250 ms to ASR + VAD.
  • Give ~100–150 ms to your LLM or agent.
  • Use the remaining headroom (~50 ms in this example) for network jitter and playback buffering.

That yields a budget like:

  • 100 ms VAD
  • 150 ms ASR
  • 100 ms LLM
  • 200 ms TTS
  • 50 ms network jitter buffer

Total: 100 + 150 + 100 + 200 + 50 = 600 ms end-to-end

You’ll adjust per geography, device type, and network conditions, but this framework keeps everyone aligned.
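
The same allocation can live in code so that it is validated automatically and compared against measurements. This is a sketch under assumptions: the stage names, the `validate_budget` and `over_budget_stages` helpers, and the sample measurements are all made up for illustration.

```python
BUDGET_MS = 600

# The example allocation from the list above (stage names are illustrative).
ALLOCATION_MS = {
    "vad": 100,
    "asr": 150,
    "llm": 100,
    "tts": 200,
    "network_jitter": 50,
}


def validate_budget(allocation, budget_ms):
    """Return (total_allocated, remaining_headroom) in milliseconds."""
    total = sum(allocation.values())
    return total, budget_ms - total


def over_budget_stages(measured_ms, allocation):
    """Map each stage that exceeded its slice to the overshoot in milliseconds."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in allocation.items()
        if stage in measured_ms and measured_ms[stage] > limit
    }


total, headroom = validate_budget(ALLOCATION_MS, BUDGET_MS)
print(total, headroom)  # 600 0

# Hypothetical measurements for one turn, not real data.
sample_turn = {"vad": 80, "asr": 120, "llm": 100, "tts": 260, "network_jitter": 75}
print(over_budget_stages(sample_turn, ALLOCATION_MS))
# {'tts': 60, 'network_jitter': 25}
```

Keeping per-region or per-device variants of `ALLOCATION_MS` is one way to handle the adjustments mentioned above without losing the shared top-level number.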

Common Mistakes to Avoid

  • Over-optimizing one layer and ignoring the rest:

    • Teams sometimes obsess over LLM speed while running high-latency TTS or ASR. A super-fast model won’t save you if your TTS adds 700 ms.
    • Fix: set a concrete end-to-end budget first, then pick components that can all live inside it. Low-latency TTS (150–200 ms) is often the easiest big win.
  • Measuring latency from the server’s point of view only:

    • Server-to-server numbers might look fine while user-perceived latency is 2x higher thanks to capture, playback, and jitter buffers.
    • Fix: measure from microphone to speaker on real devices, on real networks. Include browser or mobile app overhead, not just backend timings.
  • Waiting for full text before doing anything:

    • If your LLM and TTS both wait for the entire user utterance to finalize, you’re throwing away precious hundreds of milliseconds.
    • Fix: support streaming all the way down—ASR streaming partials, LLM streaming tokens, and TTS streaming audio as soon as it has usable text.
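
To illustrate the "streaming all the way down" fix, here is a minimal sketch of the glue between a token-streaming LLM and a streaming TTS: it forwards text at the first clause boundary instead of waiting for the full reply. The `chunk_for_tts` function, the punctuation set, and the `min_chars` threshold are assumptions for illustration; a real TTS client would consume the yielded chunks.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to cut text for synthesis (an assumption).
CLAUSE_ENDINGS = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable chunks as soon as a clause boundary arrives.

    Tokens are accumulated until the buffer is at least `min_chars` long and
    ends with clause punctuation; leftover text is flushed when the stream ends.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        candidate = buffer.strip()
        if len(candidate) >= min_chars and candidate.endswith(CLAUSE_ENDINGS):
            yield candidate
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Great", " question", ".", " For", " a", " 2018", " Civic", ",",
              " it", " depends", "."]
for chunk in chunk_for_tts(llm_tokens):
    print(repr(chunk))
# 'Great question.'
# 'For a 2018 Civic,'
# 'it depends.'
```

The first chunk is ready after three tokens, so synthesis can begin while the model is still generating. This simple rule will also split on decimals or abbreviations, so a production chunker would need a smarter boundary check.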

Real-World Example

Imagine you’re building a real-time voice agent for a car dealership—something like a virtual “Big Tony” that talks customers through trade-in options while they browse. The experience needs to feel like talking to a salesperson, not filling out a form.

You set a 600 ms end-to-end latency budget for turn-taking. In practice:

  • When the user finishes a sentence (“What can I get for a 2018 Civic?”), your VAD fires within ~80 ms.
  • Your ASR has been streaming partials and emits a final transcript 120 ms later.
  • Your LLM, already primed with context, streams back the first token in another 80–100 ms.
  • LMNT’s low-latency TTS converts the initial chunk of text into audio and starts streaming it out within 150–200 ms.
  • A small playback buffer and network jitter add ~75 ms.

Net result: the user hears “Great question. For a 2018 Civic…” roughly half a second after they stop talking (the stages above sum to about 505–575 ms, inside the 600 ms budget). They don’t need to repeat themselves, they feel heard, and they keep conversing. On the backend, you stay within budget even as concurrent sessions spike, because the TTS layer imposes no concurrency or rate limits and its latency stays predictable.

Pro Tip: Don’t just log averages—log p95 and p99 end-to-end latency per turn, with component attribution (VAD, ASR, LLM, TTS, network). Most user frustration hides in the tail; you want your 95th percentile still under your “feels good” threshold.
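
A minimal sketch of that tail-latency report follows, using a nearest-rank percentile over per-turn logs. The log shape (one dict of stage timings per turn), the stage names, and the `percentile` and `summarize` helpers are assumptions for illustration, and the two sample turns are made-up numbers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(turns, pcts=(50, 95, 99)):
    """Report percentiles per stage and for the end-to-end total.

    `turns` is a list of dicts mapping stage name -> latency in ms for one turn;
    every turn is assumed to contain the same stages.
    """
    report = {}
    for stage in turns[0]:
        samples = [turn[stage] for turn in turns]
        report[stage] = {f"p{p}": percentile(samples, p) for p in pcts}
    totals = [sum(turn.values()) for turn in turns]
    report["end_to_end"] = {f"p{p}": percentile(totals, p) for p in pcts}
    return report


# Hypothetical per-turn logs, not real measurements.
turns = [
    {"vad": 80, "asr": 120, "llm": 90, "tts": 180, "network": 75},
    {"vad": 100, "asr": 150, "llm": 100, "tts": 200, "network": 50},
]
report = summarize(turns)
print(report["end_to_end"])  # {'p50': 545, 'p95': 600, 'p99': 600}
```

With real traffic you would feed in thousands of turns; comparing the per-stage p95 values against the end-to-end p95 shows which layer dominates the tail.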

Summary

For a real-time voice assistant, an acceptable end-to-end latency budget for turn-taking is typically 300–700 ms, with an absolute ceiling of ~1 second before the experience feels unreliable. The key is to treat that number as a hard constraint, then work backwards: allocate budgets across VAD, ASR, LLM, low-latency TTS (e.g., 150–200 ms streaming from LMNT), and network. When you design for streaming at every layer and measure from mic to speaker—not just server-to-server—you can ship assistants, agents, and games that actually feel conversational, not just impressive in a scripted demo.
