WebSocket vs WebRTC for realtime voice agents — which one should we use for full-duplex audio and interruptions?

Quick Answer: Use WebRTC when you need browser-native, ultra-low-latency, bidirectional audio (think: game voice chat, shared 3D worlds, or user-to-user calls). Use WebSocket when you’re primarily streaming audio and text between your app and an AI agent, especially if you’re already using LLMs and tools. In practice, most production voice agents either go WebSocket-only or use WebRTC at the edge and WebSocket to talk to the AI backend.

Why This Matters

If you pick the wrong transport for realtime voice, you pay for it in user experience: awkward half-second gaps, clipped interruptions, and agents that can’t smoothly talk and listen at the same time. That’s where full-duplex streaming and clean interruption handling matter. The WebSocket vs WebRTC decision directly affects your first-audio latency, how easily you integrate LLMs and tools, and whether you can debug and scale the system without a rewrite.

Key Benefits:

  • WebSocket for AI-centric flows: Simple, firewall-friendly, and ideal when the main job is streaming text/audio between your app and an AI agent or router.
  • WebRTC for media-centric UX: Built for peer-to-peer, low-latency audio/video; great when you care about jitter, network resiliency, and local echo control.
  • Hybrid for production stacks: WebRTC at the edge for user audio, WebSocket to Inworld’s Realtime API and Router for LLM reasoning, tool calls, and TTS—so you get both UX quality and routing control.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
|---|---|---|
| Full-duplex audio | Ability to send and receive audio simultaneously over the same connection. | Required for natural overlap, backchanneling (“mm-hmm”), and mid-sentence interruptions without cutting the user off. |
| Turn detection & interruptions | Logic that decides when the agent should speak, listen, or pause, and how to handle barge-in. | Determines how “conversational” your agent feels; bad turn taking leads to people talking over the bot or being ignored. |
| Transport choice (WebSocket vs WebRTC) | The protocol and connection style you use to carry audio/text between the user and the AI backend. | Impacts latency, complexity, device support, and how easily you can integrate LLMs, tools, and routing layers like Inworld’s. |

How It Works (Step-by-Step)

At a high level, a realtime voice agent has to:

  1. Capture user audio.
  2. Stream that audio to an AI backend for STT + reasoning.
  3. Stream TTS audio back while still listening for interrupts.
  4. Handle tool calls, context updates, and routing decisions in the middle of all that.

The transport choice is about how you carry those streams.
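
The four steps above hinge on sending and receiving at the same time. A minimal sketch of that full-duplex shape, assuming in-memory asyncio queues as a stand-in for the real transport and a fake backend that echoes every frame (all function names here are hypothetical):

```python
import asyncio

async def send_mic(mic_frames, uplink):
    """Push captured mic frames upstream, then signal end of input."""
    for frame in mic_frames:
        await uplink.put(frame)
        await asyncio.sleep(0)  # yield so the receive side can run in between
    await uplink.put(None)

async def play_agent(downlink, played):
    """Consume agent audio as it arrives, without waiting for the mic to stop."""
    while (chunk := await downlink.get()) is not None:
        played.append(chunk)

async def fake_backend(uplink, downlink):
    """Stand-in for STT + LLM + TTS: answers each mic frame with one audio chunk."""
    while (frame := await uplink.get()) is not None:
        await downlink.put(b"tts:" + frame)
    await downlink.put(None)

async def run_session(mic_frames):
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    played = []
    # All three tasks run concurrently: that concurrency is the full-duplex part.
    await asyncio.gather(
        send_mic(mic_frames, uplink),
        play_agent(downlink, played),
        fake_backend(uplink, downlink),
    )
    return played

if __name__ == "__main__":
    print(asyncio.run(run_session([b"f1", b"f2", b"f3"])))
```

Whichever transport you pick, the client usually ends up with this shape: one task that sends, one that receives, and neither blocks the other.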

1. WebSocket: Streaming AI and audio over a single pipe

With a WebSocket approach:

  1. Open one persistent connection

    • Browser or client connects over WSS to Inworld’s Realtime API.
    • Same socket carries user audio up and agent audio down (full-duplex).
  2. Stream audio + events

    • User mic audio is encoded (often Opus or PCM) and sent as binary frames.
    • Inworld’s STT ingests that audio, runs semantic & acoustic VAD, and emits text events.
    • LLM Router selects the right model (OpenAI, Anthropic, etc.) based on metadata like language, country, plan, intent, or tier.
    • TTS-1.5 Mini/Max generates audio, which is returned as streaming chunks over the same WebSocket, so playback can start before the full response has been synthesized.
  3. Turn detection & barge-in

    • Inworld’s intelligent turn detection tracks whether the user is speaking and how aggressive the agent can be about starting or stopping speech.
    • When the user speaks mid-response, the runtime can cut or attenuate TTS and prioritize user audio—without closing the socket.
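
The "cut or attenuate" behavior described above can be pictured as a small state machine. This is a sketch of the general idea only, not Inworld's implementation: the class, the thresholds, and the assumption that a VAD delivers one speech/no-speech flag per frame are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class AgentState(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"

@dataclass
class BargeInController:
    # Hypothetical knobs: consecutive speech frames needed before interrupting,
    # and the gain applied to TTS while we are still deciding.
    frames_to_interrupt: int = 3
    duck_gain: float = 0.3
    state: AgentState = AgentState.LISTENING
    speech_run: int = 0

    def start_speaking(self) -> None:
        self.state = AgentState.SPEAKING
        self.speech_run = 0

    def on_mic_frame(self, user_is_speaking: bool) -> float:
        """Return the gain to apply to outgoing TTS for this frame (0.0 = cut)."""
        if self.state is not AgentState.SPEAKING:
            return 0.0  # agent is not talking, nothing to play
        self.speech_run = self.speech_run + 1 if user_is_speaking else 0
        if self.speech_run >= self.frames_to_interrupt:
            self.state = AgentState.LISTENING  # confirmed barge-in: stop TTS
            return 0.0
        # Unconfirmed speech ducks the TTS; silence restores full volume.
        return self.duck_gain if self.speech_run else 1.0
```

Requiring a few consecutive speech frames before cutting is one way to avoid stopping on a cough or a short "mm-hmm"; the right threshold depends on your frame size and audience.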

Why teams pick WebSocket:

  • Same infrastructure as text LLMs and tools.
  • Works almost anywhere HTTPS does (it runs over TCP, typically on port 443); no NAT traversal dance.
  • Easier to log, replay, A/B test, and route via Inworld’s provider-agnostic Router.
  • Ideal when “AI brain + tools + TTS + STT” is the real product, not user-to-user calling.

2. WebRTC: Media-optimized, browser-native audio

With a WebRTC approach:

  1. Establish a media session

    • Browser uses RTCPeerConnection to negotiate codecs and media params.
    • Signaling (usually over WebSocket) exchanges SDP offers/answers and ICE candidates; STUN/TURN servers are configured on the peer connection to help with NAT traversal.
  2. Stream audio over SRTP

    • User mic audio goes to a media server or directly to a peer.
    • The AI backend either runs inside that media server or receives a transcoded stream (often via a bridge).
  3. Handle full-duplex and interruptions at the media layer

    • WebRTC handles jitter, packet loss, echo cancellation, and lip-sync (if video).
    • Your app still needs logic for STT, LLM, tools, and TTS, but media quality is best-in-class.
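
To make "handles jitter" concrete, here is a deliberately simplified jitter buffer. Real WebRTC stacks use adaptive buffers with loss concealment; this sketch only reorders packets by sequence number and drops late arrivals, and the class and its fixed `depth` parameter are illustrative assumptions.

```python
import heapq

class JitterBuffer:
    """Reorders packets by sequence number and holds a few back before playout."""

    def __init__(self, depth: int = 2):
        self.depth = depth      # packets to keep buffered before releasing any
        self._heap = []         # min-heap of (sequence_number, payload)
        self._next_seq = None   # first sequence number not yet released

    def push(self, seq: int, payload: bytes) -> None:
        if self._next_seq is not None and seq < self._next_seq:
            return  # arrived too late: its slot was already released
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self) -> list:
        """Release packets in sequence order while more than `depth` are held."""
        out = []
        while len(self._heap) > self.depth:
            seq, payload = heapq.heappop(self._heap)
            self._next_seq = seq + 1
            out.append((seq, payload))
        return out
```

A deeper buffer tolerates more reordering but adds delay, which is the trade-off a WebSocket-only client has to manage itself when it plays audio chunks as they arrive.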

Why teams pick WebRTC:

  • Optimized for audio quality and latency: typically UDP-based transport, with jitter buffering and packet-loss handling built in.
  • Native in browsers and mobile (via SDKs); great for game and social voice.
  • Best when you also need user-to-user calls or 3D world spatial audio.

3. Hybrid: WebRTC at the edge, WebSocket to the AI

For production-grade voice agents, the pattern I see most often:

  • User ↔ WebRTC ↔ Media Gateway
  • Media Gateway ↔ WebSocket ↔ Inworld Realtime API

This gives you:

  • WebRTC benefits (echo cancellation, jitter handling, device support).
  • WebSocket benefits on the backend:
    • Full-duplex audio streaming with Inworld.
    • Router-level model selection (no latency added) based on metadata.
    • Tool calling mid-session and dynamic context management.

You don’t have to choose one or the other; you segment concerns:

  • WebRTC: “Make audio capture and playback rock-solid.”
  • WebSocket: “Talk to the AI agent runtime and routing layer cleanly.”
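
One small job the media gateway in this split often ends up doing is re-framing: collecting short frames from the WebRTC leg into larger messages for the WebSocket leg. A rough sketch, assuming the gateway has already decoded to 16 kHz, 16-bit mono PCM and that 100 ms messages suit the backend (both are assumptions, not documented requirements):

```python
SAMPLE_RATE_HZ = 16_000  # assumed: audio already decoded and resampled
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM

def frame_bytes(duration_ms: int) -> int:
    """Size in bytes of a PCM frame of the given duration."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE

class Reframer:
    """Collects small media-leg frames into fixed-size messages for the backend leg."""

    def __init__(self, out_ms: int = 100):
        self.out_size = frame_bytes(out_ms)
        self._buf = bytearray()

    def feed(self, pcm: bytes) -> list:
        """Add one incoming frame; return any complete outgoing messages."""
        self._buf.extend(pcm)
        messages = []
        while len(self._buf) >= self.out_size:
            messages.append(bytes(self._buf[:self.out_size]))
            del self._buf[:self.out_size]
        return messages
```

Larger outgoing messages mean fewer frames on the socket but more added delay, so the batch size is a latency knob worth measuring rather than guessing.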

Common Mistakes to Avoid

  • Treating WebRTC as mandatory for every voice agent:
    If your core interaction is user ↔ AI (not user ↔ user), a pure WebSocket integration with Inworld’s Realtime API is often simpler, easier to debug, and good enough on latency for most apps. Don’t add an ICE/STUN/TURN stack unless you actually need it.

  • Using REST for live conversation:
    A non-streaming REST TTS call makes you wait for the full audio file before you can play anything, which can add hundreds of milliseconds of dead air and gets worse as replies get longer. For realtime voice agents, use WebSocket TTS so audio starts streaming as soon as it’s synthesized.

  • Ignoring turn taking until late:
    Full-duplex by itself doesn’t guarantee good interruptions. You need explicit turn detection settings, barge-in behavior, and policies for when the agent can cut itself off. Inworld exposes turn detection controls; wire those up early instead of hardcoding naive timers.
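
To put a rough number on the REST mistake above, the arithmetic can be sketched as follows. The figures are made up for illustration (50 ms round trip, 100 ms of synthesis per second of audio, a 4-second reply, a 200 ms first chunk) and are not measurements of any real service.

```python
def time_to_first_audio_ms(rtt_ms: int, synth_ms_per_audio_s: int,
                           audio_ms_before_playback: int) -> int:
    """Network round trip plus synthesis time for the audio needed before playback."""
    return rtt_ms + synth_ms_per_audio_s * audio_ms_before_playback // 1000

# Non-streaming: the whole 4-second reply must be synthesized first.
rest_ms = time_to_first_audio_ms(50, 100, 4000)

# Streaming: only the first 200 ms chunk must be synthesized first.
streaming_ms = time_to_first_audio_ms(50, 100, 200)

print(rest_ms, streaming_ms)  # prints "450 70"
```

Under these assumptions the non-streaming delay grows with reply length while the streaming delay stays flat, which is why long answers feel so much worse over a request/response API.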

Real-World Example

A team building a cross-platform companion app asked this exact question: WebSocket vs WebRTC for realtime voice agents—what should they use for full-duplex audio and interruptions?

Their constraints:

  • iOS, Android, and web clients.
  • Sub-250ms P90 end-to-end latency for “no dead air.”
  • AI brain using multiple models (OpenAI + Anthropic) for different intents.
  • Need to handle barge-in: users often interrupt mid-sentence.

They considered raw WebRTC end-to-end, but that meant:

  • Building or buying a media server.
  • Implementing STT, LLM routing, TTS, and tool-calling logic on top.
  • Inventing their own turn detection and analytics.

Instead, they went with:

  • WebSocket directly to Inworld’s Realtime API from all platforms.
  • Inworld’s STT with semantic & acoustic VAD for robust turn detection.
  • TTS-1.5 Max for primary voices and Mini for low-cost tiers.
  • Inworld Router to route LLM calls by metadata:
    • plan=premium → higher-end models
    • language=ja → models tuned for Japanese
    • intent=smalltalk → cheaper, chattier model
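
Metadata routing of this kind boils down to an ordered rule table. The sketch below shows the idea in plain Python; it is not Inworld Router's configuration format, and the model names are placeholders rather than real model IDs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    key: str
    value: str
    model: str

# Hypothetical rule table mirroring the three examples above.
RULES = [
    Rule("plan", "premium", "high-end-model"),
    Rule("language", "ja", "japanese-tuned-model"),
    Rule("intent", "smalltalk", "cheap-chatty-model"),
]
DEFAULT_MODEL = "default-model"

def pick_model(metadata: dict, rules=RULES, default=DEFAULT_MODEL) -> str:
    """Return the model of the first matching rule, so list order encodes priority."""
    for rule in rules:
        if metadata.get(rule.key) == rule.value:
            return rule.model
    return default
```

For example, `pick_model({"plan": "premium", "language": "ja"})` returns `"high-end-model"` because the plan rule comes first. Keeping rules as data rather than code is what makes configuration-only changes possible.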

Result:

  • Sub-200ms P90 first-audio in production.
  • Clean interruption behavior: when users talk, TTS is cut or ducked, and STT takes priority.
  • The team can change model behavior and routing rules without code changes, just configuration—no redeploys.
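
A P90 figure like the one above only means something if you compute it from per-turn samples. A minimal nearest-rank sketch, with invented sample values standing in for real first-audio measurements:

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical first-audio latencies in ms, one per turn; note the 900 ms outlier.
first_audio_ms = [120, 150, 180, 140, 900, 160, 130, 170, 190, 110]

print(percentile(first_audio_ms, 50))  # prints 150
print(percentile(first_audio_ms, 90))  # prints 190
```

The mean of these samples is 225 ms, dragged up by a single bad turn, which is why tail percentiles are usually a better target than averages for "no dead air" goals.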

Pro Tip: If you’re not already deep into WebRTC (media servers, ICE debugging, TURN costs), start with a WebSocket-native agent stack. You can always front it with WebRTC later if your UX or network pattern demands it; it’s much harder to retrofit a routing-aware AI backend onto a media-only WebRTC stack.

Summary

For realtime voice agents, the “WebSocket vs WebRTC” question is really about what you’re optimizing for:

  • AI-first experiences (agent brains, tools, and real conversational control) tend to work best over WebSocket, especially with a platform like Inworld that’s streaming-native and provider-agnostic.
  • Media-first experiences (calls, group chat, 3D worlds) lean toward WebRTC, sometimes with a WebSocket bridge into an AI backend.
  • Full-duplex audio and interruptions are achievable with either, but the lowest-friction way to get there today is a WebSocket connection to a Realtime API that already solves STT, TTS, routing, and turn detection.

If you can’t measure P90 first audio, handle barge-in cleanly, and keep model costs predictable, you don’t have a voice product—you have a demo. Choose the transport that lets you hit those numbers and still ship fast.
