
How do I implement an LMNT real-time speech session over WebSocket (full-duplex streaming)?
Quick Answer: You implement an LMNT real-time speech session over WebSocket by opening a full-duplex connection to LMNT’s realtime endpoint, sending a JSON “session start” message, then streaming text (and optionally user audio or control messages) while consuming low-latency audio frames in parallel. Use a streaming audio sink (e.g., Web Audio, WebRTC, or a native audio buffer) to play frames as they arrive and keep the connection open for turn‑taking.
Why This Matters
If you’re building a conversational agent, tutor, or game character, HTTP-style “request, then wait for an entire audio file” isn’t fast enough. You need 150–200ms end‑to‑end latency and true full‑duplex behavior so speech can start while the model is still generating—and your user can interrupt, talk over, or trigger the next line.
Real-time WebSocket streaming with LMNT gives you that: the TTS model streams audio as it speaks, you can send text and control messages mid-stream, and you avoid the per-request overhead that bottlenecks agents in production.
Key Benefits:
- Conversational latency: Hit ~150–200ms speaking latency so your agent feels like a human turn‑taker, not a call center IVR.
- True full‑duplex control: Stream text, signals, or user audio while LMNT returns audio frames in parallel—ideal for barge‑in and rapid back‑and‑forth.
- Scales without throttling: LMNT’s “No concurrency or rate limits” posture means your WebSocket design can fan out across agents and games without hitting hidden ceilings.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Full‑duplex WebSocket | A WebSocket connection where client and server can send messages independently at any time. | Enables LMNT to stream audio while you send new text, controls, or user audio without opening new connections. |
| Streaming TTS session | A long‑lived LMNT session that handles multiple messages: session config, text chunks, control events, and streaming audio frames. | Reduces setup overhead and lets you chain turns (questions, answers, interruptions) in a single logical conversation. |
| Low‑latency playback pipeline | Your client-side stack for decoding and playing LMNT’s audio frames as they arrive. | The difference between “demo sounds good” and “product feels real”—poor playback can blow your latency budget even with fast TTS. |
How It Works (Step-by-Step)
At a high level, an LMNT real-time speech session over WebSocket (full-duplex streaming) looks like this:
- Open a WebSocket connection to LMNT’s realtime endpoint.
- Send a “session start” message with config (voice, language, format).
- Stream text/control messages to LMNT while receiving audio frames and events.
- Decode and play audio frames as they arrive.
- Keep the socket open for multiple turns, then send a clean “session end”.
Below is a more concrete breakdown you can adapt to your stack.
1. Create the WebSocket connection
Use your language’s WebSocket client to connect to LMNT’s realtime URL (check https://api.lmnt.com/spec for the exact path and auth scheme—typically a Bearer token).
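Example (TypeScript / browser):

    // Placeholder URL: confirm the exact path and query parameters in LMNT's spec.
    // Browser WebSockets cannot set custom headers, so authenticate in the
    // first message or the URL, whichever the spec calls for.
    const ws = new WebSocket("wss://api.lmnt.com/v1/realtime?model=lmnt-tts");

    ws.onopen = () => {
      // You'll send a session start message here.
    };

    ws.onmessage = (event) => {
      // You'll handle audio frames and events here.
    };

    ws.onerror = (err) => {
      console.error("LMNT WebSocket error", err);
    };

    ws.onclose = () => {
      console.log("LMNT WebSocket closed");
    };
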
For Node, use ws or a similar client. For game engines like Unity, use their WebSocket client to connect from C#.
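One wrinkle with the connection step is that calling send before the socket has opened throws. A minimal sketch of one way around that, assuming a hypothetical `SocketLike` interface (a subset of the standard WebSocket surface) and a hypothetical `QueuedSender` helper that is not part of any LMNT SDK:

```typescript
// Hypothetical helper, not part of any LMNT SDK.
// SocketLike is the subset of the standard WebSocket interface used here.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

const SOCKET_OPEN = 1; // same value as WebSocket.OPEN in the standard API

class QueuedSender {
  private socket: SocketLike;
  private pending: string[] = [];

  constructor(socket: SocketLike) {
    this.socket = socket;
  }

  // Send now if the socket is open; otherwise hold the message.
  send(message: object): void {
    const payload = JSON.stringify(message);
    if (this.socket.readyState === SOCKET_OPEN) {
      this.socket.send(payload);
    } else {
      this.pending.push(payload);
    }
  }

  // Call from ws.onopen to deliver everything queued, in order.
  flush(): void {
    while (this.pending.length > 0 && this.socket.readyState === SOCKET_OPEN) {
      this.socket.send(this.pending.shift() as string);
    }
  }

  // Number of messages still waiting for the socket to open.
  get queued(): number {
    return this.pending.length;
  }
}
```

Because the queue preserves order, a session start message enqueued first is still the first thing the server sees once `flush()` runs.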
2. Send a session start / config message
Once onopen fires, send an initial JSON message that:
- Authenticates the session (if not handled via headers/query).
- Chooses your voice (e.g., "brandon" for a broadcaster style or your clone ID).
- Sets language and optional code-switching behavior.
- Sets audio format (e.g., 16‑bit PCM, Opus, or another format the LMNT spec supports).
Example (TypeScript; the message shape is illustrative):

    ws.onopen = () => {
      const sessionStart = {
        type: "session.start",
        auth: {
          // Or pass via Authorization header; follow the spec.
          api_key: "<LMNT_API_KEY>",
        },
        config: {
          voice: "brandon",
          // LMNT supports 24 languages and mid-sentence switching.
          language: "en-US",
          sample_rate: 24000,
          format: "pcm16",
          // You can add session-level options here (e.g., style, rate).
        },
      };
      ws.send(JSON.stringify(sessionStart));
    };

LMNT’s docs will define the exact type strings and config options; align with those. The key: send this once per connection to initialize the streaming session.
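Since a malformed first message can fail the whole session, it may help to validate the config before sending it. The sketch below assumes the illustrative message shape used in this article; the `buildSessionStart` helper, the `SessionConfig` type, and the list of accepted sample rates are all hypothetical and should be replaced with whatever LMNT's spec defines.

```typescript
// Illustrative types mirroring this article's example message; the real
// field names and allowed values come from LMNT's spec, not this sketch.
type AudioFormat = "pcm16" | "opus";

interface SessionConfig {
  voice: string;
  language: string;
  sample_rate: number;
  format: AudioFormat;
}

// Assumed set of sample rates, for illustration only.
const ASSUMED_SAMPLE_RATES = [8000, 16000, 24000];

// Validates the config and returns the JSON string to send on open.
function buildSessionStart(apiKey: string, config: SessionConfig): string {
  if (apiKey.trim() === "") {
    throw new Error("Missing API key");
  }
  if (config.voice.trim() === "") {
    throw new Error("Missing voice");
  }
  if (ASSUMED_SAMPLE_RATES.indexOf(config.sample_rate) === -1) {
    throw new Error(`Unsupported sample rate: ${config.sample_rate}`);
  }
  return JSON.stringify({
    type: "session.start",
    auth: { api_key: apiKey },
    config,
  });
}
```

Failing fast on the client keeps configuration mistakes out of the latency-sensitive path, where they would otherwise surface as a server error event.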
3. Stream text and control messages
Once the session is accepted, you can send text segments that LMNT will convert to speech—while audio frames are streaming back to you.
Basic text-to-speech message:

    function sendUtterance(text: string, utteranceId: string) {
      const msg = {
        type: "input.text",
        id: utteranceId,
        text,
      };
      ws.send(JSON.stringify(msg));
    }

    // Example
    sendUtterance(
      "Welcome to LMNT. I’ll read the latest headlines in our 'brandon' voice.",
      "utt-1"
    );

Since this is full-duplex, you can:
- Send multiple input.text messages back-to-back.
- Send “stop”/“cancel” messages (e.g., type: "input.cancel", id: "utt-1") to cut off an utterance (for barge-in).
- Dynamically adjust parameters mid-session (e.g., type: "session.update" to tweak speaking style).
Your application logic might:
- Listen to an LLM stream, chunk text, and send each chunk as input.text.
- React to user events (click, speech recognition result) by canceling current audio and sending a new message.
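The "listen to an LLM stream, chunk text" step can be sketched as a small accumulator that releases one complete sentence at a time. `SentenceChunker` is a hypothetical helper, and the boundary rule (a `.`, `!`, or `?` followed by whitespace) is a deliberately simple assumption that will split on abbreviations such as "Dr. Smith".

```typescript
// Hypothetical helper: accumulates streamed LLM tokens and releases text
// one complete sentence at a time.
class SentenceChunker {
  private buffer = "";

  // Feed one token; returns zero or more complete sentences.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // A sentence ends at ., !, or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the LLM stream ends to release any trailing text.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest === "" ? null : rest;
  }
}
```

Each returned sentence would then go out as its own input.text message, so speech can begin before the LLM has finished the full reply.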
4. Handle incoming audio frames and events
On the receiving side, LMNT will stream back audio frames and events as JSON and/or binary messages.
A typical pattern:
- JSON messages for events (session.started, input.started, input.completed, errors).
- Binary or base64 payloads for audio frames.
Handling event messages:

    ws.onmessage = (event) => {
      if (typeof event.data === "string") {
        const msg = JSON.parse(event.data);
        switch (msg.type) {
          case "session.started":
            console.log("LMNT session ready");
            break;
          case "input.started":
            console.log("Utterance started:", msg.id);
            break;
          case "input.completed":
            console.log("Utterance completed:", msg.id);
            break;
          case "error":
            console.error("LMNT error:", msg);
            break;
          default:
            console.log("Other message:", msg);
        }
      } else {
        // Binary data -> audio frame
        handleAudioFrame(event.data);
      }
    };

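A bare JSON.parse throws on malformed input and returns an untyped value, so a stricter parser can make the handler safer. The sketch below is one possible shape, assuming the illustrative event names used in this article; `ServerEvent` and `parseServerEvent` are hypothetical, and the real event set is whatever LMNT's spec defines.

```typescript
// Event names follow this article's illustrative handler; the real set is
// defined by LMNT's spec.
type ServerEvent =
  | { type: "session.started" }
  | { type: "input.started"; id: string }
  | { type: "input.completed"; id: string }
  | { type: "error"; message: string }
  | { type: "unknown"; raw: unknown };

// Turns a raw text message into a typed event without ever throwing.
function parseServerEvent(text: string): ServerEvent {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return { type: "error", message: "Malformed JSON from server" };
  }
  if (typeof data !== "object" || data === null) {
    return { type: "unknown", raw: data };
  }
  const fields = data as Record<string, unknown>;
  const kind = fields["type"];
  const id = fields["id"];
  const detail = fields["message"];
  if (kind === "session.started") {
    return { type: "session.started" };
  }
  if (kind === "input.started" && typeof id === "string") {
    return { type: "input.started", id };
  }
  if (kind === "input.completed" && typeof id === "string") {
    return { type: "input.completed", id };
  }
  if (kind === "error") {
    return { type: "error", message: typeof detail === "string" ? detail : "unknown error" };
  }
  return { type: "unknown", raw: data };
}
```

With a discriminated union, a `switch` on `event.type` gets compile-time checking that `id` is only read on events that carry one.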
Handling audio frames:
You’ll decode and stream these frames into your audio pipeline. For 16-bit PCM over WebSocket:

    function handleAudioFrame(data: Blob | ArrayBuffer) {
      if (data instanceof Blob) {
        data.arrayBuffer().then(playPcmFrame);
      } else {
        playPcmFrame(data);
      }
    }

    // Browsers may keep the context suspended until a user gesture;
    // call audioContext.resume() from a click or tap handler.
    const audioContext = new AudioContext({ sampleRate: 24000 });
    // When the next frame should begin, on the AudioContext clock.
    let nextStartTime = 0;

    function playPcmFrame(buffer: ArrayBuffer) {
      const view = new DataView(buffer);
      const sampleCount = Math.floor(view.byteLength / 2);
      if (sampleCount === 0) return;
      const floatData = new Float32Array(sampleCount);
      for (let i = 0; i < sampleCount; i++) {
        const sample = view.getInt16(i * 2, true); // little-endian
        floatData[i] = sample / 32768; // normalize to [-1, 1)
      }
      const audioBuffer = audioContext.createBuffer(
        1,
        sampleCount,
        audioContext.sampleRate
      );
      audioBuffer.copyToChannel(floatData, 0);
      const source = audioContext.createBufferSource();
      source.buffer = audioBuffer;
      source.connect(audioContext.destination);
      // Queue each frame after the previous one. Calling start() with no
      // argument would play every frame immediately and overlap them.
      nextStartTime = Math.max(nextStartTime, audioContext.currentTime);
      source.start(nextStartTime);
      nextStartTime += audioBuffer.duration;
    }

For games or native apps, replace this with your engine’s audio buffer mechanism (Unity’s OnAudioFilterRead, iOS AVAudioEngine, etc.).
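Whatever the playback target, the PCM decode step is the same and can be kept platform-neutral. The sketch below assumes little-endian 16-bit mono PCM, as in this article's example, and additionally guards against a frame that ends mid-sample by carrying the odd byte into the next frame. Whether LMNT ever splits a sample across messages is not something this article establishes, so treat that guard as defensive. `Pcm16Assembler` is a hypothetical name.

```typescript
// Hypothetical helper: converts little-endian PCM16 bytes to Float32
// samples, carrying a trailing odd byte over to the next frame.
class Pcm16Assembler {
  private leftover: number | null = null;

  push(frame: Uint8Array): Float32Array {
    // Prepend the byte held back from the previous frame, if any.
    let bytes = frame;
    if (this.leftover !== null) {
      bytes = new Uint8Array(frame.length + 1);
      bytes[0] = this.leftover;
      bytes.set(frame, 1);
      this.leftover = null;
    }
    // Hold back a trailing byte that does not complete a 16-bit sample.
    const sampleCount = Math.floor(bytes.length / 2);
    if (bytes.length % 2 === 1) {
      this.leftover = bytes[bytes.length - 1];
    }
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    const out = new Float32Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
      out[i] = view.getInt16(i * 2, true) / 32768; // normalize to [-1, 1)
    }
    return out;
  }
}
```

The returned `Float32Array` can be handed to a Web Audio buffer in the browser or copied into an engine-specific audio callback elsewhere.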
5. Support multiple turns and barge‑in
Because the session is full-duplex and long-lived:
- Multiple turns: Keep the WebSocket open as your agent converses. Your NLU / LLM logic runs in parallel and sends new input.text messages for each assistant reply.
- Barge-in: When a user starts talking, you can:
  - Detect speech via your ASR pipeline.
  - Send an input.cancel for the current utterance ID.
  - Immediately send a new input.text response (once the LLM responds).
Example barge-in control message (pseudo):

    function cancelUtterance(utteranceId: string) {
      ws.send(JSON.stringify({
        type: "input.cancel",
        id: utteranceId,
      }));
    }

This is where LMNT’s 150–200ms streaming shines: the user hears speech almost immediately, but you still retain control to interrupt mid-sentence when needed.
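Barge-in only works if the client knows which utterances are still in flight. A minimal sketch of that bookkeeping follows, assuming the illustrative input.text and input.cancel message shapes from this article. `TurnManager` and its method names are hypothetical, and the `send` callback stands in for whatever wraps `ws.send`.

```typescript
// Hypothetical helper: tracks utterances that are still in flight so a
// barge-in can cancel all of them before the next reply is sent.
type SendFn = (message: object) => void;

class TurnManager {
  private send: SendFn;
  private active = new Set<string>();
  private counter = 0;

  constructor(send: SendFn) {
    this.send = send;
  }

  // Send text to be spoken and remember its utterance ID.
  speak(text: string): string {
    this.counter += 1;
    const id = `utt-${this.counter}`;
    this.active.add(id);
    this.send({ type: "input.text", id, text });
    return id;
  }

  // Call on input.completed so finished utterances are not cancelled later.
  completed(id: string): void {
    this.active.delete(id);
  }

  // Call when the user starts talking; returns the IDs that were cancelled.
  interrupt(): string[] {
    const cancelled: string[] = [];
    this.active.forEach((id) => {
      this.send({ type: "input.cancel", id });
      cancelled.push(id);
    });
    this.active.clear();
    return cancelled;
  }
}
```

After `interrupt()`, the playback queue should also be flushed on the client, since frames already received will otherwise keep playing.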
6. Cleanly end the session
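When the conversation is done, send a session termination message (if defined in the spec) and close the WebSocket:

    function endSession() {
      // Only send if the socket is still open; send() fails otherwise.
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify({ type: "session.end" }));
      }
      ws.close(1000, "session complete"); // 1000 = normal closure
    }
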
On reconnect, repeat the session start/config sequence.
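Reconnecting in a tight loop can make an outage worse, so reconnect attempts are usually spaced out. The sketch below shows exponential backoff with full jitter; `reconnectDelayMs` and its default values are assumptions chosen for illustration, not LMNT recommendations.

```typescript
// Hypothetical helper: exponential backoff with a cap for reconnect attempts.
// attempt is zero-based; random is injectable so the jitter can be tested.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 250,
  maxMs: number = 8000,
  random: () => number = Math.random
): number {
  // Double the ceiling on each attempt, up to maxMs.
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  // "Full jitter": pick a delay between 0 and the ceiling.
  return Math.floor(random() * ceiling);
}
```

A reconnect loop would wait `reconnectDelayMs(attempt)` milliseconds, open a new socket, resend the session start message, and reset `attempt` to zero once the session is confirmed.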
Common Mistakes to Avoid
- Treating WebSocket like batch HTTP:
  If you send one big input.text and wait for the audio to fully stream before sending the next, you lose the point of full-duplex.
  How to avoid it: Stream shorter utterances or incremental chunks from your LLM and send them as they’re ready.
- Ignoring playback buffering and backpressure:
  Writing frames straight to speakers without a small buffer can cause pops, gaps, or timing drift, especially on mobile.
  How to avoid it: Implement a tiny jitter buffer (e.g., 50–100ms), track queue length, and drop queued frames when you cancel an utterance so stale audio does not keep playing.
- Re-creating the WebSocket per utterance:
  Opening/closing a connection for each turn adds extra latency and resource churn.
  How to avoid it: Keep one LMNT WebSocket per active agent or user session and reuse it across many messages.
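The jitter buffer suggested above comes down to deciding when each frame should start. A minimal sketch of that timing logic, kept free of any audio API so it can run anywhere; `PlaybackScheduler` is a hypothetical helper and the 80 ms default lead is an assumed value within the 50–100ms range mentioned above.

```typescript
// Hypothetical helper: decides when each audio frame should start so
// frames play back-to-back behind a small lead (the jitter buffer).
class PlaybackScheduler {
  private leadSeconds: number;
  private nextStart = 0;

  constructor(leadSeconds: number = 0.08) {
    this.leadSeconds = leadSeconds;
  }

  // now: current audio clock time in seconds (e.g. AudioContext.currentTime).
  // Returns the time at which this frame should start playing.
  schedule(now: number, frameSeconds: number): number {
    // If playback has run dry, restart with the configured lead.
    if (this.nextStart <= now) {
      this.nextStart = now + this.leadSeconds;
    }
    const startAt = this.nextStart;
    this.nextStart += frameSeconds;
    return startAt;
  }

  // Seconds of audio scheduled but not yet played at time `now`.
  bufferedSeconds(now: number): number {
    return Math.max(0, this.nextStart - now);
  }

  // Call after cancelling an utterance so new audio starts promptly.
  reset(): void {
    this.nextStart = 0;
  }
}
```

In a browser, the value returned by `schedule` would be passed to `AudioBufferSourceNode.start`, and `reset()` would be paired with stopping any source nodes that are already queued.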
Real-World Example
Imagine you’re building a “newsreader” agent: it pulls stories from https://text.npr.org/ and reads them in LMNT’s brandon voice. You:
- Connect to LMNT’s realtime WebSocket endpoint.
- Send a session.start message choosing voice: "brandon", language: "en-US", sample_rate: 24000, format: "pcm16".
- Fetch headlines from https://text.npr.org/, chunk each headline to a sentence or two, and send each as its own input.text message.
- As LMNT streams back audio frames for the first headline, you begin playback immediately. While it’s speaking, your HTTP client fetches the next story.
- If the user taps “Next story,” you send input.cancel for the current utterance, flush the playback buffer, and immediately send a new input.text with the next headline.
Result: users experience “instant” narration with broadcaster-quality delivery, and the agent feels responsive even on slow networks because TTS latency stays in the 150–200ms range.
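To check whether a deployment actually lands in that latency range, it helps to measure time from sending text to receiving the first audio frame. The sketch below is one way to record that on the client; `FirstAudioTimer` is a hypothetical helper, and the clock is injectable so the logic can be tested without real time passing.

```typescript
// Hypothetical helper: measures time from sending text to the first
// audio frame, in milliseconds.
class FirstAudioTimer {
  private now: () => number;
  private sentAt: number | null = null;
  private samples: number[] = [];

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Call right after sending an input.text message.
  markSent(): void {
    this.sentAt = this.now();
  }

  // Call on every audio frame; only the first frame after markSent counts.
  markAudio(): number | null {
    if (this.sentAt === null) return null;
    const elapsed = this.now() - this.sentAt;
    this.sentAt = null;
    this.samples.push(elapsed);
    return elapsed;
  }

  // Median of recorded latencies, or null if nothing was recorded.
  median(): number | null {
    if (this.samples.length === 0) return null;
    const sorted = this.samples.slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  }
}
```

This measures network plus synthesis time as seen by the client; any playback buffering you add comes on top of the reported figure.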
Pro Tip: Start by prototyping your agent’s speech flow using LMNT’s free Playground and the brandon voice, then move to WebSocket streaming with the same parameters. Matching Playground settings to your code reduces “it sounds different in production” debugging.
Summary
Implementing an LMNT real-time speech session over WebSocket (full-duplex streaming) is mostly about managing one long-lived, low-latency connection:
- Open a WebSocket, send a session.start with voice, language, and format.
- Stream input.text (and control messages) while receiving audio frames and events.
- Decode and play audio in a low-latency pipeline with a small buffer.
- Keep the session alive for multiple turns, supporting barge-in with input.cancel, then gracefully end the session when finished.
With LMNT’s low-latency streaming, 24 languages, and no concurrency or rate limits, this pattern scales cleanly from a single agent demo to a fleet of production assistants and in‑game characters.