
How do I implement an LMNT real-time speech session over WebSocket (full-duplex streaming)?
Quick Answer: To implement an LMNT real-time speech session over WebSocket, open a full-duplex WebSocket connection to LMNT’s streaming TTS endpoint, send JSON control messages describing what to say (and in which voice), and consume the binary audio frames as they arrive with ~150–200ms end‑to‑end latency. Your app keeps the socket open so you can keep sending new text and receiving audio in a single conversational session.
Why This Matters
If you’re building conversational apps, agents, or games, your voice stack lives or dies on latency and turn-taking. A REST-style “send text → wait for full audio file” flow is too slow and too rigid; you need full‑duplex streaming so the model can start talking while you’re still generating or updating text. LMNT’s real-time WebSocket session is designed for exactly this: 150–200ms low-latency streaming, natural voices in 24 languages (including mid-sentence switching), and no concurrency or rate limits so you can scale from a single prototype to a live production service.
Key Benefits:
- Conversational latency: Audio starts in ~150–200ms, so your agents can talk and interrupt like real people.
- True full‑duplex: Send control/text messages while you receive audio frames on the same WebSocket—ideal for LLM streaming and turn-taking.
- Production‑ready scale: No concurrency or rate limits, predictable usage-based pricing, and SOC‑2 Type II posture when you need to go to prod.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Full‑duplex WebSocket | A bidirectional connection where you can send and receive messages simultaneously over a single socket. | Lets you stream text (or control events) to LMNT while continuously receiving audio back—critical for real-time agents. |
| Streaming TTS session | A long‑lived LMNT session that turns your incremental text (or prompts) into a continuous stream of audio frames. | You can keep a conversational context open, reduce setup overhead, and manage turn-taking within one connection. |
| Voice & language config | JSON parameters specifying voice ID, language, style, and options like code-switching. | Controls how your app “sounds” without changing the plumbing; LMNT voices speak 24 languages and can switch mid-sentence. |
How It Works (Step-by-Step)
At a high level, implementing an LMNT real-time speech session over WebSocket looks like this:
1. Authenticate & open the WebSocket
   - Use your LMNT API key (typically via an `Authorization` header or query token) to connect to the streaming endpoint (e.g., `wss://api.lmnt.com/v1/stream`; check the latest path in the LMNT API spec).
   - Open a WebSocket from your server or client, depending on your architecture (Node, Python, browser, Unity, etc.).
2. Send a session init message
   - Once the socket is open, send a JSON control message to configure the session: voice, language, format, and any session options.
   - LMNT returns an acknowledgement (e.g., `"session_started"`) so you know you're ready to stream.
3. Stream text & consume audio frames
   - As your LLM or game engine produces text, send it in incremental JSON messages (e.g., `"speak"` or `"append"` events).
   - LMNT streams back audio frames (often as binary messages, sometimes tagged with small JSON envelopes).
   - Play frames as they arrive using your audio stack (Web Audio API, WebRTC, game engine audio sources, etc.) instead of waiting for completion.
Below is a representative pattern you can adapt to your language of choice.
1. Open a full‑duplex WebSocket
Node.js example (server-side):
```javascript
import WebSocket from 'ws';

const LMNT_WS_URL = 'wss://api.lmnt.com/v1/stream'; // check docs for current path
const LMNT_API_KEY = process.env.LMNT_API_KEY;

const ws = new WebSocket(LMNT_WS_URL, {
  headers: {
    Authorization: `Bearer ${LMNT_API_KEY}`,
  },
});

ws.on('open', () => {
  console.log('LMNT WebSocket connected');
  // You can now send session init
});

ws.on('error', (err) => {
  console.error('LMNT WebSocket error:', err);
});

ws.on('close', (code, reason) => {
  console.log('LMNT WebSocket closed', code, reason.toString());
});
```
Client-side (browser) note:
If you want audio in the browser, you'll typically terminate the LMNT connection on your backend (so your API key never ships to the client) and proxy a WebSocket (or WebRTC) stream between the backend and the browser.
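The relay wiring for that proxy can be sketched as a small function that pairs a browser-facing socket with an upstream LMNT socket. This is a minimal sketch, not LMNT-specific: it only assumes socket objects with `send()`/`close()` and `on('message'|'close', fn)`, which matches the `ws` package used above.

```javascript
// Proxy wiring sketch: pair a browser-facing socket with an upstream LMNT
// socket so text/control messages flow up and audio frames flow down.
// Both objects just need send()/close() and on('message'|'close', fn).
function bridge(client, upstream) {
  // Browser -> LMNT: forward text/control messages as-is
  client.on('message', (data) => upstream.send(data));
  // LMNT -> Browser: relay audio frames and JSON events unchanged
  upstream.on('message', (data, isBinary) => client.send(data, { binary: isBinary }));
  // Tear both sides down together so neither leaks a session
  client.on('close', () => upstream.close());
  upstream.on('close', () => client.close());
}
```

In practice you'd call `bridge(client, upstream)` from your `WebSocketServer`'s `connection` handler, once the upstream LMNT socket has emitted `open` (or after buffering early messages), so `send()` is never called on a connecting socket.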
2. Initialize the session
After open, send a JSON message telling LMNT what kind of audio stream you want. Exact field names and shapes are in the LMNT API spec, but a typical shape looks like:
```javascript
function initSession() {
  const msg = {
    type: 'session_init',
    // A specific LMNT voice, e.g., "brandon" for an engaging broadcaster
    voice_id: 'brandon',
    // Language code; LMNT supports 24 languages and code-switching
    language: 'en-US',
    // For agents, you might want low-latency, streaming-friendly output
    audio_format: 'pcm_s16le_16k', // example; check spec for supported formats
    // Optional session parameters
    metadata: {
      app: 'my-realtime-agent',
      conversation_id: 'user-1234',
    },
  };
  ws.send(JSON.stringify(msg));
}

ws.on('open', initSession);

// Handle init acknowledgement
ws.on('message', (data, isBinary) => {
  if (!isBinary) {
    try {
      const msg = JSON.parse(data.toString());
      if (msg.type === 'session_started') {
        console.log('LMNT session started', msg.session_id);
        // Now start streaming text
      }
    } catch {
      // not JSON, skip
    }
  }
});
```
You only do this once per conversation/session. Re‑use the session until the interaction ends.
3. Stream text and receive audio in real time
Once the session is active, you send speech requests and receive audio concurrently.
Sending text incrementally
This is where full‑duplex matters: as your LLM streams tokens, you can drip them into LMNT without waiting for the full response.
```javascript
function sendTextChunk(text) {
  const msg = {
    type: 'speak', // or "append", "utterance", etc. per spec
    text,
    // Optional: an utterance ID so you can correlate audio with text
    utterance_id: `utt-${Date.now()}`,
  };
  ws.send(JSON.stringify(msg));
}

// Example: piping an LLM token stream into LMNT
llm.on('partial', (partialText) => {
  sendTextChunk(partialText);
});

llm.on('done', () => {
  // Optionally signal that this utterance is complete
  ws.send(JSON.stringify({ type: 'utterance_end' }));
});
```
You can also send control messages mid-flow (e.g., to change voice, pause, or stop) without closing the socket:
```javascript
function stopSpeaking() {
  ws.send(JSON.stringify({ type: 'stop' }));
}
```
Receiving and playing audio frames
LMNT will stream back audio frames with low latency. Many implementations use:
- Binary WebSocket frames for raw audio chunks
- Occasional JSON messages to signal events (`start`, `end`, errors, etc.)
You want to:
- Detect binary vs JSON frames.
- Buffer the audio in a small jitter buffer.
- Feed it into your player (Web Audio, an audio queue in a game engine, etc.).
Node example with a simple PCM buffer:
```javascript
const audioFrames = [];

ws.on('message', (data, isBinary) => {
  if (isBinary) {
    // This is an audio frame
    audioFrames.push(data);
    // In a real app, push into your playback pipeline immediately
    handleAudioFrame(data);
  } else {
    const msg = JSON.parse(data.toString());
    switch (msg.type) {
      case 'audio_started':
        console.log('Audio started for', msg.utterance_id);
        break;
      case 'audio_finished':
        console.log('Audio finished for', msg.utterance_id);
        break;
      case 'error':
        console.error('LMNT error:', msg);
        break;
      default:
        // handle other events
        break;
    }
  }
});

function handleAudioFrame(frameBuffer) {
  // Server-side: forward to your client via WebSocket/WebRTC
  // Client-side: enqueue into a Web Audio source / AudioWorklet
}
```
On the client, you might decode PCM using an AudioWorkletProcessor or a simple script node, or you can choose an encoded format (Opus, etc.) if LMNT offers it and decode in your playback layer.
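The jitter buffer mentioned above can be as simple as a class that holds frames until a target amount of audio has accumulated before releasing anything to the player. This is a format-agnostic sketch; the `bytesPerMs` default assumes the 16 kHz, 16-bit mono PCM format (`pcm_s16le_16k`) used earlier, where 1 ms of audio is 32 bytes.

```javascript
// Minimal jitter buffer sketch: hold incoming frames until ~150ms of audio
// has accumulated, then release them in order. Assumes 16 kHz, 16-bit mono
// PCM by default (32 bytes per millisecond of audio).
class JitterBuffer {
  constructor({ targetMs = 150, bytesPerMs = 32 } = {}) {
    this.targetBytes = targetMs * bytesPerMs;
    this.frames = [];
    this.buffered = 0;
    this.started = false; // flips true once the buffer has primed
  }

  // Call from your WebSocket message handler for each binary frame.
  push(frame) {
    this.frames.push(frame);
    this.buffered += frame.length;
    if (this.buffered >= this.targetBytes) this.started = true;
  }

  // Call from your playback loop; returns the next frame, or null if
  // the buffer hasn't primed yet or is empty.
  pull() {
    if (!this.started || this.frames.length === 0) return null;
    const frame = this.frames.shift();
    this.buffered -= frame.length;
    return frame;
  }
}
```

Note the `started` flag stays true after priming, so playback keeps draining even when the buffer momentarily runs low; a fancier version might re-prime after an underrun.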
Common Mistakes to Avoid
- Treating the WebSocket like a one-shot HTTP call: Don't open a socket per sentence and immediately close it. Keep the WebSocket alive for the whole conversation so you avoid handshake overhead and can benefit from real full‑duplex streaming.
- Waiting for full text before speaking: If you wait for your LLM to finish generating before you send anything to LMNT, you lose the latency advantage. Stream partial text chunks as they arrive; LMNT is designed for this kind of incremental input and 150–200ms response times.
- Ignoring backpressure and playback timing: If you just dump every audio frame into playback without buffering, you may get stutters on unstable networks. Maintain a small jitter buffer (e.g., 100–200ms of audio) and respect backpressure from your audio output or downstream client.
- Leaking sessions / not handling reconnects: Always handle `close` and `error` events, and implement simple reconnect logic. Cleanly end sessions when the user hangs up or closes the tab.
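Reconnect logic doesn't need to be elaborate: a dial function with capped exponential backoff covers most failures. In this sketch the WebSocket constructor is injected so it works with both the `ws` package and the browser's built-in `WebSocket`; auth setup (headers or query token) is omitted for brevity, and the handler names are illustrative, not part of LMNT's spec.

```javascript
// Capped exponential backoff: 500ms, 1s, 2s, 4s, ... up to 10s.
function backoffDelay(attempt, baseMs = 500, maxMs = 10_000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Dial, and re-dial on close unless the caller says to stop.
// `WS` is a WebSocket constructor (the `ws` package or the browser global).
function connectWithRetry(WS, url, handlers, attempt = 0) {
  const ws = new WS(url);
  ws.onopen = () => {
    attempt = 0; // healthy connection: reset the backoff
    handlers.onOpen?.(ws); // re-send session_init here
  };
  ws.onmessage = (event) => handlers.onMessage?.(event);
  ws.onclose = () => {
    if (handlers.shouldReconnect?.() === false) return; // user hung up
    setTimeout(
      () => connectWithRetry(WS, url, handlers, attempt + 1),
      backoffDelay(attempt)
    );
  };
  return ws;
}
```

Re-sending `session_init` in `onOpen` matters: a reconnected socket is a brand-new LMNT session, so any voice/language/format configuration has to be re-established before you resume streaming text.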
Real-World Example
Imagine you’re building a web-based History Tutor similar to LMNT’s demo:
- The frontend streams microphone audio or text prompts to your backend.
- Your backend uses an LLM to generate a conversational answer token‑by‑token.
- As tokens arrive, you feed partial sentences into the LMNT WebSocket using `"speak"` messages.
- LMNT starts streaming audio back in ~150–200ms, which you relay to the browser over another WebSocket.
- The browser plays the audio frames as they come in, so the tutor starts answering almost immediately, and the student can interject or ask follow‑ups in real time.
You keep one LMNT session open per user, adjust the voice or language as needed (LMNT voices speak 24 languages and can code‑switch mid-sentence), and rely on “no concurrency or rate limits” to scale as more students join.
Pro Tip: Start by forking a working demo—like LMNT’s “History Tutor” (LLM-driven streaming speech on Vercel) or “Big Tony’s Auto Emporium” (realtime speech-to-speech using LiveKit)—and then swap in your own prompt logic and UI. You’ll get the full-duplex WebSocket wiring, audio pipeline, and reconnect behavior “for free” instead of re‑inventing them.
Summary
Implementing an LMNT real-time speech session over WebSocket is a straightforward pattern:
- Open a full‑duplex WebSocket to LMNT's streaming endpoint with your API key.
- Send a `session_init` JSON message to configure voice, language, and format.
- Stream text (or partial LLM outputs) via JSON messages while receiving low‑latency audio frames on the same socket.
- Play those frames with a lightweight jitter buffer for smooth, conversational delivery.
Because LMNT is tuned for 150–200ms low‑latency streaming, 24 languages with natural code-switching, and no concurrency or rate limits, you can go from a prototype in the Playground to a production-grade agent, tutor, or game character without re‑architecting your voice stack later.