
How do I create a voice clone in LMNT from a short audio sample, and what file format/settings should I use?
Most teams are surprised by how little audio they need to get a great LMNT voice clone running in production. If you can capture a clean 5-second recording in a standard format (like WAV or high-bitrate MP3), you have enough to create a studio-quality clone and start streaming speech with 150–200 ms latency for your agents, apps, and games.
Quick Answer: To create a voice clone in LMNT, record ~5 seconds of clean, single-speaker audio, upload it via the Playground or API, and save it as a reusable voice. Use a common format like 16-bit PCM WAV or a high-quality MP3, recorded at 44.1 kHz or 48 kHz, mono, with no background music or effects. Once the clone is created, you can generate low-latency streaming speech in up to 24 languages using that voice.
Why This Matters
Voice cloning is where your product stops sounding like a generic AI and starts sounding like a character, brand, or persona users actually remember. When you can spin up studio-quality voice clones from just a few seconds of audio, you remove the biggest blockers to deploying conversational agents and game characters at scale: long recording sessions, expensive studio time, and brittle, latency-heavy TTS stacks.
For builders, the real unlock is speed:
- Capture a short sample.
- Clone once.
- Use that voice everywhere—Playground, API, demos, and production—without fighting rate limits or concurrency caps.
Key Benefits:
- Fast setup from tiny samples: All you need is a 5-second recording to get a studio-quality clone instead of scheduling full voiceover sessions.
- Production-ready latency: Cloned voices stream in 150–200 ms, so the experience feels conversational instead of “AI voiceover.”
- Scales with your product: Create many clones per project, use them across agents and games, and rely on predictable character-based pricing with no concurrency or rate limits.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Short-sample voice cloning | Creating a reusable voice profile from just a few seconds of recorded audio. | Lets you capture voices quickly (actors, teammates, NPCs) and ship features without long studio sessions. |
| Clean source recording | An audio sample with a single speaker, minimal noise, no music, and consistent tone. | The quality of your clone is capped by the quality of your source—clean in, lifelike out. |
| Format & recording settings | Technical parameters like file type, sample rate, bit depth, channel count, and loudness. | Using standard formats (e.g., 16-bit PCM WAV, 44.1–48 kHz mono) avoids artifacts and makes cloning more accurate and robust. |
How It Works (Step-by-Step)
At a high level, creating a voice clone in LMNT is a three-part flow: record, upload, then generate. You can do this entirely in the Playground or via the Developer API.
1. Capture a clean 5-second recording
You don’t need a professional studio, but you do need a reasonable recording environment and sane settings.
Recommended recording settings
- Length: 5–15 seconds (LMNT can work with ~5 seconds; more is fine if it’s all clean and consistent).
- File format:
- Best: WAV (16-bit PCM, mono)
- Also fine: MP3 or M4A at 128 kbps+, CBR or high-quality VBR
- Sample rate: 44.1 kHz or 48 kHz
- Channels: Mono (stereo is okay but mono is simpler and sufficient)
- Loudness: No clipping; aim for signal peaks around −6 dBFS.
Capture checklist
- Single speaker only, speaking naturally (no character filters, no voice changer).
- No background music, jingles, or sound effects.
- Minimal room echo—quiet room, soft furnishings help.
- Avoid plosives (p, b) popping the mic: keep the mic slightly off-axis.
- Use a neutral script: a short paragraph with a mix of consonants, vowels, and numbers (e.g., a short intro plus a sentence with dates or quantities).
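If you're scripting your capture pipeline, the settings and checklist above are easy to sanity-check automatically. Here's a minimal Python sketch using only the standard library's `wave` module; the thresholds mirror this guide's recommendations, not any hard LMNT limit:

```python
import sys
import wave

def check_sample(path):
    """Flag deviations from the recommended capture settings."""
    issues = []
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        width = w.getsampwidth()              # bytes per sample; 2 == 16-bit PCM
        seconds = w.getnframes() / rate
    if channels != 1:
        issues.append(f"expected mono, got {channels} channels")
    if rate not in (44100, 48000):
        issues.append(f"expected 44.1 or 48 kHz, got {rate} Hz")
    if width != 2:
        issues.append(f"expected 16-bit PCM, got {width * 8}-bit")
    if not 5 <= seconds <= 15:
        issues.append(f"{seconds:.1f} s long; aim for 5-15 s")
    return issues

if __name__ == "__main__" and len(sys.argv) > 1:
    for problem in check_sample(sys.argv[1]):
        print("warning:", problem)
```

An empty result means the file matches the recommended settings; anything flagged is worth a re-take or a re-export before you upload.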
2. Create the voice clone in LMNT (Playground or API)
Once you have the file, you can create the clone in two main ways:
Option A: Use the LMNT Playground
This is the fastest way to prototype.
1. Open the Playground: Go to LMNT’s free Playground from the main site navigation.
2. Find the voice cloning flow: In the voice selection area, look for an option like “Create voice” or “Clone voice”.
3. Upload your audio file:
   - Select your 5–15 second recording (WAV, MP3, or M4A).
   - Confirm the speaker’s name or label (e.g., `Support-Agent-A`, `Narrator-EN`, `NPC-Tony`).
4. Create and save the clone: LMNT will process the sample into a studio-quality voice clone. Once done, it appears in your voice list so you can select it, enter text, and generate speech instantly.
5. Test with multilingual output:
   - Type text in English or any of the 24 languages LMNT supports.
   - Try code-switching mid-sentence (e.g., “Let’s switch to español for a second, then back to English.”) to confirm the cloned voice handles mixed-language delivery naturally.
Option B: Use the Developer API
When you’re ready to wire this into your app or pipeline:
1. Pull up your code editor: Visit https://api.lmnt.com/spec to view the API and generate an SDK call for your stack (Node, Python, Go, Rust, etc.).
2. Upload the voice sample programmatically:
   - Create a request that sends your audio file and metadata (name, description).
   - The response will include a voice ID you can store and reuse.
3. Use the clone in streaming TTS:
   - Call the streaming endpoint with your voice ID and text.
   - LMNT returns low-latency audio (150–200 ms) suitable for conversational agents and games.
4. Scale across services: Share the stored voice ID across your services (agents, NPCs, tutors) instead of re-uploading audio.
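As a rough illustration of the upload step, here's a Python sketch that prepares a multipart request with `requests`. The URL path, auth header name, and form field names below are placeholders, not the real API surface; consult https://api.lmnt.com/spec for the actual routes and parameters before sending anything:

```python
import requests

API_KEY = "your-api-key"
# NOTE: the URL path, header name, and form fields below are illustrative
# placeholders; check https://api.lmnt.com/spec for the real ones.
CREATE_VOICE_URL = "https://api.lmnt.com/v1/voices"  # hypothetical path

def build_clone_request(sample_path, voice_name):
    """Prepare (but don't send) a multipart voice-upload request."""
    with open(sample_path, "rb") as f:
        audio = f.read()
    req = requests.Request(
        "POST",
        CREATE_VOICE_URL,
        headers={"X-API-Key": API_KEY},      # auth header name: assumption
        files={"sample": (sample_path, audio, "audio/wav")},
        data={"name": voice_name},
    )
    return req.prepare()

# To actually send it and capture the reusable voice ID:
#   resp = requests.Session().send(build_clone_request("alex.wav", "Tutor-Alex"))
#   voice_id = resp.json()["id"]   # response field name per the spec
```

Separating "build the request" from "send it" keeps the upload easy to unit-test and makes it trivial to swap in the endpoint and field names from the spec.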
3. Generate and tune output in your app
With the clone created, you can focus on delivery:
- Text → speech: Use the voice ID with the streaming TTS endpoint to read prompts, chat responses, or scripted lines.
- Realtime experiences: For agents or games, keep sessions open over WebSockets and stream chunks as you receive LLM tokens.
- Language & style:
- Switch among 24 languages, even mid-sentence.
- Adjust phrasing and punctuation in your prompts to nudge prosody (pauses, emphasis, etc.).
Because LMNT has no concurrency or rate limits, you can spin up many simultaneous sessions without redesigning your traffic patterns as you scale.
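Whatever transport you use (chunked HTTP responses or a WebSocket), the consuming pattern is the same: hand each audio chunk to your player as soon as it arrives instead of buffering the whole utterance. A minimal, transport-agnostic sketch in Python, where `chunks` stands in for your response iterator:

```python
import io

def drain_stream(chunks, sink):
    """Forward audio chunks to a sink as they arrive; return bytes seen."""
    total = 0
    for chunk in chunks:
        if chunk:                  # skip keep-alive / empty frames
            sink.write(chunk)      # hand off to your audio player immediately
            total += len(chunk)
    return total

# Example with an in-memory sink standing in for a player buffer:
buffer = io.BytesIO()
received = drain_stream([b"\x01\x02", b"", b"\x03"], buffer)
```

In a real agent, `sink` would be your playback queue, and `chunks` the iterator over streaming TTS frames, so audio starts playing on the first chunk rather than after the full response.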
Common Mistakes to Avoid
- Using noisy or “styled” audio as your only sample: Background music, heavy reverb, or a voice changer will bake artifacts into the clone. How to avoid it: Record a clean, dry take specifically for cloning; you can add effects later at playback time in your own stack if needed.
- Overcomplicating recording settings: Recording at 96 kHz, stereo, ultra-high bit depth, or with aggressive post-processing doesn’t improve cloning performance and can even hurt it. How to avoid it: Stick to standard, production-friendly settings: 16-bit PCM WAV, 44.1–48 kHz, mono, light or no processing.
- Feeding multiple speakers into one sample: A short dialogue clip or overlapping voices confuse the model about which voice to clone. How to avoid it: Use a dedicated monologue. If you want multiple characters, record separate samples and make separate clones.
- Clipping and distortion: An overloaded mic signal or compressed social-media audio (e.g., a screen recording) can produce harsh, unnatural clones. How to avoid it: Record directly from the mic, monitor your input level, and re-take if you see consistent clipping.
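To catch clipping before you upload, you can measure the take's peak level yourself. This Python sketch (standard library only; assumes a 16-bit mono PCM WAV on a little-endian platform) reports the peak in dBFS: values at or near 0 dBFS mean the take clipped and should be re-recorded, while around −6 dBFS matches the headroom target recommended above:

```python
import array
import math
import wave

def peak_dbfs(path):
    """Return the peak level of a 16-bit mono PCM WAV in dBFS."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        samples = array.array("h", w.readframes(w.getnframes()))
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")       # silent file
    return 20 * math.log10(peak / 32768)

# Near 0 dBFS: the take almost certainly clipped; re-record it.
# Around -6 dBFS: matches the headroom target recommended above.
```

Run it on each take before uploading; it's faster than listening back for distortion and catches subtle, consistent clipping that's easy to miss by ear.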
Real-World Example
Say you’re building a streaming “History Tutor” agent similar to LMNT’s Vercel-hosted demo. You want the tutor to sound like your lead educator, but you only have a short window to record them.
Here’s the workflow:
- You sit your educator in a quiet conference room with a USB mic.
- They record a 10-second script: “Hi, I’m Alex, your history tutor. I’ll walk you through events, dates, and big ideas—and switch languages if that helps.”
- You export the recording as 16-bit PCM WAV, 48 kHz, mono, then upload it through the LMNT Playground’s Clone voice flow.
- LMNT creates a clone; you name it `Tutor-Alex`.
- In your Vercel app, you update the streaming TTS call to use the `Tutor-Alex` voice ID for all agent responses.
- Now users chat with the tutor in English, occasionally switching to Spanish or French mid-sentence; streaming responses arrive in 150–200 ms, so the back-and-forth feels like a real conversation.
You didn’t need a studio day, a bespoke TTS training pipeline, or special formats—just a clean 10-second WAV file and a few minutes in the Playground.
Pro Tip: When you’re capturing a voice you’ll use across multiple experiences (support agent, tutor, in-game narrator), record two versions: a neutral baseline and a slightly more energetic read. Clone from the neutral one for maximum versatility, then use prompt wording (and punctuation) to dial up or down the energy in different apps.
Summary
To create a voice clone in LMNT from a short audio sample, focus on three things: clean capture, standard file formats, and a simple Playground → API workflow. A 5-second, mono WAV or high-bitrate MP3 recorded at 44.1–48 kHz is enough to produce a studio-quality voice clone that can stream in 150–200 ms and speak 24 languages, including mid-sentence code-switching. Once cloned, that voice becomes just another ID you use in the Playground, in your API calls, and in forkable demos—letting you scale lifelike, low-latency voices across all your conversational apps, agents, and games.