
How do I create a voice clone in LMNT from a short audio sample, and what file format/settings should I use?
Most teams are surprised by how little audio they need to get a great LMNT voice clone running in production. If you can capture a clean 5-second recording in a standard format (like WAV or high-bitrate MP3), you have enough to create a studio-quality clone and start streaming speech with 150–200 ms latency for your agents, apps, and games.
Quick Answer: To create a voice clone in LMNT, record ~5 seconds of clean, single-speaker audio, upload it via the Playground or API, and save it as a reusable voice. Use a common format like 16-bit PCM WAV or a high-quality MP3, recorded at 44.1 kHz or 48 kHz, mono, with no background music or effects. Once the clone is created, you can generate low-latency streaming speech in up to 24 languages using that voice.
Why This Matters
Voice cloning is where your product stops sounding like a generic AI and starts sounding like a character, brand, or persona users actually remember. When you can spin up studio-quality voice clones from just a few seconds of audio, you remove the biggest blockers to deploying conversational agents and game characters at scale: long recording sessions, expensive studio time, and brittle, latency-heavy TTS stacks.
For builders, the real unlock is speed:
- Capture a short sample.
- Clone once.
- Use that voice everywhere—Playground, API, demos, and production—without fighting rate limits or concurrency caps.
Key Benefits:
- Fast setup from tiny samples: All you need is a 5-second recording to get a studio-quality clone instead of scheduling full voiceover sessions.
- Production-ready latency: Cloned voices stream in 150–200 ms, so the experience feels conversational instead of “AI voiceover.”
- Scales with your product: Create many clones per project, use them across agents and games, and rely on predictable character-based pricing with no concurrency or rate limits.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Short-sample voice cloning | Creating a reusable voice profile from just a few seconds of recorded audio. | Lets you capture voices quickly (actors, teammates, NPCs) and ship features without long studio sessions. |
| Clean source recording | An audio sample with a single speaker, minimal noise, no music, and consistent tone. | The quality of your clone is capped by the quality of your source—clean in, lifelike out. |
| Format & recording settings | Technical parameters like file type, sample rate, bit depth, channel count, and loudness. | Using standard formats (e.g., 16-bit PCM WAV, 44.1–48 kHz mono) avoids artifacts and makes cloning more accurate and robust. |
How It Works (Step-by-Step)
At a high level, creating a voice clone in LMNT is a three-part flow: record, upload, then generate. You can do this entirely in the Playground or via the Developer API.
1. Capture a clean 5-second recording
You don’t need a professional studio, but you do need a reasonable recording environment and sane settings.
Recommended recording settings
- Length: 5–15 seconds (LMNT can work with ~5 seconds; more is fine if it’s all clean and consistent).
- File format:
- Best: WAV (16-bit PCM, mono)
- Also fine: MP3 or M4A at 128 kbps+, CBR or high-quality VBR
- Sample rate: 44.1 kHz or 48 kHz
- Channels: Mono (stereo is okay but mono is simpler and sufficient)
- Loudness: No clipping; aim for signal peaks around −6 dBFS.
Capture checklist
- Single speaker only, speaking naturally (no character filters, no voice changer).
- No background music, jingles, or sound effects.
- Minimal room echo—quiet room, soft furnishings help.
- Avoid plosives (p, b) popping the mic: keep the mic slightly off-axis.
- Use a neutral script: a short paragraph with a mix of consonants, vowels, and numbers (e.g., a short intro plus a sentence with dates or quantities).
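If you're scripting your capture pipeline, the settings and checklist above are easy to sanity-check automatically. Here's a minimal Python sketch using only the standard library's `wave` module; the thresholds mirror this guide's recommendations, not any hard LMNT limit:

```python
import sys
import wave

def check_sample(path):
    """Flag deviations from the recommended capture settings."""
    issues = []
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        width = w.getsampwidth()              # bytes per sample; 2 == 16-bit PCM
        seconds = w.getnframes() / rate
    if channels != 1:
        issues.append(f"expected mono, got {channels} channels")
    if rate not in (44100, 48000):
        issues.append(f"expected 44.1 or 48 kHz, got {rate} Hz")
    if width != 2:
        issues.append(f"expected 16-bit PCM, got {width * 8}-bit")
    if not 5 <= seconds <= 15:
        issues.append(f"{seconds:.1f} s long; aim for 5-15 s")
    return issues

if __name__ == "__main__" and len(sys.argv) > 1:
    for problem in check_sample(sys.argv[1]):
        print("warning:", problem)
```

An empty result means the file matches the recommended settings; anything flagged is worth a re-take or a re-export before you upload.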
2. Create the voice clone in LMNT (Playground or API)
Once you have the file, you can create the clone in two main ways:
Option A: Use the LMNT Playground
This is the fastest way to prototype.
1. Open the Playground: Go to LMNT’s free Playground from the main site navigation.
2. Find the voice cloning flow: In the voice selection area, look for an option like “Create voice” or “Clone voice”.
3. Upload your audio file:
   - Select your 5–15 second recording (WAV, MP3, or M4A).
   - Confirm the speaker’s name or label (e.g., `Support-Agent-A`, `Narrator-EN`, `NPC-Tony`).
4. Create and save the clone: LMNT will process the sample into a studio-quality voice clone. Once done, it appears in your voice list so you can select it, enter text, and generate speech instantly.
5. Test with multilingual output:
   - Type text in English or any of the 24 languages LMNT supports.
   - Try code-switching mid-sentence (e.g., “Let’s switch to español for a second, then back to English.”) to confirm the cloned voice handles mixed-language delivery naturally.
Option B: Use the Developer API
When you’re ready to wire this into your app or pipeline:
1. Pull up your code editor: Visit https://api.lmnt.com/spec to view the API and generate an SDK call for your stack (Node, Python, Go, Rust, etc.).
2. Upload the voice sample programmatically:
   - Create a request that sends your audio file and metadata (name, description).
   - The response will include a voice ID you can store and reuse.
3. Use the clone in streaming TTS:
   - Call the streaming endpoint with your voice ID and text.
   - LMNT returns low-latency audio (150–200 ms) suitable for conversational agents and games.
4. Scale across services: Share the stored voice ID across your services (agents, NPCs, tutors) instead of re-uploading audio.
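As a rough illustration of the upload step, here's a Python sketch that prepares a multipart request with `requests`. The URL path, auth header name, and form field names below are placeholders, not the real API surface; consult https://api.lmnt.com/spec for the actual routes and parameters before sending anything:

```python
import requests

API_KEY = "your-api-key"
# NOTE: the URL path, header name, and form fields below are illustrative
# placeholders; check https://api.lmnt.com/spec for the real ones.
CREATE_VOICE_URL = "https://api.lmnt.com/v1/voices"  # hypothetical path

def build_clone_request(sample_path, voice_name):
    """Prepare (but don't send) a multipart voice-upload request."""
    with open(sample_path, "rb") as f:
        audio = f.read()
    req = requests.Request(
        "POST",
        CREATE_VOICE_URL,
        headers={"X-API-Key": API_KEY},      # auth header name: assumption
        files={"sample": (sample_path, audio, "audio/wav")},
        data={"name": voice_name},
    )
    return req.prepare()

# To actually send it and capture the reusable voice ID:
#   resp = requests.Session().send(build_clone_request("alex.wav", "Tutor-Alex"))
#   voice_id = resp.json()["id"]   # response field name per the spec
```

Separating "build the request" from "send it" keeps the upload easy to unit-test and makes it trivial to swap in the endpoint and field names from the spec.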
3. Generate and tune output in your app
With the clone created, you can focus on delivery:
- Text → speech: Use the voice ID with the streaming TTS endpoint to read prompts, chat responses, or scripted lines.
- Realtime experiences: For agents or games, keep sessions open over WebSockets and stream chunks as you receive LLM tokens.
- Language & style:
- Switch among 24 languages, even mid-sentence.
- Adjust phrasing and punctuation in your prompts to nudge prosody (pauses, emphasis, etc.).
Because LMNT has no concurrency or rate limits, you can spin up many simultaneous sessions without redesigning your traffic patterns as you scale.
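Whatever transport you use (chunked HTTP responses or a WebSocket), the consuming pattern is the same: hand each audio chunk to your player as soon as it arrives instead of buffering the whole utterance. A minimal, transport-agnostic sketch in Python, where `chunks` stands in for your response iterator:

```python
import io

def drain_stream(chunks, sink):
    """Forward audio chunks to a sink as they arrive; return bytes seen."""
    total = 0
    for chunk in chunks:
        if chunk:                  # skip keep-alive / empty frames
            sink.write(chunk)      # hand off to your audio player immediately
            total += len(chunk)
    return total

# Example with an in-memory sink standing in for a player buffer:
buffer = io.BytesIO()
received = drain_stream([b"\x01\x02", b"", b"\x03"], buffer)
```

In a real agent, `sink` would be your playback queue, and `chunks` the iterator over streaming TTS frames, so audio starts playing on the first chunk rather than after the full response.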
Common Mistakes to Avoid
- Using noisy or “styled” audio as your only sample: Background music, heavy reverb, or a voice changer will bake artifacts into the clone. How to avoid it: Record a clean, dry take specifically for cloning; you can add effects later at playback time in your own stack if needed.
- Overcomplicating recording settings: Recording at 96 kHz, stereo, ultra-high bit depth, or with aggressive post-processing doesn’t improve cloning performance and can even hurt it. How to avoid it: Stick to standard, production-friendly settings: 16-bit PCM WAV, 44.1–48 kHz, mono, light or no processing.
- Feeding multiple speakers into one sample: A short dialogue clip or overlapping voices confuse the model about which voice to clone. How to avoid it: Use a dedicated monologue. If you want multiple characters, record separate samples and make separate clones.
- Clipping and distortion: An overloaded mic signal or compressed social-media audio (e.g., a screen recording) can produce harsh, unnatural clones. How to avoid it: Record directly from the mic, monitor your input level, and re-take if you see consistent clipping.
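To catch clipping before you upload, you can measure the take's peak level yourself. This Python sketch (standard library only; assumes a 16-bit mono PCM WAV on a little-endian platform) reports the peak in dBFS: values at or near 0 dBFS mean the take clipped and should be re-recorded, while around −6 dBFS matches the headroom target recommended above:

```python
import array
import math
import wave

def peak_dbfs(path):
    """Return the peak level of a 16-bit mono PCM WAV in dBFS."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        samples = array.array("h", w.readframes(w.getnframes()))
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")       # silent file
    return 20 * math.log10(peak / 32768)

# Near 0 dBFS: the take almost certainly clipped; re-record it.
# Around -6 dBFS: matches the headroom target recommended above.
```

Run it on each take before uploading; it's faster than listening back for distortion and catches subtle, consistent clipping that's easy to miss by ear.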
Real-World Example
Say you’re building a streaming “History Tutor” agent similar to LMNT’s Vercel-hosted demo. You want the tutor to sound like your lead educator, but you only have a short window to record them.
Here’s the workflow:
- You sit your educator in a quiet conference room with a USB mic.
- They record a 10-second script: “Hi, I’m Alex, your history tutor. I’ll walk you through events, dates, and big ideas—and switch languages if that helps.”
- You export the recording as 16-bit PCM WAV, 48 kHz, mono, then upload it through the LMNT Playground’s Clone voice flow.
- LMNT creates a clone; you name it `Tutor-Alex`.
- In your Vercel app, you update the streaming TTS call to use the `Tutor-Alex` voice ID for all agent responses.
- Now users chat with the tutor in English, occasionally switching to Spanish or French mid-sentence; streaming responses arrive in 150–200 ms, so the back-and-forth feels like a real conversation.
You didn’t need a studio day, a bespoke TTS training pipeline, or special formats—just a clean 10-second WAV file and a few minutes in the Playground.
Pro Tip: When you’re capturing a voice you’ll use across multiple experiences (support agent, tutor, in-game narrator), record two versions: a neutral baseline and a slightly more energetic read. Clone from the neutral one for maximum versatility, then use prompt wording (and punctuation) to dial up or down the energy in different apps.
Summary
To create a voice clone in LMNT from a short audio sample, focus on three things: clean capture, standard file formats, and a simple Playground → API workflow. A 5-second, mono WAV or high-bitrate MP3 recorded at 44.1–48 kHz is enough to produce a studio-quality voice clone that can stream in 150–200 ms and speak 24 languages, including mid-sentence code-switching. Once cloned, that voice becomes just another ID you use in the Playground, in your API calls, and in forkable demos—letting you scale lifelike, low-latency voices across all your conversational apps, agents, and games.