
Can I use Tavus CVI with my own LLM and my own voice provider, and how do I switch after prototyping?
You can. Tavus CVI is built so you can start fast with Tavus defaults, then swap in your own LLM and voice stack once you’re ready to productionize. The key is understanding how Tavus’s perception → ASR → LLM → TTS → real-time AI Human pipeline is wired, and where you’re allowed to plug in your own models.
Quick Answer: You can use Tavus CVI with your own LLM and your own voice provider. Most teams prototype on the fully managed Tavus stack, then switch to custom LLM / TTS by swapping API endpoints and updating a few configuration flags, without rebuilding the video or perception layers.
The Quick Overview
- What It Is: Tavus CVI (Conversational Video Interface) is the real-time pipeline that powers Tavus AI Humans—handling video rendering, perception, speech, and dialogue so you can ship live, face-to-face agents via a single API.
- Who It Is For: Developers and product teams who want lifelike, on-screen AI Humans, but need control over the brain (LLM) and the voice (TTS/voice provider) for compliance, brand, or cost reasons.
- Core Problem Solved: You shouldn’t have to rebuild WebRTC, real-time rendering, multimodal perception, and conversation flow just to use your own models. Tavus CVI lets you plug your LLM and voice into a proven human computing stack.
How It Works
At a high level, Tavus CVI is a streaming conversational loop:
- Perception: Raven-style perception models watch and listen in real time—capturing voice, timing, and what the user is showing (screenshare, surroundings).
- Understanding & Response: Speech recognition turns audio into text, which flows into an LLM. The LLM decides what to say and what to do (agentic actions, tools, API calls).
- Expression & Rendering: A TTS system generates voice audio, and Phoenix-like rendering turns that into high-fidelity, temporally consistent facial behavior for the AI Human—synced to speech and responsive to context.
Where you customize:
- The LLM layer: route text into your own model/server instead of Tavus’s default LLM.
- The voice/TTS layer: route text into your preferred voice provider, then stream the resulting audio back into Tavus for real-time lip-sync and facial expression.
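The loop above can be sketched as a single conversational turn. This is a minimal Python sketch with dependency injection; every function name here is an illustrative stand-in, not a real Tavus API:

```python
# Illustrative sketch of the CVI streaming loop. All function names are
# hypothetical stand-ins for pipeline stages, not Tavus API calls.

def conversational_turn(audio_frame, perception, asr, llm, tts, renderer):
    """One turn of the perception -> ASR -> LLM -> TTS -> render loop."""
    context = perception(audio_frame)      # tone, timing, screen content
    transcript = asr(audio_frame)          # speech -> text
    reply_text = llm(transcript, context)  # your brain, or Tavus's default
    reply_audio = tts(reply_text)          # your voice, or Tavus's default
    return renderer(reply_audio, context)  # lip-synced AI Human output
```

The point of the shape: the `llm` and `tts` stages are plain callables, so swapping in your own provider changes which function is passed in, not the loop itself.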
A typical path looks like:

1. Phase 1 – Prototype on the default stack: Use Tavus’s built-in LLM and voice to validate UX, latency, and core flows. No infra work required; just call the Tavus APIs and embed the AI Human.
2. Phase 2 – Gradual swap to your LLM: Once your prompts, tools, and policies are stable, point Tavus’s “LLM callback” or “orchestration hook” to your own LLM endpoint. Tavus keeps handling perception, turn-taking, and rendering while your model generates the text.
3. Phase 3 – Bring your own voice provider: When you’re ready to lock in a specific brand voice or existing TTS vendor, configure Tavus to call your TTS API (or accept your audio stream), then pass the resulting audio back through CVI for real-time mouth movement and expressions.
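The only thing that changes across the three phases is which brain and voice the session config points at. A hypothetical sketch of that config delta (the field names are assumptions for illustration, not the documented Tavus schema; check the Tavus docs for the real option names):

```python
# Hypothetical session configs. Field names are illustrative only --
# consult the Tavus documentation for the actual configuration schema.

PHASE_1 = {  # fully managed prototype
    "llm": {"mode": "managed"},
    "tts": {"mode": "managed"},
}

PHASE_3 = {  # your LLM and voice provider plugged in
    "llm": {"mode": "external", "endpoint": "https://llm.example.com/chat"},
    "tts": {"mode": "external", "endpoint": "https://tts.example.com/speak"},
}

# Note what is absent: nothing about video, WebRTC, or perception changes
# between phases -- that side of the pipeline stays Tavus-managed.
```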
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Pluggable LLM brain | Routes user transcripts to Tavus’s default LLM or your custom LLM endpoint. | Keep control of reasoning, safety, and data residency while using CVI. |
| Bring-your-own voice provider | Accepts audio from your TTS/voice stack and syncs it to the AI Human. | Preserve your brand voice and existing speech vendor without losing video. |
| Real-time AI Human pipeline | Handles perception, ASR, timing, and rendering at sub-second latency. | Avoid WebRTC/rendering complexity and focus on your conversation logic. |
Using Tavus CVI With Your Own LLM
You can think of the LLM as just one module in the Tavus loop. Tavus listens, understands, hands off to a “brain,” and then expresses the result through a live AI Human on-screen.
Default Mode: Tavus-managed LLM
When you spin up an AI Human with default settings:
- Audio is captured and transcribed (ASR).
- The transcript, plus perception context (tone, timing, what’s on-screen), goes into Tavus’s internal LLM stack.
- The LLM generates the response text and any tool calls.
- TTS + Phoenix-like rendering take over to speak and animate the AI Human.
This is the fastest way to test your product idea: you don’t manage inference or prompt routing at all.
Custom Mode: Your Own LLM
To use your own LLM with Tavus CVI:
1. Enable external LLM handling: In your Tavus developer setup, configure the conversation engine so that transcripts are forwarded to your service (e.g., via webhook or streaming callback). This is effectively a “bring-your-own-brain” mode.
2. Implement your LLM endpoint: Your endpoint receives:
   - User transcript text (and optionally full conversation history)
   - Perception metadata (e.g., interruptions, sentiment, screen context) if you choose to consume it
   - Any tool state / session identifiers

   It returns:
   - Response text to be spoken
   - Optional tool actions (API calls, workflow steps)
   - Optional control signals (e.g., “respond briefly,” “yield to user,” “end call”)
3. Keep messages streaming-friendly: Tavus CVI is real-time. For the AI Human to feel present, your LLM should:
   - Stream partial tokens, so Tavus can start TTS/rendering early.
   - Respect turn-taking constraints (don’t block for long periods).
   - Handle interruptions (if the user talks over the AI, your LLM should gracefully cut or adapt its reply).
4. Handle errors and timeouts: If your LLM endpoint fails or times out, define a fallback:
   - Let Tavus’s default LLM take over for that turn, or
   - Have your service send a short “recovery” line (e.g., “Give me a second, I’m still pulling that up”) to keep the interaction human.
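The timeout-and-fallback behavior in step 4 can be sketched as a small wrapper around your model call. This is illustrative only: the payload shape, function names, and the two-second budget are assumptions, not a Tavus-defined interface:

```python
import concurrent.futures

# Hypothetical fallback line from the example above.
RECOVERY_LINE = "Give me a second, I'm still pulling that up."

def handle_turn(transcript, context, llm_call, timeout_s=2.0):
    """Call the custom LLM with a hard deadline; fall back gracefully.

    `llm_call(transcript, context)` stands in for your model endpoint.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        try:
            text = pool.submit(llm_call, transcript, context).result(timeout=timeout_s)
            return {"text": text, "fallback": False}
        except Exception:
            # Timeout or endpoint error: keep the turn human, not silent.
            return {"text": RECOVERY_LINE, "fallback": True}
```

In production you would also log the failure and, per the options above, optionally hand the next turn back to Tavus’s default LLM rather than repeating the recovery line.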
When You’d Choose Your Own LLM
You’ll likely switch to a custom LLM when you need:
- Strict data controls: e.g., healthcare/finance where logs must stay in your VPC.
- Domain finetuning: dense proprietary knowledge, tools, and workflows.
- Model choice: you’ve committed to a specific vendor or in-house model.
- Custom safety policies: you want your own filters, red teams, and guardrails.
Tavus CVI still handles the human side—perception and expression—while your LLM dictates what the AI Human says and does.
Using Tavus CVI With Your Own Voice Provider
Voice is the bridge between “this feels like a UI” and “this feels like a person.” Tavus lets you preserve the voice stack you already trust.
Default Mode: Tavus-managed Voice
By default, Tavus:
- Uses built-in TTS to turn response text into audio.
- Syncs that audio with Phoenix-style facial behavior for the AI Human.
- Optimizes for sub-second perceived latency.
You don’t need to think about sampling rates, phoneme timing, or lip-sync. It just works.
Custom Mode: Your Own TTS / Voice Provider
To use Tavus CVI with your own voice provider:
1. Configure external TTS: In your Tavus project settings or session config, switch the TTS mode from “managed” to “external” (naming may vary in docs). This tells Tavus that:
   - It should send the response text to your TTS service, or
   - It should expect audio from your system instead of generating its own.
2. Choose your integration path. There are two common patterns:
   - Pattern A – Tavus → your TTS → Tavus (recommended): Tavus sends the response text to your TTS endpoint (via API), your TTS returns audio (or a streaming audio response), and Tavus CVI takes that audio and drives real-time mouth movement and expression.
   - Pattern B – Your app orchestrates TTS: Your LLM returns text to your backend, your backend calls your TTS vendor, and your backend streams the audio to Tavus via a defined audio ingest interface.

   Pattern A is simpler if you want Tavus to orchestrate the entire conversational loop. Pattern B is better if your TTS is already deeply embedded in other systems.
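Pattern B reduces to a short orchestration step in your backend. A minimal sketch, where `tts_call` and `tavus_ingest` are injected stand-ins for your TTS vendor’s API and Tavus’s audio-ingest interface (both names are hypothetical):

```python
def pattern_b_turn(llm_text, tts_call, tavus_ingest, chunk_size=3200):
    """Pattern B sketch: your backend owns TTS and pushes audio to Tavus.

    `tts_call(text)` returns synthesized audio bytes; `tavus_ingest(chunk)`
    stands in for whatever audio-ingest interface the docs define.
    """
    audio = tts_call(llm_text)                         # synthesized audio bytes
    for start in range(0, len(audio), chunk_size):     # stream in small chunks
        tavus_ingest(audio[start:start + chunk_size])  # so rendering starts early
```

A real implementation would stream chunks as the TTS produces them instead of waiting for the full clip; the chunked loop here is the part that keeps the AI Human’s mouth moving promptly.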
3. Match audio format requirements: To keep the AI Human’s face in sync, follow Tavus’s audio constraints (see docs for exact values), typically including:
   - Supported codecs (e.g., Opus/PCM)
   - Sample rate
   - Channel configuration (mono vs. stereo)
   - Chunk size for streaming
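As a concrete illustration of the chunk-size constraint, here is how you might frame a PCM stream into fixed-duration chunks before sending it. The sample rate, bit depth, and 20 ms frame length are assumptions for the sketch; confirm the real values in the Tavus docs:

```python
# Assumed audio constraints -- confirm the actual values in the Tavus docs.
SAMPLE_RATE = 16_000      # Hz; mono 16-bit PCM assumed
BYTES_PER_SAMPLE = 2      # 16-bit samples
FRAME_MS = 20             # a common real-time streaming chunk size

def frame_pcm(pcm: bytes) -> list[bytes]:
    """Split a mono 16-bit PCM stream into fixed-duration frames."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 640 bytes
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```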
4. Optimize for latency: Real-time AI Humans only feel real if they respond at human speed. With your own voice provider:
   - Prefer low-latency, streaming TTS so Tavus can start rendering as soon as the first audio frames arrive.
   - Avoid long batching; aim for sub-second time-to-first-byte.
   - Test on real networks (4G, typical Wi‑Fi) to ensure it still feels like a conversation, not a lecture.
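Time-to-first-byte is easy to measure before you wire anything into production. A small sketch, with a hypothetical fake vendor standing in for your streaming TTS:

```python
import time

def time_to_first_chunk(stream):
    """Measure time-to-first-byte of a streaming TTS response (a generator)."""
    start = time.monotonic()
    first = next(stream)                 # blocks until the first audio chunk
    return time.monotonic() - start, first

def fake_streaming_tts(text, vendor_delay_s=0.05):
    """Hypothetical stand-in for a streaming TTS vendor."""
    time.sleep(vendor_delay_s)           # simulated synthesis start-up latency
    for word in text.split():
        yield word.encode()
```

Run this against your real vendor over the networks your users actually have; the sub-second budget applies to the first chunk arriving at Tavus, not to the full utterance finishing.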
Why Teams Bring Their Own Voice
Most teams switch to custom voice when they need:
- A consistent brand identity (matching existing IVR, ads, or voice products).
- Speaker-specific clones with their current TTS provider.
- Vendor consolidation around a chosen speech stack.
- Specific language / accent coverage that an internal engine doesn’t yet prioritize.
Tavus keeps the AI Human expressive—smiles, blinks, micropauses—while your TTS keeps the sound on-brand.
How to Switch After Prototyping: Step-by-Step
You don’t have to pick your LLM and voice architecture on day one. Most teams follow a staged migration.
Phase 1: Prototype With Fully Managed Tavus CVI
- Spin up a Tavus Developer Account.
- Use the default LLM and TTS settings.
- Embed an AI Human in your app using the standard Tavus SDK.
- Validate:
- Conversation flows
- UX and call length
- Basic latency and user reactions
At this stage, your code is mostly just:
- Creating sessions.
- Passing user audio/video.
- Receiving video stream and events.
No custom LLM or TTS calls yet.
Phase 2: Swap in Your Own LLM
When your prompts, tools, and policies are ready:
1. Turn on external LLM mode:
   - Update your Tavus integration to send transcripts to your LLM endpoint (via the webhook/streaming callback detailed in the docs).
   - Provide an authentication mechanism (API key, OAuth, etc.) for Tavus to call your endpoint.
2. Implement and test your LLM handler:
   - Confirm your LLM responds under your target latency budget.
   - Support streaming tokens if possible.
   - Ensure your handler returns clean text and any structured actions you need.
3. Run side-by-side sessions:
   - Route a portion of sessions to your LLM and keep some on Tavus’s default as a fallback.
   - Compare response quality, latency, and safety policy adherence.
4. Promote your LLM to default: Once stable, make your LLM the primary brain, with Tavus’s LLM as a backup option if you choose.
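The side-by-side routing in step 3 is a standard percentage rollout. One common sketch (hash-based bucketing, so each session stays on the same brain for its whole lifetime; the function and label names are illustrative):

```python
import hashlib

def route_session(session_id: str, custom_pct: int = 20) -> str:
    """Deterministically send a share of sessions to the custom LLM.

    Hashing the session ID keeps routing sticky per session; the
    remainder stay on Tavus's default LLM as the fallback cohort.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "custom_llm" if bucket < custom_pct else "tavus_default"
```

Ramping the rollout is then just raising `custom_pct` as your quality, latency, and safety comparisons hold up.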
Phase 3: Swap in Your Own Voice Provider
When it’s time to fully brand the voice:
1. Enable external TTS configuration:
   - In your Tavus project/session configuration, set the TTS mode to use your provider.
   - Provide the TTS endpoint and credentials.
2. Wire response text → TTS → Tavus:
   - If Tavus is orchestrating: Tavus sends text to your TTS, and your TTS returns/streams audio back.
   - If you orchestrate: your backend gets text from the LLM, calls your TTS, then streams the audio to Tavus.
3. Tune for smooth lip-sync and timing:
   - Confirm audio format compatibility.
   - Test short and long utterances, plus interruptions.
   - Check that micro-pauses, sentence breaks, and emphasis feel natural when rendered on the AI Human.
4. Gradually roll out:
   - Start with internal users or a small cohort.
   - Monitor:
     - End-to-end latency (user question → AI Human starts speaking).
     - Perceived quality (does it feel human, or slightly robotic?).
     - Reliability (audio dropouts, TTS errors, retries).
Because Tavus owns the rendering loop, you can iterate on voice and models without touching the front-end video integration.
Ideal Use Cases
- Best for teams with an existing AI stack: It lets you plug your LLM and TTS into a real-time AI Human without rebuilding video, WebRTC, or perception from scratch.
- Best for regulated or enterprise workloads: You can keep sensitive data, prompts, and logs in your own LLM infrastructure and voice provider, while relying on Tavus for enterprise-grade, real-time video performance.
Limitations & Considerations
- Custom LLM / TTS requires engineering lift: While Tavus simplifies the real-time video and perception stack, you still need to operate and scale your own LLM and TTS infrastructure. Plan for monitoring, rate limits, and failover.
- Latency budget is shared: Every component you add (custom tools, slow LLMs, heavy TTS) increases the risk of interaction delay. To keep a human feel, optimize each step: LLM inference time, TTS generation, and network hops.
Pricing & Plans
Tavus pricing for CVI is structured around usage (interactions/minutes) and deployment tier, with different paths for builders versus individuals.
For this specific “bring your own LLM and voice” setup, you’ll want a Developer Account:
- It exposes the APIs, callbacks, and configuration options needed to plug in your own models.
- It’s designed for products embedding white-labeled AI Humans into apps and workflows.
Your LLM and TTS vendor costs are separate—you pay those providers directly. Tavus focuses on the real-time AI Human and CVI pipeline.
- Developer Account: Best for engineers, founders, and product teams needing APIs, white-labeling, and integration with their own LLM/voice stack.
- PALs Account: Best for individuals who just want a personal AI companion and don’t need to swap out underlying models or providers.
(For the most current pricing and feature breakdown, check the Tavus site or dashboard—plans and limits can evolve as the platform expands.)
Frequently Asked Questions
Can I use Tavus CVI with my own LLM from day one, or do I have to start with the default?
Short Answer: You can wire in your own LLM from the start, but most teams prototype on the default LLM first for speed.
Details:
If you already have a production LLM stack, you can immediately configure Tavus CVI to send transcripts to your endpoint and use your responses for the AI Human. That said, the quickest way to validate UX, latency, and basic flows is to start on the Tavus-managed LLM and flip to your own model once your prompts, tools, and safety policies are ready. The switch is mostly a configuration and endpoint change, not a full re-architecture.
Will switching to my own TTS or voice provider break lip-sync or facial expressions?
Short Answer: No, as long as you follow Tavus’s audio format and streaming guidelines.
Details:
Tavus’s rendering is designed to treat your TTS audio as a first-class input. Phoenix-style facial behavior models consume the audio stream and generate temporally consistent mouth movements, eye contact, and micro-expressions in real time. If your TTS respects the expected codecs, sample rates, and streaming behavior, the AI Human will stay in sync and expressive. Any desync issues you see are usually latency-related (large buffering, non-streaming TTS) and can be fixed by switching to streaming or tuning chunk sizes.
Summary
You don’t have to choose between “fully managed black-box AI” and “building your own video agent from scratch.” Tavus CVI is the middle path: a real-time, face-to-face AI Human pipeline that you can plug your own LLM and voice provider into.
Prototype fast on Tavus’s defaults. When you’re ready, point the LLM callbacks to your own model, and route response text through your preferred TTS. Tavus keeps handling perception, timing, rendering, and sub-second presence so your users feel like they’re talking to a real person, not a chatbot wearing a face.