
How do I start a real-time conversation using Tavus CVI as a developer—what are the first steps?
You’re a developer, you want a real-time face-to-face AI Human in your app, and you’ve heard Tavus CVI is how you get there. The good news: the first steps are straightforward if you think in terms of one conversation pipeline—create an agent, connect in real time, and start streaming audio and video at the speed of human interaction.
Quick Answer: Tavus CVI lets you spin up a real-time AI Human by creating a developer account, configuring an agent, and opening a live WebRTC/WebSocket (WS) session where you stream user audio/video in and receive lifelike video, voice, and responses back with sub-second latency. As a developer, your first steps are setting up auth, choosing how you’ll embed the session, and wiring the perception → ASR → LLM → TTS → rendering loop into your product.
The Quick Overview
- What It Is: Tavus CVI is the real-time conversational video interface that powers Tavus AI Humans—live, face-to-face agents that can see, hear, and respond like a person inside your product.
- Who It Is For: Developers, founders, and teams who want to embed white-labeled, human-like AI experiences into apps, workflows, or enterprise systems without building the full perception and rendering stack themselves.
- Core Problem Solved: Most “AI assistants” are just disembodied chat. Tavus CVI solves the presence gap—making your agent feel like a live human on a call, with expressive facial behavior, timing that matches real conversation, and perception of tone, screenshare, and surroundings.
How It Works
At a systems level, starting a real-time conversation with Tavus CVI is about wiring into a single interactive loop:
- Your client sends live audio (and optionally video/screenshare) from the user.
- Tavus handles perception and understanding:
  - Perception (Raven-1): Sees the scene, reads expressions, tracks attention.
  - Speech Recognition: Transcribes the user’s speech in real time.
  - LLM Orchestration: Determines the best response and next action.
  - TTS + Phoenix-4 Rendering: Speaks and animates the AI Human with lifelike, temporally consistent facial behavior.
- Your app receives low-latency video and audio back and renders the AI Human in a face-to-face call.
From your perspective, it boils down to:
- Account + API Setup
- Create / Configure an AI Human
- Start a Real-Time Session (WebRTC/WS)
- Stream Media + Messages
- Handle Responses and End the Call
Below is how those phases play out in practice.
1. Set up your Developer Account and API access
Before you can start any real-time conversation, you need a Tavus Developer Account.
- Create a Developer Account
  - Go to the Tavus platform sign-up: https://platform.tavus.io/auth/sign-up?is_developer=true
  - Choose Developer Account (not PALs).
  - Complete onboarding steps (email verification, org setup as prompted).
- Get your API keys / credentials
  - Once in the dashboard, navigate to your developer settings / API access.
  - Generate a secret API key and/or any client credentials required for:
    - REST calls to the Tavus backend.
    - Authenticated real-time connections.
  - Store secrets securely (e.g., environment variables, secret manager). Never expose them in client-side code.
- Decide where your “call” lives
  You have two basic integration patterns:
  - Web / SPA integration: Embed a Tavus-powered video component in your browser app (React, Vue, plain JS) using WebRTC, with your backend handling auth/token issuance.
  - Native / server-side control: Manage sessions from a backend service while clients connect over WebRTC/WS to a URL or token you issue.
At this stage, you should have:
- A developer account,
- At least one API key, and
- A clear idea of where the real-time session will be initiated from (web, mobile, desktop).
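As a minimal sketch of the "store secrets securely" step above, the server-side helper below reads the key from an environment map and fails fast when it is missing. The variable name `TAVUS_API_KEY` and the `x-api-key` header name are assumptions made for illustration; confirm the actual names in the Tavus API reference before relying on them.

```typescript
// Sketch only: TAVUS_API_KEY and the "x-api-key" header are assumed names.
type Env = Record<string, string | undefined>;

// Read the API key from an environment-like map and fail fast if it is absent.
function loadApiKey(env: Env, name: string = "TAVUS_API_KEY"): string {
  const key = env[name]?.trim();
  if (!key) {
    throw new Error(`Missing ${name}; set it in the server environment or a secret manager.`);
  }
  return key;
}

// Build the headers for server-to-server REST calls. Never send these from a browser.
function authHeaders(apiKey: string): Record<string, string> {
  return {
    "Content-Type": "application/json",
    "x-api-key": apiKey, // assumed header name
  };
}
```

On a Node backend you would typically call `loadApiKey(process.env)` once at startup, so a missing key stops the deploy instead of failing on the first user call.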
2. Create and configure your AI Human
Next, you define who the user is talking to and how that AI Human behaves.
- Create an AI Human / agent
  - Use the Tavus dashboard or API to create a new AI Human (sometimes called an agent or video agent).
  - Configure:
    - Persona: Name, role (e.g., “Product Specialist,” “Onboarding Guide,” “Tutor”).
    - Voice: Language, accent, tone.
    - Visual identity: Face / style, if the product supports multiple looks.
    - Knowledge / context: Base instructions, allowed tools/APIs, and any proprietary knowledge you want it to use.
- Set system behavior and prompt scaffolding
  Typical options you’ll configure:
  - System message / base instructions
    Example: “You are a patient, friendly AI Human who helps users onboard to our product. You’re concise, ask clarifying questions, and never share internal IDs. You can see the user’s screen and body language and should acknowledge confusion or frustration explicitly.”
  - Allowed actions / tools
    - Call internal APIs (e.g., CRM, support, scheduling).
    - Trigger side effects (e.g., “Sends that email,” “Moves your meeting,” “Creates a support ticket”).
  - Safety / compliance constraints suitable for your domain.
- Configure language and latency requirements
  - Set the supported languages (Tavus supports 30+ languages).
  - Make sure any latency-sensitive options are enabled or set to defaults designed for sub-second turn-taking.
When you’re done, you’ll have:
- An agent ID
- A configuration that defines personality, behavior, and capabilities
You’ll use that agent ID when creating real-time sessions.
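It can help to keep the agent configuration described above in code, so it is reviewed and validated like any other config. The shape below is a hypothetical illustration: the field names (`personaName`, `systemPrompt`, `allowedTools`, and so on) are not the Tavus schema, and you would map them to whatever the dashboard or API actually expects.

```typescript
// Hypothetical agent configuration shape; field names are illustrative only.
interface AgentConfig {
  personaName: string;
  role: string;
  language: string;       // language tag such as "en" or "pt-BR"
  systemPrompt: string;
  allowedTools: string[]; // names of actions the agent may trigger
}

// Return a list of human-readable problems; an empty list means the config looks sane.
function validateAgentConfig(cfg: AgentConfig): string[] {
  const problems: string[] = [];
  if (!cfg.personaName.trim()) problems.push("personaName is empty");
  if (!cfg.systemPrompt.trim()) problems.push("systemPrompt is empty");
  if (!/^[a-z]{2,3}(-[A-Z]{2})?$/.test(cfg.language)) {
    problems.push(`language "${cfg.language}" is not a tag like "en" or "pt-BR"`);
  }
  cfg.allowedTools.forEach((tool, i) => {
    // indexOf returns the first position, so a later repeat is a duplicate.
    if (cfg.allowedTools.indexOf(tool) !== i) problems.push(`duplicate tool "${tool}"`);
  });
  return problems;
}

const onboardingGuide: AgentConfig = {
  personaName: "Onboarding Guide",
  role: "Product Specialist",
  language: "en",
  systemPrompt: "You are a patient, friendly AI Human who helps users onboard to our product.",
  allowedTools: ["create_support_ticket", "reschedule_meeting"],
};
```

Running `validateAgentConfig(onboardingGuide)` before you push a config change catches empty prompts or repeated tool names early, before they reach a live conversation.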
3. Start a real-time conversation session
Now you’re ready to actually “call” your AI Human.
At a high level, starting a real-time Tavus CVI conversation looks like:
- Create a session through Tavus’s API.
- Obtain a connection token / session descriptor.
- Establish a real-time connection from the client via WebRTC (media) plus WebSocket (control), depending on the SDK.
A typical flow:
- Backend: Create a session
  - Make an authenticated call from your server to Tavus to create a real-time conversation session:
    - Include the agent ID.
    - Optionally include initial context (user ID, account tier, current screen, etc.).
  - Receive:
    - A session ID and
    - A client token or connection URL for your front end.
- Frontend: Connect to the session
  - Your app requests a token from your backend.
  - With that token, the client:
    - Creates a WebRTC peer connection to Tavus’s media servers.
    - Optionally establishes a WebSocket for events and control messages (start/stop speaking, meta info, etc.).
  - Once ICE negotiation is done, you have a live, low-latency media channel.
- Attach media streams
  - From the client:
    - Capture microphone (and camera/screenshare if desired).
    - Add those tracks to the WebRTC connection.
  - From Tavus:
    - Receive video of the AI Human rendered by Phoenix-4.
    - Receive audio from TTS, timed and managed by Sparrow-1 so responses feel natural and interruptible.
At this point, you have a live face-to-face call between your user and a Tavus AI Human.
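The backend half of the flow above can be sketched as a pure request builder plus a mapper for what the browser is allowed to see. Everything specific here is a placeholder, not the documented Tavus API: the `/sessions` path, the `x-api-key` header, and the `agent_id`, `context`, `session_id`, and `client_token` field names are assumptions chosen to mirror the steps in this article.

```typescript
// Sketch of the server-side "create session" step. Path, header, and JSON field
// names are placeholders; replace them with the values from the API reference.
interface CreateSessionInput {
  apiBaseUrl: string;               // base URL of the API, taken from your config
  apiKey: string;                   // secret key; stays on the server
  agentId: string;
  context?: Record<string, string>; // optional initial context (user ID, tier, screen)
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildCreateSessionRequest(input: CreateSessionInput): HttpRequest {
  if (!input.agentId) throw new Error("agentId is required");
  const base = input.apiBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/sessions`, // hypothetical path
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": input.apiKey },
    body: JSON.stringify({ agent_id: input.agentId, context: input.context ?? {} }),
  };
}

// Assumed response shape from the create-session call.
interface SessionResponse {
  session_id: string;
  client_token: string;
}

// Forward only session-scoped values to the front end, never the API key.
function toClientPayload(res: SessionResponse): { sessionId: string; token: string } {
  return { sessionId: res.session_id, token: res.client_token };
}
```

In a real handler you would send the built request with something like `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, parse the JSON, and return `toClientPayload(...)` to the browser. Keeping request construction separate from transport makes the payload easy to unit test.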
Features & Benefits Breakdown
| Core Feature | What It Does | Primary Benefit |
|---|---|---|
| Real-Time AI Humans | Streams lifelike video and voice of an AI Human over WebRTC with sub-second latency. | Gives users the feeling of talking to a real person, not a chatbot wearing a face. |
| Multimodal Perception | Uses Raven-1 to interpret voice, expressions, and on-screen context. | Lets the agent respond to tone, micro-expressions, and what’s on-screen, not just text. |
| Conversation-Oriented Stack | Orchestrates perception → ASR → LLM → TTS → Phoenix-4 rendering in one pipeline. | Keeps the entire conversation adaptive, expressive, and fluid at the speed of human interaction. |
Ideal Use Cases
- Best for interactive product experiences: Because it lets you embed an AI Human directly inside your app to guide onboarding, support, or training in real time, with the agent actually “seeing” the user’s screen and reactions.
- Best for enterprise workflows and sales flows: Because you can deploy AI SDRs, support reps, or trainers that maintain consistent quality, respond in 30+ languages, and integrate with your stack (CRM, ticketing, internal APIs) while still feeling like a live human on a call.
Limitations & Considerations
- Not a prerecorded video system: Tavus CVI is engineered for live, real-time interaction. If your primary need is batch, asynchronous video generation, you’ll be stretching the stack beyond what it’s optimized for.
- Real-time media requirements: Because this is genuine low-latency WebRTC, you should plan for:
  - Stable network conditions,
  - Browser/device support and permissions,
  - Backend auth and token management.
  Building robust reconnection and error handling is part of doing human-quality presence at scale.
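One common way to approach the reconnection handling mentioned above is exponential backoff with jitter. This is a generic technique, not something Tavus prescribes; the base delay, cap, and attempt count below are arbitrary starting points, and `connect` and `sleep` are caller-supplied functions.

```typescript
// Exponential backoff with full jitter: the ceiling doubles on each attempt
// (capped at maxMs) and the actual delay is picked uniformly below it.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  maxMs: number = 15000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Try to connect up to maxAttempts times, sleeping between failed attempts.
// Rethrows the last error if every attempt fails.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 5,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Jitter matters when many clients drop at once (for example after a network blip): without it they would all retry on the same schedule and hit your token endpoint together.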
Pricing & Plans
Tavus keeps plans oriented around how you use the system:
- Developer Accounts: Best for developers, founders, and teams integrating Tavus into a product. You’ll use APIs, real-time sessions, and white-labeled AI Humans embedded in your app. Pricing typically reflects interaction volume, concurrency, and enterprise features (uptime guarantees, support).
- PALs Accounts: Best for individuals looking to talk, explore, and connect with a personal AI companion. PALs are your always-present AI friend, not a white-labeled integration; they’re better suited for personal use than for building a product.
For detailed pricing and enterprise options, start with a developer account and then talk to Tavus about scaling, SLAs, and dedicated support for your use case.
Frequently Asked Questions
Do I need to be a WebRTC expert to start a real-time conversation with Tavus CVI?
Short Answer: No—but you should be comfortable with basic real-time media concepts and follow the provided SDK patterns.
Details:
Tavus handles the hard parts of rendering, perception, and timing. On your side, you’ll:
- Obtain a session token via your backend.
- Initialize the client connection using Tavus’s SDKs or example snippets.
- Attach microphone (and optional camera/screenshare) tracks.
- Render the incoming video and audio in your UI.
If you’ve ever integrated a WebRTC video call SDK before, the pattern will feel familiar—just with an AI Human on the other side instead of another user. If you haven’t, you’ll still be able to follow the standard “getUserMedia → connect → attach tracks” pattern.
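The "getUserMedia → connect → attach tracks" pattern can be sketched with small helpers. The interfaces below are structural subsets of the browser's `MediaStreamTrack`, `MediaStream`, and `RTCPeerConnection`, declared locally so the helpers can be tested outside a browser; the signaling step (how the offer/answer is exchanged with Tavus) depends on the SDK and is deliberately left out.

```typescript
// Minimal structural views of the browser WebRTC types used by this sketch.
interface TrackLike { kind: string }
interface StreamLike { getTracks(): TrackLike[] }
interface PeerLike { addTrack(track: TrackLike, stream: StreamLike): unknown }

// Step 1: decide what to capture. Audio is always on; camera is optional.
function buildMediaConstraints(opts: { camera: boolean }): { audio: boolean; video: boolean } {
  return { audio: true, video: opts.camera };
}

// Step 3: add every captured track to the peer connection.
// Returns the kinds that were attached, which is handy for logging.
function attachLocalTracks(pc: PeerLike, stream: StreamLike): string[] {
  const kinds: string[] = [];
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream);
    kinds.push(track.kind);
  }
  return kinds;
}

// In a browser, the three steps fit together roughly like this (not executed here;
// `videoEl` is assumed to be a <video> element in your UI):
//   const stream = await navigator.mediaDevices.getUserMedia(buildMediaConstraints({ camera: true }));
//   const pc = new RTCPeerConnection();
//   pc.ontrack = (e) => { videoEl.srcObject = e.streams[0]; }; // render the AI Human
//   attachLocalTracks(pc, stream);
```

If you use a Tavus-provided SDK or embed, most of this is likely handled for you; the sketch is mainly useful for understanding what the SDK does on your behalf.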
How do I pass context (like user ID or screenshare details) into a Tavus CVI conversation?
Short Answer: You include context when you create the session and via real-time messages during the call.
Details:
When your backend creates a Tavus session, you can:
- Attach user identifiers and metadata (plan tier, prior interactions, CRM IDs).
- Provide initial instructions or context (“User is on billing settings page,” “User looks frustrated based on prior signals”).
- Use Tavus’s real-time messaging channel (usually alongside WebRTC) to:
  - Send updates when the user changes screens.
  - Notify the AI Human of important events (error states, timeouts).
  - Trigger agentic actions (e.g., “Offer to reschedule meeting,” “Open a support ticket”).
This lets the AI Human talk as if it’s inside your app with the user—because, in effect, it is.
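For the in-call updates described above, a typed event union keeps the messages your app sends consistent. The event names and envelope fields here (`screen_changed`, `app_error`, `suggest_action`, `sessionId`, `sentAt`) are hypothetical; the real message format is defined by the Tavus messaging channel, so treat this as a sketch of the pattern only.

```typescript
// Hypothetical context events an app might send during a call.
type ContextEvent =
  | { type: "screen_changed"; screen: string }
  | { type: "app_error"; code: string; message: string }
  | { type: "suggest_action"; action: string };

// Envelope that ties an event to a session and a timestamp.
interface ContextEnvelope {
  sessionId: string;
  sentAt: string; // ISO 8601 timestamp
  event: ContextEvent;
}

// Serialize an event for the control channel. `now` is injectable for testing.
function encodeContextEvent(sessionId: string, event: ContextEvent, now: Date = new Date()): string {
  const envelope: ContextEnvelope = { sessionId, sentAt: now.toISOString(), event };
  return JSON.stringify(envelope);
}
```

With an open WebSocket named `socket`, sending an update would look like `socket.send(encodeContextEvent(sessionId, { type: "screen_changed", screen: "billing_settings" }))`. The discriminated union means the compiler rejects an event with a misspelled `type` or a missing field.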
Summary
Starting a real-time conversation with Tavus CVI as a developer is about plugging into a single, human-speed loop: you stream audio and context in, and Tavus sends back a lifelike AI Human that sees, hears, and responds like a person. The first steps are:
- Create a Developer Account and secure API access.
- Define your AI Human—persona, knowledge, and behavior.
- Spin up a real-time session and connect via WebRTC.
- Stream user audio/video in and render the AI Human’s video/voice out.
From there, you layer on multimodal context, actions, and integrations until your users forget they’re talking to software.