How do I start a real-time conversation using Tavus CVI as a developer—what are the first steps?
AI Video Agents

10 min read

You don’t start Tavus CVI by wiring up WebRTC from scratch. You start by telling Tavus who your AI Human should be, what context it should have, and how your app will connect to it in real time. From there, you plug into a single API and let Tavus handle perception → speech recognition → LLM → TTS → real-time video rendering at the speed of human interaction.

Quick Answer: As a developer, your first steps with Tavus CVI are: create a Developer Account, define or select your AI Human, obtain API credentials, and establish a real-time session (usually via WebRTC or a Tavus-provided SDK) that streams audio/video in and receives lifelike video + audio back.


The Quick Overview

  • What It Is: Tavus CVI (Conversational Video Interface) is the real-time layer that lets you embed face-to-face AI Humans into your product—over WebRTC, in the browser, or in your own native app.
  • Who It Is For: Developers and teams who want human-like, multimodal agents (not just chatbots with a UI skin) that can see, hear, and respond in real time.
  • Core Problem Solved: Most AI agents can answer questions, but they can’t hold a real conversation. Tavus CVI solves the “presence gap”—timing, micro-expressions, screenshare context, and real-time back-and-forth—so your AI Human feels like someone sitting across from you, not a voice in a box.

How It Works

At a high level, Tavus CVI is a real-time pipeline you connect to from your app. You send the agent live inputs; Tavus sends back a fully rendered AI Human—voice, face, expressions, and timing tuned to the conversation.

Under the hood, the loop looks like this:

  1. Perception: Your user streams audio, video, and optionally screenshare to Tavus. Raven‑1 handles what the agent “sees”—objects, faces, emotions, and where attention should go.
  2. Understanding & Dialogue: Tavus converts speech to text, passes it through your configured LLM and policies, and uses Sparrow‑1 to orchestrate when to speak, when to listen, and how to keep turn-taking sub‑second.
  3. Rendering in Real Time: Phoenix‑4 converts the agent’s response into finely tuned facial behavior and expressive video, aligned with generated speech, and streams it back over your established connection.

From a developer’s perspective, your first steps are:

  1. Create a Tavus Developer Account.
  2. Define or select an AI Human (persona, voice, visual identity).
  3. Use Tavus APIs/SDKs to:
    • Request a real-time conversation session (token/room).
    • Join that session with your frontend or client.
    • Start streaming and handling events.

Step 1: Create Your Tavus Developer Account

Your entry point is a Developer Account, not a PALs companion.

  1. Go to the Tavus platform:
    https://platform.tavus.io/auth/sign-up?is_developer=true
  2. Sign up as a Developer:
    • Intended use: integrating AI Humans into a product or workflow.
    • You’ll get access to:
      • API keys / credentials
      • Developer dashboard
      • Sample projects and configuration panels
  3. Secure and store your API keys (server-side). You’ll use these to:
    • Create sessions
    • Configure agent behavior
    • Manage usage and observability
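
To make the "server-side only" rule concrete, here is a minimal sketch (the TAVUS_API_KEY variable name and helper are illustrative, not part of any Tavus SDK) that loads the key from the environment and refuses to start without it:

```typescript
// Illustrative only: reads a hypothetical TAVUS_API_KEY environment variable.
// The key stays on the server; never ship it to a browser or mobile client.
function requireTavusApiKey(env: Record<string, string | undefined>): string {
  const key = env.TAVUS_API_KEY;
  if (!key || key.trim() === "") {
    // Fail fast at startup rather than at the first API call
    throw new Error("TAVUS_API_KEY is not set; refusing to start");
  }
  return key;
}
```

In a Node backend you would call this once at boot with `process.env`, so a misconfigured deployment fails immediately and loudly.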

Step 2: Define Your AI Human

You don’t just spin up a “generic agent.” You define an AI Human that fits your use case.

Typical configuration steps:

  1. Persona & Role

    • Example: “AI SDR for inbound demos,” “Onboarding coach,” “Support triage agent.”
    • Define tone, domain knowledge, and boundaries via system prompts and policies.
  2. Visual Identity & Behavior

    • Choose or configure:
      • Face / appearance
      • Default expressions and emotional range
      • Gesture style (calm, animated, instructional)
    • Phoenix‑4 ensures temporally consistent expressions—smiles that land on the punchline, concern when the user’s tone shifts.
  3. Voice & Language

    • Set:
      • Voice timbre and speaking style
      • Supported languages (Tavus supports 30+)
      • Speaking pace vs. responsiveness
  4. Context & Tools

    • Connect to:
      • Your APIs or knowledge bases
      • CRM or helpdesk systems
      • Calendars, email, or Google Workspace (formerly G Suite) if the AI Human needs to “do” things (send emails, move meetings, log notes).

Once configured, this AI Human becomes a reusable agent profile you can attach to any real-time session.
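
As a rough sketch of what such a reusable profile might look like in code (the AiHumanProfile shape and every field name here are hypothetical, not the actual Tavus schema):

```typescript
// Hypothetical shape for an AI Human profile; field names are illustrative,
// not the actual Tavus configuration schema.
interface AiHumanProfile {
  persona: { role: string; systemPrompt: string };
  visual: { face: string; gestureStyle: "calm" | "animated" | "instructional" };
  voice: { language: string; pace: "measured" | "snappy" };
  tools: string[]; // integrations the agent may call (CRM, calendar, ...)
}

// Example: the "Onboarding coach" persona from the list above
const onboardingCoach: AiHumanProfile = {
  persona: {
    role: "Onboarding coach",
    systemPrompt: "Guide new users through setup; stay within product scope."
  },
  visual: { face: "default", gestureStyle: "instructional" },
  voice: { language: "en-US", pace: "measured" },
  tools: ["helpdesk", "calendar"]
};
```

The point of the structure is the reuse: one profile object, attached to as many real-time sessions as you like.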


Step 3: Request a Real-Time Conversation Session

To start a real-time conversation from your app, you first ask Tavus for a “room” or session the client can join.

In pseudocode:

POST /v1/realtime/sessions
Authorization: Bearer YOUR_SERVER_API_KEY
Content-Type: application/json

{
  "agent_id": "your-ai-human-id",
  "metadata": {
    "user_id": "123",
    "conversation_use_case": "product_demo"
  }
}

Typical response:

{
  "session_id": "sess_abc123",
  "token": "client_join_token",
  "expires_at": 1712345678,
  "signaling_url": "wss://realtime.tavus.io/signal"
}

You’ll pass the token plus signaling / server URLs to your frontend or client, which actually joins the conversation.

Developer pattern: your backend issues this session and returns only the client-safe token and URL to the browser/app. The server key stays off client devices.
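
This backend pattern can be sketched as a small sanitizer (field names mirror the pseudocode response above; this is not a confirmed Tavus API) that checks expiry and strips everything the browser doesn't need:

```typescript
// Shape of the session response from the pseudocode above (illustrative)
interface SessionResponse {
  session_id: string;
  token: string;
  expires_at: number; // Unix seconds
  signaling_url: string;
}

// Returns only the client-safe fields; server-only data never reaches the browser
function toClientPayload(session: SessionResponse, nowSeconds: number) {
  if (session.expires_at <= nowSeconds) {
    throw new Error("session token already expired; request a new one");
  }
  return {
    token: session.token,
    signalingUrl: session.signaling_url,
    expiresAt: session.expires_at
  };
}
```

Your route handler calls the Tavus API with the server key, passes the result through a function like this, and returns only the sanitized payload to the client.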


Step 4: Connect Your Client (Web, Mobile, or Native)

Here’s where you wire your UI to Tavus CVI. You’re essentially creating a WebRTC-style call with an AI Human on the other side.

You will typically:

  1. Initialize the Tavus Client SDK

    • In the browser (JS/TS), mobile, or desktop.
    • Provide:
      • session_id or token
      • Signaling URL
      • Any UI callbacks (onTrack, onMessage, onError)
  2. Capture User Media

    • Request:
      • Microphone
      • Camera
      • Optional: screenshare
    • Send those streams to Tavus so Raven‑1 can see and hear the user.
  3. Receive the AI Human’s Streams

    • Subscribe to:
      • Downstream video track (Phoenix‑4 output)
      • Downstream audio track
    • Attach them to a <video> element, native video view, or custom renderer.

Example JS-like flow (pseudo):

import { TavusClient } from '@tavus/realtime';

const client = new TavusClient({
  token: clientJoinToken,          // client-safe token minted by your backend
  signalingUrl: 'wss://realtime.tavus.io/signal'
});

await client.connect();

// Capture the user's mic and camera so Raven-1 can see and hear them
const userStream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true
});

client.addLocalStream(userStream);

// Render the AI Human (Phoenix-4 output) when its media arrives
client.on('remoteStream', (stream) => {
  const videoEl = document.getElementById('tavus-video');
  videoEl.srcObject = stream;
  // play() can reject under browser autoplay policies; trigger playback
  // from a user gesture (e.g. a "Join call" button) to be safe
  videoEl.play().catch(console.error);
});

Now your user is effectively on a video call with your AI Human.
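
Permission prompts fail often enough that a graceful fallback is worth sketching. In this illustrative helper, the getMedia parameter is an assumption made for testability; in the browser you would pass a wrapper around navigator.mediaDevices.getUserMedia:

```typescript
// getUserMedia is injected so the fallback logic can run outside a browser.
// In the browser: captureUserMedia((c) => navigator.mediaDevices.getUserMedia(c))
type GetMedia = (constraints: { audio: boolean; video: boolean }) => Promise<unknown>;

async function captureUserMedia(getMedia: GetMedia): Promise<unknown> {
  try {
    // Preferred path: both mic and camera
    return await getMedia({ audio: true, video: true });
  } catch {
    // Camera denied or unavailable: fall back to audio-only so the call survives
    return await getMedia({ audio: true, video: false });
  }
}
```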


Step 5: Manage Conversation Flow & Events

Tavus CVI is real time, so you’re handling events, not polling APIs.

Common events you’ll care about:

  • Conversation State

    • on('agentSpeaking') / on('agentIdle') to sync UI.
    • on('transcript') for live ASR transcripts (both sides).
  • User Interactions

    • Button clicks (“show me another product”), toggles (mute mic/cam).
    • Screen shares: pass a screenshare track so the agent can react to what’s on screen.
  • Agentic Actions

    • When the AI Human decides to send an email or book a meeting, you’ll get:
      • Webhooks to your backend
      • Real-time client events (“meetingScheduled”, “emailDrafted”)
  • Error / Disconnect Handling

    • on('disconnected'), on('error') to clean up UI and optionally re-establish a session.
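
One way to sketch the event handling above is a small UI state reducer. The event names follow the examples in this section and are illustrative, not a confirmed SDK contract:

```typescript
// Conversation events, modeled after the examples in this section (illustrative)
type ConversationEvent =
  | { type: "agentSpeaking" }
  | { type: "agentIdle" }
  | { type: "transcript"; speaker: "user" | "agent"; text: string }
  | { type: "disconnected" };

interface UiState {
  agentTalking: boolean;
  connected: boolean;
  transcript: string[];
}

// Pure reducer: each event maps to the next UI state your components render
function reduce(state: UiState, event: ConversationEvent): UiState {
  switch (event.type) {
    case "agentSpeaking":
      return { ...state, agentTalking: true };
    case "agentIdle":
      return { ...state, agentTalking: false };
    case "transcript":
      return { ...state, transcript: [...state.transcript, `${event.speaker}: ${event.text}`] };
    case "disconnected":
      return { ...state, connected: false, agentTalking: false };
  }
}
```

Keeping the event-to-state mapping in one pure function makes it easy to test and to swap UI frameworks without touching the conversation logic.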

You decide how “agentic” your AI Human should be:

  • Passive: answer questions, guide users.
  • Proactive: share recommendations, nudge users, ask clarifying questions.
  • Agentic: trigger your APIs, write to CRMs, move meetings.

Features & Benefits Breakdown

Core Feature / What It Does / Primary Benefit:

  • Real-Time, Face-to-Face AI Humans: streams lifelike video and audio from Phoenix‑4 in sync with the conversation. Benefit: users feel like they’re talking to a person, not a disembodied bot.
  • Multimodal Perception (Voice, Video, Screenshare): Raven‑1 interprets tone, micro‑expressions, and on-screen content in real time. Benefit: the AI Human can react to what users say, show, and feel, not just what they type.
  • Sub-Second Turn-Taking & Dialogue Control: Sparrow‑1 orchestrates when to listen vs. speak and how long to pause. Benefit: conversations flow like human dialogue, with minimal latency and few awkward overlaps.

Ideal Use Cases

  • Best for real-time product demos: Because Tavus CVI lets your AI SDR see the user’s screen, respond to objections, and navigate your product in a natural, face-to-face conversation—24/7, in 30+ languages.
  • Best for guided onboarding and support: Because the AI Human can watch users click through your app, notice confusion in tone or body language, and adapt its coaching in real time instead of sending static help articles.

Limitations & Considerations

  • Network Quality Matters: Tavus is engineered for sub-second latency, but you still need reasonably stable bandwidth and low jitter. For production, implement network checks and fallbacks (e.g., audio-only mode) for poor connections.
  • Client Integration Required: Tavus hides the hard parts of multimodal AI, but you still have to integrate media capture, permissions, and UI. Plan for a bit more client work than a pure text chatbot, especially around WebRTC and permissions UX.
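
The audio-only fallback suggested above can be sketched as a simple decision function. The thresholds are arbitrary placeholders, not Tavus recommendations; tune them against your own metrics:

```typescript
// Simple connection stats your client can sample (e.g. from WebRTC getStats)
interface NetworkStats {
  downlinkKbps: number;
  rttMs: number;
  jitterMs: number;
}

type MediaMode = "audio_video" | "audio_only";

// Drops to audio-only when the connection looks too weak for video.
// Placeholder thresholds: real values depend on your codecs and latency targets.
function chooseMediaMode(stats: NetworkStats): MediaMode {
  const tooSlow = stats.downlinkKbps < 500;
  const tooLaggy = stats.rttMs > 400 || stats.jitterMs > 80;
  return tooSlow || tooLaggy ? "audio_only" : "audio_video";
}
```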

Pricing & Plans

Tavus pricing depends on interaction volume, real-time requirements, and enterprise guarantees, but the entry point is always the same: a Developer Account you can start with today.

Typical structure:

  • Developer / Builder Plan: Best for developers and early-stage teams needing to prototype and ship real-time AI Humans quickly. Ideal if you’re experimenting with Tavus CVI, building a POC, or integrating into a single product surface.
  • Enterprise Plan: Best for larger teams needing scale, uptime commitments, and security/compliance support—think deploying AI Humans across sales, support, and success with enterprise uptime guarantees and performance SLAs.

To see current pricing and get aligned on usage, start with a Developer Account, then talk to Tavus if you’re planning high-volume or multi-team deployment.


Frequently Asked Questions

How do I start my first real-time Tavus CVI conversation in development?

Short Answer: Create a Developer Account, configure an AI Human, request a real-time session from your backend, and connect to that session from your frontend using Tavus’s client SDK or WebRTC integration.

Details: Once your Developer Account is set up, you’ll define your agent (persona, voice, tools) in the Tavus console. Your server then calls the Tavus API to create a real-time session for that agent, receiving a client-safe token and signaling info. Your web or mobile client uses that token to join the session, streaming mic/camera (and optionally screenshare) and rendering the remote AI Human stream in a video element. From there, you handle conversation events (transcripts, agent speaking, actions) to drive your product logic.

Do I need deep WebRTC expertise to use Tavus CVI?

Short Answer: Not necessarily. Tavus abstracts most of the real-time signaling, but you still need to handle basic media capture and UI.

Details: If you’ve built a simple video call UI before, you’ll feel at home. Tavus provides SDKs or example code that handle signaling, ICE, and track orchestration. You’re responsible for getting user permissions for mic/cam, managing device selection, attaching incoming media streams to your UI, and managing reconnection logic for unstable networks. For more advanced setups (multi-participant sessions, custom SFU routing), Tavus can still fit into your existing real-time stack, but that’s an incremental step after you’ve validated your initial CVI integration.


Summary

Starting a real-time conversation with Tavus CVI as a developer isn’t about wiring up low-level pipelines; it’s about plugging your app into a real-time AI Human that already knows how to see, hear, and respond like a person. Your first steps are simple: create a Developer Account, define your AI Human, request a real-time session from your backend, and connect to it from your client. Tavus takes care of the perception → speech → reasoning → rendering loop so you can focus on the experience you’re building—demos that sell themselves, onboarding that feels like a coach, and support that looks you in the eye.


Next Step

Get Started