How do I enable speaker diarization and word-level timestamps in Gladia’s async transcription API?

Bad speech-to-text (STT) doesn’t just mangle words; it breaks your entire workflow. When your transcript can’t reliably say who spoke when, or where a key sentence sits in the audio, your notes, summaries, and CRM syncs all drift away from what was actually said. Speaker diarization and word-level timestamps in Gladia’s async transcription API are designed to prevent exactly that.

Quick Answer: Enable speaker diarization and word-level timestamps in Gladia’s async transcription API by passing the relevant options in your transcription request payload. You control both features at request time, so you can turn them on per-job without changing your integration.

Frequently Asked Questions

How do I turn on speaker diarization in Gladia’s async transcription API?

Short Answer: You enable speaker diarization by including the diarization option in your async transcription request. Once enabled, Gladia will return segments grouped by speaker so you can see “who said what” across the file.

Expanded Explanation:
Speaker diarization is what makes a transcript usable in real conversations—meetings, sales calls, support queues—where multiple voices overlap. In Gladia’s async transcription API, diarization is a first-class add-on: when you activate it, the response is structured into speaker-specific segments instead of a flat wall of text.

Gladia’s diarization works across mono, stereo, and multi‑channel audio, which matters if you’re pulling from telephony (8 kHz SIP), recorded meetings, or mixed media. You get cleanly separated blocks with consistent speaker labels, so downstream logic (summaries, QA checks, CRM notes) can safely attribute each statement to the right person.

Key Takeaways:

  • Enable diarization via a specific flag/parameter on your async transcription request.
  • The API response comes back with speaker-labelled segments, ready to drive “who said what” views and diarized summaries.
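
As a rough illustration of the takeaways above, the request body could be assembled per job like this. The keys `audio_url`, `diarization`, and `diarization_config`, and the helper `build_transcription_payload`, are assumptions made for this sketch and are not confirmed by this article; check Gladia’s API reference for the exact names before relying on them.

```python
from typing import Any, Optional


def build_transcription_payload(
    audio_url: str,
    diarization: bool = True,
    expected_speakers: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble a JSON-serializable body for one async transcription job.

    Every key below is an assumed name used for illustration only.
    """
    payload: dict[str, Any] = {
        "audio_url": audio_url,      # assumed key for the audio location
        "diarization": diarization,  # assumed flag that turns diarization on
    }
    if diarization and expected_speakers is not None:
        # Assumed optional hint; leave it out when the speaker count is unknown.
        payload["diarization_config"] = {"number_of_speakers": expected_speakers}
    return payload


payload = build_transcription_payload(
    "https://example.com/call.wav", expected_speakers=2
)
```

Because the flag is just a field in the payload, each job can opt in or out of diarization without any change to the rest of the integration.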

How do I enable word-level timestamps in my async transcription requests?

Short Answer: Turn on word-level timestamps by setting the appropriate timestamp option in your async transcription request. This instructs Gladia to return start/end times for each word in the transcript.

Expanded Explanation:
Word-level timestamps are essential when you need your text to align precisely with audio or video. In Gladia’s API, you request them on demand; when enabled, every word in the transcript is annotated with its start and end time in the audio. That lets you build scrubbers, subtitles, or fine-grained search (“jump to where the customer mentions the competitor”) directly on top of the raw output.

Because timestamps are generated at the word level, you’re not limited to coarse segment start/end times. You can cut, re‑align, and replay at high resolution—critical if you’re building compliance review tools, training libraries, or multilingual subtitling pipelines.

Steps:

  1. Add the timestamp-related option to your async transcription request payload (e.g., in the JSON body alongside language and add-ons).
  2. Submit your audio file or URL to the async endpoint as usual.
  3. Parse the returned transcript object and use the per‑word timestamps to build subtitles, seek bars, or in‑context playback.
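
Step 3 above can be sketched as a small subtitle builder. This is a minimal example that assumes each word arrives as a dictionary with `word`, `start`, and `end` keys, with times in seconds; that shape and the helper names are assumptions for illustration, not Gladia’s documented response format.

```python
def format_srt_time(seconds: float) -> str:
    """Convert seconds to the SRT `HH:MM:SS,mmm` timestamp format."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into fixed-size subtitle cues.

    Each cue starts at its first word's start time and ends at its
    last word's end time.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i : i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")


# Hypothetical per-word output (assumed shape, times in seconds).
sample_words = [
    {"word": "Thanks", "start": 0.0, "end": 0.4},
    {"word": "for", "start": 0.45, "end": 0.6},
    {"word": "calling", "start": 0.65, "end": 1.1},
]
srt = words_to_srt(sample_words, max_words=2)
```

Fixed-size chunks keep the sketch short; a production subtitler would more likely break cues on pauses, punctuation, or speaker changes.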

What’s the difference between segment timestamps and word-level timestamps?

Short Answer: Segment timestamps mark when a speaker segment starts and ends; word-level timestamps mark when each individual word starts and ends inside those segments.

Expanded Explanation:
Both timestamp types tell you “when” in the audio something happened, but at different resolutions. Segment timestamps align to larger blocks: a diarized utterance from Speaker 1, then another block from Speaker 2, and so on. They’re ideal for high-level navigation—“skip to the next speaker,” “jump to Q&A.”

Word-level timestamps go much deeper. They annotate every word, so you can pinpoint specific phrases, numbers, or named entities and jump to that exact moment in the audio. This is especially useful when your downstream system relies on precision: syncing captions frame-by-frame, verifying how a legal disclaimer was read, or training an agent assist model that needs tight alignment between acoustic events and text.

Comparison Snapshot:

  • Segment timestamps: Start/end times for each diarized or logical block of speech; good for chapter-like navigation and speaker views.
  • Word-level timestamps: Start/end times for every word; required for fine-grained search, subtitles, and high-precision QA.
  • Both together: Use segments and words in the same transcript when you want “who said what when” at both the macro (segment) and micro (word) level.
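
To make the two resolutions concrete, here is a small sketch that searches for a phrase and reports both the enclosing segment’s start and the phrase’s own start and end. The segment shape (`speaker`, `start`, `end`, `words`) and the function `find_phrase` are assumptions for illustration rather than Gladia’s documented schema.

```python
import string


def _normalize(token: str) -> str:
    """Lowercase a word and strip surrounding whitespace and punctuation."""
    return token.strip().strip(string.punctuation).lower()


def find_phrase(segments: list[dict], phrase: str) -> list[dict]:
    """Return one hit per occurrence of `phrase` inside a single segment."""
    target = [_normalize(t) for t in phrase.split()]
    hits: list[dict] = []
    if not target:
        return hits
    for seg in segments:
        words = seg["words"]
        tokens = [_normalize(w["word"]) for w in words]
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i : i + len(target)] == target:
                hits.append({
                    "speaker": seg["speaker"],
                    "segment_start": seg["start"],      # macro: start of the block
                    "phrase_start": words[i]["start"],  # micro: start of the phrase
                    "phrase_end": words[i + len(target) - 1]["end"],
                })
    return hits


# Hypothetical diarized output with per-word timing (assumed shape).
segments = [
    {"speaker": 0, "start": 0.0, "end": 2.1, "words": [
        {"word": "How", "start": 0.0, "end": 0.3},
        {"word": "can", "start": 0.35, "end": 0.5},
        {"word": "I", "start": 0.55, "end": 0.6},
        {"word": "help?", "start": 0.65, "end": 1.0},
    ]},
    {"speaker": 1, "start": 2.4, "end": 5.0, "words": [
        {"word": "We", "start": 2.4, "end": 2.6},
        {"word": "also", "start": 2.7, "end": 3.0},
        {"word": "tried", "start": 3.1, "end": 3.5},
        {"word": "the", "start": 3.6, "end": 3.7},
        {"word": "competitor.", "start": 3.8, "end": 4.6},
    ]},
]
hits = find_phrase(segments, "the competitor")
```

With segment timestamps alone, playback could only jump to 2.4 s, the start of the second speaker’s turn; the word timings let it seek straight to 3.6 s, where the phrase begins.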

How do I implement diarization + word-level timestamps together in production?

Short Answer: Request both diarization and word-level timestamps in the same async job, then structure your pipeline around the combined response: parse speakers at the segment level, and drive UX/search from the word-level timing inside those segments.

Expanded Explanation:
In practice, you rarely want to choose between diarization and word-level timestamps; most production systems need both. With Gladia’s async transcription API, you can enable them simultaneously via request parameters. The API will return diarized segments that each contain word-level timing information, so you don’t have to stitch separate outputs together.

Implementation effort largely boils down to response parsing and storage design. You’ll want a schema that preserves speaker labels, segment boundaries, and per‑word timings so multiple workflows can reuse the same transcript: analytics, CRM enrichment, subtitles, coaching, and QA. Because Gladia exposes everything behind a single API, you don’t need separate services or pipelines to add diarization later.

What You Need:

  • An async transcription request that includes:
    • Diarization enabled.
    • Word-level timestamps enabled (timestamp mode/flag).
  • A response parser and data model that:
    • Stores segments keyed by speaker.
    • Preserves per‑word timestamps inside each segment for reuse across search, playback, and subtitle generation.
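
One possible data model for the checklist above is sketched here. The class names, field names, and input shape (segments carrying `speaker`, `start`, `end`, and a `words` list) are assumptions for illustration; adapt them to the response structure your jobs actually return.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Segment:
    speaker: int
    start: float
    end: float
    words: list[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        """Rebuild the segment text from its words."""
        return " ".join(w.text for w in self.words)


def parse_segments(raw_segments: list[dict]) -> list[Segment]:
    """Convert raw dictionaries (assumed shape) into typed segments."""
    return [
        Segment(
            speaker=raw["speaker"],
            start=raw["start"],
            end=raw["end"],
            words=[
                Word(w["word"].strip(), w["start"], w["end"])
                for w in raw.get("words", [])
            ],
        )
        for raw in raw_segments
    ]


def group_by_speaker(segments: list[Segment]) -> dict[int, list[Segment]]:
    """Key segments by speaker while keeping their original order."""
    grouped: dict[int, list[Segment]] = defaultdict(list)
    for seg in segments:
        grouped[seg.speaker].append(seg)
    return dict(grouped)


# Hypothetical raw output for a short two-speaker exchange.
raw = [
    {"speaker": 0, "start": 0.0, "end": 1.0, "words": [
        {"word": " Hello", "start": 0.0, "end": 0.4},
        {"word": " there", "start": 0.5, "end": 1.0},
    ]},
    {"speaker": 1, "start": 1.2, "end": 1.8, "words": [
        {"word": " Hi", "start": 1.2, "end": 1.8},
    ]},
    {"speaker": 0, "start": 2.0, "end": 2.5, "words": [
        {"word": " Thanks", "start": 2.0, "end": 2.5},
    ]},
]
parsed = parse_segments(raw)
by_speaker = group_by_speaker(parsed)
```

Keeping words nested inside their segment means search, playback, and subtitle generation can all read from the same stored transcript instead of re-deriving timing.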

Why should I care about diarization and word-level timestamps for async transcription?

Short Answer: Because without diarization and word-level timestamps, your transcripts can’t reliably power automation: you’ll misattribute speakers, miss key entities, and be unable to align text with audio for notes, summaries, and CRM syncs.

Expanded Explanation:
Async transcription is the backbone of most post-call workflows—summaries, QA scoring, CRM updates, coaching. If the transcript isn’t speaker-aware, you can’t differentiate between agent and customer, so your analytics and coaching signals get muddled. If you can’t anchor text to the audio at word-level granularity, you can’t build reliable subtitles, in‑context review, or defensible compliance audits.

Gladia’s stack is designed specifically to close these gaps in real-world audio conditions: noisy meetings, accented speakers, cross-talk, telephony at 8 kHz. You get diarization across mono/stereo/multi-channel inputs and word-level timestamps, across 100+ languages, through a single async API. That’s what makes the output safe to use as infrastructure for your product, not just as a rough reference.

Why It Matters:

  • Downstream reliability: Accurate “who said what when” is what keeps summaries, CRM sync, NER, and sentiment pipelines from drifting or misfiring.
  • Production-grade UX and compliance: Word-level timestamps plus diarization unlock trustable subtitles, agent/customer separation for QA, and verifiable playback for legal or regulatory review.

Quick Recap

To enable speaker diarization and word-level timestamps in Gladia’s async transcription API, you pass the relevant options directly in your transcription request. Diarization structures your transcript by speaker across mono, stereo, and multi-channel audio, while word-level timestamps annotate each word with start/end times. Using both together gives you precise, speaker-aware alignment between text and audio—exactly what you need to keep notes, summaries, analytics, and CRM syncs stable even on noisy real-world calls and meetings.
