Gladia data retention and opt-out: how do I ensure our audio isn’t used for training and is deleted after processing?

Most teams I talk to want one thing above all from an STT provider: strong accuracy without losing control of their data. If you’re integrating Gladia into a voice product, you need to know exactly how data retention works, when audio may be used for training, and how to keep your audio out of training and have it deleted after processing.

Quick Answer: Gladia may use only Free Plan audio for model training; audio from paid plans is not used for training by default. Data is retained only as long as processing and service operation require, then removed. You can also request removal of usage data at any time via support@gladia.io, and set stricter retention terms in your contract and your integration.

Frequently Asked Questions

Who can have their audio used for Gladia model training?

Short Answer: Only users on the Free Plan can have their audio used for model training; paid plans are excluded from training by default.

Expanded Explanation:
Gladia’s training policy is deliberately narrow. Audio from Free Plan users may be used to test and improve the underlying ASR and audio intelligence models. This is how Gladia iterates on quality while keeping clear boundaries: training use is limited to one plan type, with a defined purpose (benchmarking and model improvement), not broad “data monetization.”

If you’re on a paid plan (startup, growth, enterprise, or a custom contract), your audio is not used for model training by default. This separation is designed for teams handling sensitive conversations, such as contact centers, healthcare providers, and financial services, where compliance and trust require strict constraints on how data is processed.

Key Takeaways:

  • Free Plan audio may be used to train and improve Gladia models.
  • Paid plan audio is not used for model training by default.

How do I make sure our audio is deleted after transcription or streaming?

Short Answer: By default, Gladia retains data only for as long as processing and service operation require, then removes it. You can additionally request removal of usage data via support and agree stricter retention terms in your contract.

Expanded Explanation:
Gladia’s data retention is scoped to what’s operationally necessary to deliver and maintain the service—processing your requests, ensuring stability, and improving the source code of the AI service with curated datasets. Those internal datasets are time-limited, used strictly for testing and improving the models, and do not contain information that directly identifies individuals.

After the defined processing and retention period, data is completely removed from Gladia’s servers and dashboards. If you need stronger assurances—for example, because you’re operating under strict internal policies or regulatory frameworks—you can (1) formalize a short retention window in your contract, and/or (2) trigger removal of usage data by contacting support.

Steps:

  1. Confirm your plan: Ensure you’re on a paid plan if you don’t want your audio used for training.
  2. Align retention in your contract: During onboarding or procurement, set explicit retention windows (e.g., “delete after processing” or X days).
  3. Request data removal when needed: Email support@gladia.io to request removal of usage data for specific projects, accounts, or time ranges, referencing your organization and environment.
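Step 3 can be made repeatable by drafting the request from a template. The following is a minimal sketch using only Python's standard library. The message layout, the field names, and the `build_removal_request` helper are illustrative assumptions; the article specifies only the address and that the request should reference your organization, environment, and scope.

```python
from datetime import date
from email.message import EmailMessage
from typing import List

SUPPORT_ADDRESS = "support@gladia.io"  # address given in the article


def build_removal_request(sender: str, organization: str, environment: str,
                          start: date, end: date,
                          projects: List[str]) -> EmailMessage:
    """Draft a usage-data removal request.

    The wording and fields are assumptions, not a format Gladia prescribes.
    """
    if end < start:
        raise ValueError("end date must not precede start date")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SUPPORT_ADDRESS
    msg["Subject"] = f"Usage data removal request: {organization} ({environment})"
    msg.set_content("\n".join([
        "Hello,",
        "",
        "Please remove the usage data described below and confirm once done.",
        "",
        f"Organization: {organization}",
        f"Environment: {environment}",
        f"Time range: {start.isoformat()} to {end.isoformat()}",
        "Projects: " + ", ".join(projects),
    ]))
    return msg


if __name__ == "__main__":
    # Placeholder values; replace with your own before sending.
    draft = build_removal_request(
        sender="privacy@example.com",
        organization="Example Corp",
        environment="production",
        start=date(2024, 1, 1),
        end=date(2024, 1, 31),
        projects=["support-line"],
    )
    print(draft)
```

The function only builds the message. Sending it (for example through your own mail relay) and logging the request for audit purposes is left to your internal process.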

What’s the difference between “datasets for evaluation” and our production audio?

Short Answer: Evaluation datasets are curated, de-identified corpora used to test and improve Gladia’s source code; your production audio is your operational data, processed to deliver transcription and related features.

Expanded Explanation:
Gladia maintains internal “Datasets” containing several thousand hours of recordings across 100+ languages. These datasets are used solely to test and improve the AI service’s source code, for example by measuring word error rate (WER) on noisy telephony, diarization error rate (DER) on multi-speaker calls, and stability across languages. They do not contain information that directly identifies individuals, and they’re retained only as long as necessary for testing.

By contrast, your production audio flows through Gladia’s API (REST or WebSocket) to power your workflows: real-time transcripts, timestamps, diarization, named entity recognition (NER), sentiment, and summaries. That data is handled under your plan’s privacy posture and retention terms. If you’re on a paid plan, your production traffic is not automatically added to those training/evaluation datasets.

Comparison Snapshot:

  • Evaluation Datasets: Curated, de-identified, multi-language corpora; used only to test and improve source code; retained for a limited period.
  • Your Production Audio: Your operational data sent via API; retained only as needed for service delivery and operations under your plan and contract.
  • Best for: Using Gladia as an STT backbone where you get benchmark-driven improvements without shipping your sensitive production conversations into broad training pools.

How do I implement Gladia so that we’re fully opted out of training and tightly controlled on retention?

Short Answer: Use a paid plan, codify retention and training exclusions in your commercial terms, and configure your integration and internal policies to enforce “delete after processing” or short-lived storage.

Expanded Explanation:
From an implementation standpoint, you control most of the risk surface with three layers: plan selection, legal/commercial terms, and integration design. Pick a paid plan to avoid training use, lock in your retention/treatment requirements in your DPA and MSA, then architect your integration so you don’t persist more than you need on either side.
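On your own side, "don't persist more than you need" can be enforced with a scheduled purge. The sketch below assumes transcripts are stored as files in a local directory and that file modification time is an acceptable proxy for age. The `purge_expired` helper and the one-hour window are illustrative, not part of any Gladia tooling.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def purge_expired(directory: Path, max_age_seconds: float,
                  now: Optional[float] = None) -> List[Path]:
    """Delete regular files in `directory` last modified before the cutoff.

    Returns the paths that were removed so the caller can log them.
    """
    cutoff = (time.time() if now is None else now) - max_age_seconds
    removed = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        old = store / "call-old.json"
        new = store / "call-new.json"
        old.write_text("{}")
        new.write_text("{}")
        # Backdate one file by two hours to simulate an expired transcript.
        two_hours_ago = time.time() - 2 * 3600
        os.utime(old, (two_hours_ago, two_hours_ago))
        removed = purge_expired(store, max_age_seconds=3600)
        print([p.name for p in removed])  # lists only the backdated file
```

Run something like this from a scheduler with the same window you agreed in your contract, so that your local copies do not outlive the vendor's.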

For high-sensitivity environments, you can go further: dedicated or regional hosting, and even on‑prem or air‑gapped deployments if your security team requires full environmental control. Gladia is built to support organizations with strict data boundaries—GDPR, HIPAA, SOC 2, ISO 27001—without treating that as an add‑on.

What You Need:

  • A paid or enterprise plan with defined terms: Confirm in writing that your audio is excluded from model training, and set clear retention windows.
  • Integration + ops controls: Architect your stack so that:
    • You only send the minimum necessary audio.
    • You don’t rely on Gladia as a long-term store of transcripts—pull what you need, then let retention policies delete the rest.
    • You have an internal process to trigger additional removal via support@gladia.io when required (e.g., data subject requests or incident response).
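The first bullet above, sending only the minimum necessary audio, can be applied before anything leaves your infrastructure. The following sketch assumes uncompressed PCM WAV input and uses Python's standard `wave` module to copy just the segment you need. The `trim_wav` helper is hypothetical, and how you decide which segment is necessary is up to your application.

```python
import os
import tempfile
import wave


def trim_wav(src_path: str, dst_path: str, start_s: float, end_s: float) -> int:
    """Copy only [start_s, end_s) of a PCM WAV file; return frames written."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("need 0 <= start_s < end_s")
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp both bounds to the length of the recording.
        first = min(int(start_s * rate), total)
        last = min(int(end_s * rate), total)
        src.setpos(first)
        frames = src.readframes(last - first)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return last - first


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "call.wav")
        dst = os.path.join(tmp, "segment.wav")
        # Write two seconds of 8 kHz mono 16-bit silence as stand-in audio.
        with wave.open(src, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(8000)
            w.writeframes(b"\x00\x00" * 16000)
        # Keep only the one-second window from 0.5 s to 1.5 s.
        print(trim_wav(src, dst, 0.5, 1.5))
```

Only the trimmed file would then be sent for transcription; the full recording stays in your own storage under your own retention rules.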

How does Gladia’s data retention and opt‑out posture support compliance (GDPR, HIPAA, SOC 2, ISO 27001)?

Short Answer: Tight retention, limited training scope, and auditable deletion controls make it easier to satisfy GDPR, HIPAA, SOC 2, and ISO 27001 requirements when Gladia is your speech backbone.

Expanded Explanation:
Compliance work is about being able to prove what happens to data: purpose, scope, retention, and deletion. Gladia’s approach combines clear segmentation of Free Plan training use, purpose-limited evaluation datasets, time‑bound retention, and explicit deletion paths. That maps cleanly to GDPR principles (data minimization, purpose limitation, storage limitation) and to HIPAA, SOC 2, and ISO 27001 controls around PHI handling and data lifecycle management.

Because Gladia never positions security as an upsell, “real” privacy isn’t locked behind a custom tier. Instead, you use contracts and configuration to dial in the retention and opt‑out posture your regulator or internal policy needs, and you can document it with concrete language: whose audio can be used for training (only Free Plan users’), what the purpose is (testing and improving source code), and how long data is kept (minimum necessary, then deletion).
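One way to keep that documentation consistent is to store it as a small structured record alongside your other vendor records. The sketch below is only an illustration: the `SpeechDataPolicy` class and its field names are assumptions, and the values restate the article's wording, which you should replace with the exact terms in your own contract.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SpeechDataPolicy:
    """Audit-facing summary of how a speech vendor handles your data."""
    vendor: str
    plan: str
    training_use: str
    dataset_purpose: str
    retention: str
    removal_contact: str


policy = SpeechDataPolicy(
    vendor="Gladia",
    plan="paid",  # placeholder: record your actual plan name
    training_use="excluded (only Free Plan audio may be used for training)",
    dataset_purpose="testing and improving source code",
    retention="minimum necessary, then deletion",  # replace with your contract window
    removal_contact="support@gladia.io",
)

# Serialize for a vendor register or audit evidence folder.
print(json.dumps(asdict(policy), indent=2))
```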

Why It Matters:

  • Reduced audit risk: You can point auditors to explicit policies: Free vs paid training use, retention periods, deletion processes, and the compliance frameworks in scope (GDPR, HIPAA, SOC 2, ISO 27001).
  • Stable infrastructure posture: You’re building on an audio backbone that treats privacy as non‑negotiable, aligning your STT layer with the same rigor you apply to your core application data stores.

Quick Recap

To ensure your audio isn’t used for training and is deleted after processing, you should use a paid Gladia plan, codify your retention and training exclusions in your contract, and design your integration around short-lived processing. Only Free Plan audio may be used for model training, internal evaluation datasets are de‑identified and time‑bound, and all users can request removal of usage data via support@gladia.io. Combined with Gladia’s compliance posture (GDPR, HIPAA, SOC 2, ISO 27001), this gives you a speech‑to‑text backbone that respects strict data boundaries while still improving through benchmark‑driven evaluation.
