
Gladia data retention and opt-out: how do I ensure our audio isn’t used for training and is deleted after processing?
Most teams worry about two things when they wire STT into production: “Is our audio being used to train someone else’s model?” and “How long does anything stay on their servers?” If you’re evaluating Gladia, those are exactly the right questions to ask up front.
Quick Answer: On Gladia’s paid plans, your audio is not used for model training, and processing data is retained only as long as needed for the service and then deleted. You can also explicitly request deletion of usage data at any time via support and configure stricter retention in line with your security posture.
Frequently Asked Questions
Who can have their audio used for model training at Gladia?
Short Answer: Only users on the Free Plan may have their audio used to train Gladia’s models; paid customers’ audio is not used for training.
Expanded Explanation:
Gladia separates usage by plan when it comes to training data. Free-tier usage can be sampled to improve the Solaria model line, evaluation harnesses, and related tooling; this lets Gladia iterate quickly without pushing those costs or tradeoffs onto production customers.
On paid plans, audio is processed strictly to deliver the service you requested (transcription, diarization, translation, etc.). It is not repurposed to build or fine‑tune models. Internally, training datasets are curated and anonymized, and they do not contain information that can directly identify individuals. No commercial exploitation or individual profiling is performed on these datasets.
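For concreteness, here is a minimal sketch of what a paid-plan call looks like in Python with the requests library. The endpoint path, header name, request parameters, and response fields (result_url, status) are assumptions based on Gladia's v2 async API shape; confirm them against the current API reference before relying on them:

```python
import time
import requests

GLADIA_API_KEY = "YOUR_GLADIA_API_KEY"  # assumption: a key from your paid workspace
BASE_URL = "https://api.gladia.io/v2"   # assumption: v2 base URL; check current docs

# Submit an async transcription job. Path and parameter names are
# illustrative; verify against Gladia's API reference.
resp = requests.post(
    f"{BASE_URL}/transcription",
    headers={"x-gladia-key": GLADIA_API_KEY},
    json={
        "audio_url": "https://example.com/call-recording.wav",
        "diarization": True,  # request speaker labels alongside the transcript
    },
    timeout=30,
)
resp.raise_for_status()
job = resp.json()

# Poll the result URL until the job finishes. Status values are assumed
# to settle on "done" or "error"; adjust to the documented states.
while True:
    result = requests.get(
        job["result_url"],
        headers={"x-gladia-key": GLADIA_API_KEY},
        timeout=30,
    ).json()
    if result.get("status") in ("done", "error"):
        break
    time.sleep(2)

print(result.get("status"))
```

On a paid plan, a call like this is processed to produce the result you polled for and nothing more; it does not feed a training pipeline.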
Key Takeaways:
- Only Free Plan traffic is eligible for use in model training.
- Paid-plan traffic is kept separate from training workflows and is used solely to run the API features you call.
How do I ensure our audio is deleted after processing?
Short Answer: By default, Gladia retains data only for the period necessary to provide and monitor the service, after which it is removed; you can also explicitly request removal of usage data by contacting support.
Expanded Explanation:
Gladia’s retention policy is built around minimization: data is only kept for as long as it is operationally needed to run the service, handle billing, debug issues, and improve reliability. After this period, it is removed from both the dashboard and the underlying servers.
If your internal policies require stricter deletion guarantees, you can submit a deletion request for your usage data at any time by emailing support@gladia.io. For enterprise deployments (including on‑prem or air‑gapped), retention can be tailored to your governance rules so that transcripts and audio never outlive the workflows they support.
Steps:
- Choose an appropriate plan: Use a paid plan if you want full separation from any training pipelines by default.
- Align on retention internally: Define how long you actually need transcripts/logs for debugging, QA, or compliance (see the sketch after these steps).
- Request deletion when needed: Email support@gladia.io with your workspace/account details and the scope of data you want removed; Gladia will process the deletion and confirm once completed.
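Gladia handles deletion on its side, but any copies you keep (uploaded audio, downloaded transcripts, debug logs) are yours to clean up on the same schedule. A minimal client-side sketch of a retention sweep, assuming artifacts live on local disk under one directory and a 30-day window; both are placeholders for your own storage layout and policy:

```python
import time
from pathlib import Path

RETENTION_DAYS = 30                   # assumption: your internally agreed window
ARTIFACT_DIR = Path("/var/data/stt")  # assumption: where you store audio/transcripts

cutoff = time.time() - RETENTION_DAYS * 86400

# Remove every audio or transcript artifact older than the retention window.
for path in ARTIFACT_DIR.rglob("*"):
    if path.is_file() and path.stat().st_mtime < cutoff:
        path.unlink()
        print(f"deleted {path}")
```

Running a sweep like this on a schedule (cron, a task queue, or your orchestrator) keeps your side of the deletion story as auditable as Gladia's.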
What’s the difference between “usage data” and “training datasets”?
Short Answer: Usage data is your operational API traffic; training datasets are separate, curated corpora used to test and improve Gladia’s models under strict rules.
Expanded Explanation:
Usage data includes the audio, transcripts, metadata, and logs generated when your systems call Gladia’s API. This is tied to your account, used to deliver features and billing, and is subject to retention limits and deletion on request.
Training datasets are a distinct construct: several thousand hours of recordings, across 100+ languages, specifically collected and processed for evaluating and improving the AI service’s source code. They are governed by explicit constraints—no commercial exploitation, no individual profiling, and no direct identifiers. Datasets are retained only for the time strictly necessary to test and improve the models, then deleted.
Comparison Snapshot:
- Option A: Usage data: Your live audio and transcripts created through the API; used for service delivery, support, and operations.
- Option B: Training datasets: Pre‑curated audio corpora used only to test and improve model performance, with no direct identifiers.
- Bottom line: Treat usage data as your operational footprint; treat training datasets as Gladia’s internal R&D corpus with tightly scoped use and retention.
How do I implement stricter data controls (opt‑out, deletion, hosting)?
Short Answer: Use a paid plan, align with Gladia on retention and hosting (cloud, region, or on‑prem), and document your opt‑out and deletion requirements with the support and account teams.
Expanded Explanation:
Most teams building telephony, note‑taking, or voice‑agent products already have data governance policies in place. Gladia is designed to slot into those constraints rather than fight them. By default, the service is delivered in a secure cloud environment that can be tailored to your geographic footprint; for stricter environments, Gladia can support on‑premises and even air‑gapped hosting.
Operationally, you’ll want a written understanding of: (1) whether your plan’s traffic is eligible for training (Free vs. paid), (2) your desired retention period for logs and artifacts, and (3) the process for auditable deletion. Once set, your teams can safely build workflows (CRM sync, analytics, QA) on top of Gladia’s transcripts without worrying that audio is being repurposed beyond your agreed scope. That written understanding can also live next to your integration code, as sketched below.
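One pragmatic way to make the agreement enforceable is to encode it as a small, machine-readable record that your deployment checks at startup. A hedged sketch; the class, field names, and guard are illustrative conventions, not a Gladia construct:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SttDataPolicy:
    """Internal record of what you agreed with Gladia (illustrative names)."""
    plan: str              # "paid" or "enterprise": excluded from training by default
    retention_days: int    # how long transcripts/logs may live on your side
    deletion_contact: str  # where auditable deletion requests go
    hosting: str           # e.g. "cloud", "eu-region", "on-prem", "air-gapped"

POLICY = SttDataPolicy(
    plan="paid",
    retention_days=30,
    deletion_contact="support@gladia.io",
    hosting="cloud",
)

def assert_training_opt_out(policy: SttDataPolicy) -> None:
    # Guard: only Free Plan traffic is eligible for model training, so
    # anything other than a paid/enterprise plan fails this check.
    if policy.plan not in ("paid", "enterprise"):
        raise ValueError("Free Plan traffic may be used for training; upgrade the plan.")

assert_training_opt_out(POLICY)
```

A check like this turns a policy document into something CI can verify, so a plan downgrade or a retention change cannot slip into production unnoticed.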
What You Need:
- A paid or enterprise plan: To ensure your audio is not used for model training and to unlock custom retention/hosting options.
- Documented data policy with Gladia: Cover retention windows, deletion SLAs, and (if needed) region‑specific or on‑prem deployment.
How does this align with compliance (GDPR, HIPAA, SOC 2, ISO 27001)?
Short Answer: Gladia’s data retention and opt‑out controls are designed to support GDPR, HIPAA, SOC 2, and ISO 27001 compliance, with strict limits on how datasets are used and how long they’re kept.
Expanded Explanation:
Under GDPR and similar regimes, you need clarity on purpose limitation, data minimization, retention, and data‑subject rights. Gladia’s privacy stance is built around those principles:
- Datasets are used only to test and improve the AI service’s source code.
- No commercial exploitation or individual profiling is performed.
- Datasets are deleted once their testing/improvement purpose is fulfilled.
- You can exercise your rights (access, deletion) by contacting support, even though Gladia’s datasets do not contain information that directly identifies individuals.
On top of that, Gladia operates under enterprise‑grade controls (GDPR compliant, HIPAA compliant, SOC 2 Type II attested under AICPA standards, ISO 27001 certified) and offers deployment options, including air‑gapped hosting, for organizations with extremely tight security requirements.
Why It Matters:
- Regulatory fit: You can use Gladia as the STT backbone for regulated workloads (healthcare, finance, public sector) while staying aligned with your own compliance framework.
- Risk containment: Clear separation of training vs usage data, strict dataset purpose, and limited retention reduce the chance of data being used in ways your governance model doesn’t allow.
Quick Recap
To keep your audio from being used for training and ensure it’s deleted after processing, run Gladia on a paid plan, define your retention expectations, and document deletion procedures with the team. Only Free Plan traffic may feed model improvement; training datasets are separate, anonymized corpora that exist solely to test and improve the AI service’s source code, with no commercial exploitation or individual profiling. Combined with GDPR/HIPAA/SOC 2/ISO 27001 alignment and flexible hosting options, this gives you a clear, auditable path to using Gladia as your speech‑to‑text backbone without losing control over your data.