
Temporal Cloud SLA: what’s included in 99.9% vs 99.99% and how do we request HA?
Most teams don’t read SLAs until something breaks. With Temporal Cloud, the SLA is more than a number: it’s a contractual commitment about how available the service behind your Workflows will be, so they keep making progress even when networks flake, zones go down, or infrastructure fails under load.
Quick Answer: Temporal Cloud offers a standard 99.9% availability SLA and a higher 99.99% availability SLA for workloads that require stricter uptime guarantees. High availability (HA) is achieved through built-in replication and disaster recovery, and you can request higher SLAs and HA configurations through Temporal sales/support as part of your Temporal Cloud plan and region setup.
Frequently Asked Questions
What’s the difference between 99.9% and 99.99% SLA in Temporal Cloud?
Short Answer: Over a 30-day month, a 99.9% SLA permits roughly 43 minutes of downtime, while a 99.99% SLA permits roughly 4 minutes. Both options are backed by Temporal Cloud’s built-in replication, disaster recovery, and scaling. The 99.99% tier is designed for the most critical Temporal namespaces and regions where you cannot afford even short service disruptions.
Expanded Explanation:
Temporal Cloud is built to keep your Workflows running through failures. The SLA formalizes how much unplanned unavailability of the service the contract tolerates over time. At 99.9%, you’re already getting a highly available, replicated service with automatic failover and durable state. At 99.99%, you’re asking us to operate the service for your namespaces within an even tighter uptime envelope: better protections, faster recovery, and more stringent operational SLOs behind the scenes.
Remember: an outage in the Temporal Service doesn’t mean your Workflow progress is lost. Your Workflow histories are persisted durably. When the Service is back, Workflows resume from the last recorded event with no manual recovery. The higher SLA simply reduces how often you’ll hit that “service is unavailable” window in the first place.
Key Takeaways:
- 99.9% SLA covers most production workloads that need strong reliability and durability.
- 99.99% SLA is for the highest-criticality workloads where even short unavailability windows are unacceptable.
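The gap between the two tiers is simple arithmetic, and a minimal sketch makes it concrete. The helper name `downtime_budget` and the 30-day measurement period are assumptions for illustration; how availability is actually measured, and which exclusions apply, is defined by your agreement.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float,
                    period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the maximum unavailability an SLA percentage permits over a period.

    Assumes a flat 30-day period by default; real SLAs define their own
    measurement window and exclusions.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    minutes = downtime_budget(sla).total_seconds() / 60
    print(f"{sla}% SLA: about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions the budgets come to about 43.2 minutes for 99.9% and about 4.3 minutes for 99.99%, a tenfold difference.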
How do I request high availability (HA) for Temporal Cloud?
Short Answer: You request HA by working with Temporal sales/support to choose the appropriate SLA tier, region(s), and namespace configuration; Temporal Cloud then provisions your account on our replicated, disaster-ready architecture.
Expanded Explanation:
HA in Temporal Cloud is not something you bolt on yourself with scripts. It’s built into the service: replication, failover, and disaster recovery are handled at the platform layer so your team doesn’t maintain clustering logic or database replication. When you onboard to Temporal Cloud or adjust your plan, you specify your availability needs (99.9% vs 99.99%), preferred regions, and workload characteristics. The Temporal team configures your namespaces on the appropriate HA setup and documents the SLA in your agreement.
If you’re migrating from self-hosted, you can also coordinate a cutover plan so you don’t lose in-flight Workflows. The Cloud team will help you size, plan, and validate the HA configuration that matches your risk tolerance and growth.
Steps:
- Contact Temporal (sales or support) and state you want HA with a specific SLA target (99.9% or 99.99%).
- Select region(s) and discuss workload characteristics (traffic profile, actions/sec, critical paths).
- Finalize the SLA and HA configuration in your Temporal Cloud contract and onboarding, then migrate or launch your namespaces on that setup.
Is 99.9% SLA enough, or should I choose 99.99%?
Short Answer: 99.9% is enough for many production systems; 99.99% is for your most critical Workflows where even brief Temporal control-plane unavailability would materially impact your business.
Expanded Explanation:
Think about impact, not just the number. At 99.9%, the service is permitted about ten times more downtime per month than at 99.99% (roughly 43 minutes versus 4 minutes over 30 days), but in both cases your Workflow state is durable and will resume when the service recovers. The question is: what happens to your users and your business if Temporal Cloud is briefly unavailable?
For example, a long-running AI pipeline that can tolerate a short pause might be fine on 99.9%. A payment processing or order fulfillment Workflow that must be continuously available during business hours may justify 99.99%. You can mix: keep most namespaces at 99.9% and reserve 99.99% for the few that move money, ship goods, or gate revenue.
Comparison Snapshot:
- 99.9% SLA: Strong availability and durability for most production workloads; small risk of short unavailability windows each month.
- 99.99% SLA: Tighter uptime target, more appropriate for business-critical paths where every minute of unavailability hurts.
- Best for:
  - 99.9%: Standard microservices backends, AI agents, internal tools.
  - 99.99%: Payments, ledgers, fulfillment, core customer-facing flows.
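One rough way to apply this comparison is to write down how many minutes of unavailability each namespace could tolerate per month and pick the lowest tier whose downtime budget fits. The sketch below is illustrative only: the namespace names, the tolerance figures, and the `pick_tier` helper are hypothetical, and the 30-day month is an assumption. An SLA is a contractual commitment rather than a hard cap, so treat the result as a starting point for the conversation with Temporal.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed period)

# The two tiers named in this article, lowest first.
SLA_TIERS = (99.9, 99.99)


def budget_minutes(sla_percent: float) -> float:
    """Downtime an SLA percentage permits over an assumed 30-day month."""
    return MINUTES_PER_30_DAY_MONTH * (100 - sla_percent) / 100


def pick_tier(tolerable_minutes_per_month: float):
    """Return the lowest tier whose budget fits the tolerance, or None."""
    for tier in SLA_TIERS:
        if budget_minutes(tier) <= tolerable_minutes_per_month:
            return tier
    return None  # neither tier fits; the requirement needs a separate discussion


# Hypothetical namespaces and tolerances, for illustration only.
tolerances = {"internal-tools": 120, "payments": 10, "trading": 1}
choices = {name: pick_tier(minutes) for name, minutes in tolerances.items()}
print(choices)
```

With these made-up tolerances, the internal tools namespace fits 99.9%, the payments namespace needs 99.99%, and the strictest one is not covered by either tier's budget.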
How does Temporal Cloud actually deliver HA behind these SLAs?
Short Answer: Temporal Cloud achieves HA using replication, disaster recovery, automatic scaling, and a hardened control plane, so your Workflow histories remain durable and, once your Workers reconnect after a failure, your Workflows resume from the last recorded event.
Expanded Explanation:
Failures are inevitable: zones fail, databases lose nodes, and networks partition. Temporal Cloud’s job is to make those failures irrelevant to your Workflow outcomes. The Service persists every Workflow state transition as an event history. If the Service crashes or a region has issues, that history is still safe. Once the control plane is healthy again, your Workers reconnect, pull tasks from task queues, and your code continues from exactly where it left off—no custom replay logic, no manual intervention.
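The resume-from-history idea can be illustrated with a toy model. This is not the Temporal SDK or the Service's implementation: `ToyHistory`, `run_workflow`, and the step names are hypothetical, and an in-memory list stands in for a durably persisted event history. The sketch only shows why recorded steps are not executed again after an interruption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHistory:
    """Append-only log of completed steps; stands in for a durable event history."""
    events: list = field(default_factory=list)


def run_workflow(history, steps, crash_after=None):
    """Run steps in order, skipping any step already recorded in the history.

    crash_after simulates an outage after that many steps execute in this run.
    Returns the names of the steps actually executed in this run.
    """
    executed = []
    for index, (name, fn) in enumerate(steps):
        if index < len(history.events):
            continue  # already recorded: do not execute this step again
        if crash_after is not None and len(executed) == crash_after:
            raise ConnectionError("simulated service outage")
        history.events.append((name, fn()))  # record the result before moving on
        executed.append(name)
    return executed


steps = [
    ("reserve_inventory", lambda: "reserved"),
    ("charge_card", lambda: "charged"),
    ("ship_order", lambda: "shipped"),
]

history = ToyHistory()
try:
    run_workflow(history, steps, crash_after=2)  # outage hits before the third step
except ConnectionError:
    pass

resumed = run_workflow(history, steps)  # second run skips the two recorded steps
print(resumed)
```

In the second run only `ship_order` executes, because the first two steps are already in the history. The card is charged once, which is the property the paragraph above describes.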
The platform also scales automatically to hundreds of thousands of Actions per second, so you aren’t fighting resource bottlenecks as you grow. Replication and disaster recovery are handled as part of the managed service, not another system your team maintains. You keep Workers in your own environment while the Temporal Service coordinates execution from the cloud, so we never see your code.
What You Need:
- A Temporal Cloud account with the appropriate SLA tier and regions configured.
- Workers deployed in your environment that can reconnect and resume processing when the Service recovers.
How do SLAs and HA translate into business outcomes for my team?
Short Answer: Higher SLAs and built-in HA reduce firefighting, eliminate manual recovery, and give you predictable reliability for critical workflows, which directly protects revenue and engineering velocity.
Expanded Explanation:
Without Temporal, teams handle failures by writing brittle state machines, scattering retry logic across services, and keeping manual runbooks for when things go wrong. When a database or region goes down mid-process, you’re left guessing what completed, what didn’t, and what needs to be replayed. That guesswork burns time and introduces double-charges, duplicate shipments, or inconsistent customer state.
With Temporal Cloud and a clear SLA, you get a durable execution backbone: your long-running Workflows don’t lose progress, you see exactly where each Workflow is in the Web UI, and you can inspect, replay, or rewind as needed. The SLA and HA posture mean you can plan around well-understood uptime guarantees, not arbitrary surprise outages. The result: fewer incidents, faster incident resolution when they do happen, and more confidence to encode business-critical flows—payments, provisioning, AI pipelines—directly as Temporal Workflows.
Why It Matters:
- Reduced operational risk: no orphaned processes, no manual recovery, predictable uptime targets.
- Higher delivery velocity: developers write business logic as code instead of reinventing reliability infrastructure.
Quick Recap
Temporal Cloud’s SLA options (99.9% and 99.99%) sit on top of a Durable Execution platform that already treats failures as routine and recoverable. The 99.9% SLA is sufficient for many production workloads, while 99.99% targets the most critical flows where even short unavailability windows are unacceptable. High availability comes from built-in replication, disaster recovery, and automatic scaling—not ad-hoc scripts you maintain—while your Workflow state is always durably persisted and resumable.