Temporal Cloud SLA: what’s included in 99.9% vs 99.99% and how do we request HA?

Downtime is expensive. When you move critical workflows like payments, order fulfillment, or AI pipelines to Temporal Cloud, you want to know exactly what 99.9% vs 99.99% availability means—and what it takes to get High Availability (HA) for your namespaces.

Quick Answer: Temporal Cloud offers different SLAs (commonly 99.9% and 99.99%) that define how much downtime per month is allowed for the Temporal Service. Higher SLAs typically include HA configuration (multi-AZ, replication, disaster recovery) and are enabled through your Cloud plan and support agreement—HA is not something you “toggle” in the UI; it’s provisioned and confirmed with the Temporal team.


Frequently Asked Questions

What’s the practical difference between 99.9% and 99.99% availability in Temporal Cloud?

Short Answer: 99.9% availability allows about 43.8 minutes of unplanned downtime per month; 99.99% reduces that to about 4.4 minutes. The higher SLA tier is designed for truly critical workflows where even short outages are unacceptable.

Expanded Explanation:
Availability is just math: 99.9% uptime means your Temporal Cloud cluster can be unavailable for roughly 8.76 hours per year; 99.99% cuts that to ~52.6 minutes per year. For non-critical workloads—internal tools, non-blocking batch jobs—99.9% may be acceptable. For money movement, telecom provisioning, or customer-facing flows that gate revenue, those extra “nines” are the difference between a minor blip and a serious incident.
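To make the arithmetic concrete, here is a minimal Python sketch that turns an availability percentage into a downtime budget. It assumes an average month of 365.25 / 12 days and a non-leap 365-day year. How downtime is actually measured for your SLA is defined in your Temporal Cloud agreement, not by this helper, and the names `downtime_budget_minutes` and `meets_sla` are purely illustrative.

```python
# Downtime budget implied by an availability percentage.
# Assumptions: an "average" month of 365.25 / 12 days and a non-leap 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60           # 525,600 minutes
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # ~43,830 minutes


def downtime_budget_minutes(availability_pct: float, period_minutes: float) -> float:
    """Maximum downtime (minutes) allowed over a period at the given availability."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    return period_minutes * (1 - availability_pct / 100)


def meets_sla(observed_downtime_min: float, availability_pct: float,
              period_minutes: float = MINUTES_PER_MONTH) -> bool:
    """True if observed downtime fits inside the budget (simple time-based check)."""
    return observed_downtime_min <= downtime_budget_minutes(availability_pct, period_minutes)


for sla in (99.9, 99.99):
    per_month = downtime_budget_minutes(sla, MINUTES_PER_MONTH)
    per_year = downtime_budget_minutes(sla, MINUTES_PER_YEAR)
    print(f"{sla}%: ~{per_month:.1f} min/month, ~{per_year:.1f} min/year ({per_year / 60:.2f} h)")
```

A 30-minute outage, for example, fits inside the 99.9% monthly budget (`meets_sla(30, 99.9)` is `True`) but exceeds the 99.99% one.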

Temporal Cloud is built for demanding workloads with high availability in mind: built-in replication, automatic failover, and disaster recovery are part of the architecture. But your SLA formalizes the minimum uptime guarantee and the commercial remedies if we ever fall short.

Key Takeaways:

  • 99.9% ≈ 43.8 minutes of potential downtime per month; 99.99% ≈ 4.4 minutes.
  • Use 99.99% for business- or safety-critical workflows where outages directly impact revenue, SLAs, or compliance.

How do we actually request or enable High Availability in Temporal Cloud?

Short Answer: You request HA through Temporal Cloud sales/support—usually as part of your initial contract or a plan upgrade. The Temporal team then provisions your namespaces on a highly available, replicated deployment that matches the SLA and region strategy you choose.

Expanded Explanation:
High Availability in Temporal Cloud isn’t a checkbox in the dashboard. It’s a deployment characteristic: multi-AZ architecture, built‑in replication, and disaster recovery tuned to your workload and SLA. When you engage with Temporal (via the “Get Started”/“Sign up for Cloud” flow or directly with sales), you specify your reliability requirements, regions, and approximate scale. The team uses that to design and provision an HA configuration that backs your SLA.

If you’re already a Temporal Cloud customer on a 99.9% SLA and you want to move to a higher-availability posture, you work with your account team. They’ll assess your current namespaces, traffic patterns, and criticality, then plan the migration or upgrade with you—including any maintenance windows and testing you might want.

Steps:

  1. Engage Sales/Support: Use the Temporal Cloud “Get Started” path or contact your account rep to describe your HA and SLA targets.
  2. Define Requirements: Align on regions, expected scale (Actions/second), workloads that require stronger guarantees, and compliance needs.
  3. Provision & Confirm: Temporal provisions or upgrades your Cloud deployment to the appropriate HA/SLA tier and confirms the configuration, regions, and support level with you.

Besides the SLA number, what’s really different between a “standard” Cloud setup and a higher-HA one?

Short Answer: The higher-HA setup typically adds stronger replication, multi-zone (multi-AZ) resilience, and more stringent operational practices tied to your SLA, while your application code and Worker model remain the same.

Expanded Explanation:
From a developer’s perspective, Temporal Cloud looks the same on either tier: you write Workflows and Activities, run Workers in your own environment, and connect them to the Temporal Service over secure, unidirectional connections. Because your code runs in your environment, we never see it.

The difference is in how the Temporal Service is run for you. Higher-HA deployments emphasize additional redundancy, tuned failover, and stricter operational SLOs, backed by 24/7 support options. The goal is simple: minimize the chance that a regional issue, zone failure, or internal incident interrupts your ability to start, schedule, or advance Workflows.
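Why does extra redundancy buy extra nines? A deliberately simplified model helps: if a service is replicated across n zones, any one healthy zone can carry the load, and zone failures are independent, then overall availability is 1 - (1 - a)^n for a per-zone availability a. Real failures are often correlated, so treat this Python sketch as intuition only; it is not how Temporal designs or measures its SLA, and `parallel_availability` is a hypothetical helper.

```python
def parallel_availability(zone_availability: float, zones: int) -> float:
    """Availability of `zones` redundant replicas under an idealized model:
    failures are independent and any single healthy zone can serve all traffic."""
    if not 0 <= zone_availability <= 1 or zones < 1:
        raise ValueError("zone_availability must be in [0, 1] and zones >= 1")
    # The service is down only if every zone is down at the same time.
    return 1 - (1 - zone_availability) ** zones


for n in (1, 2, 3):
    print(f"{n} zone(s) at 99.9% each -> {parallel_availability(0.999, n):.9f}")
```

Under these idealized assumptions, going from one 99.9% zone to two cuts unavailability from 1 in 1,000 to 1 in 1,000,000. Correlated failures are why real-world gains are smaller, and why operational practices matter as much as replica count.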

Comparison Snapshot:

  • Standard SLA (e.g., 99.9%):
    Designed for most production workloads: strong reliability and built-in durability, with a larger downtime allowance in the rare event of a service disruption.
  • Higher SLA (e.g., 99.99% with HA):
    Architected for critical applications, with tighter replication and failover, plus stronger operational commitments around uptime and recovery.
  • Best for:
    Use the higher-HA tier for payment flows, telecom provisioning, identity/auth, durable ledgers, or AI/agent pipelines where missed steps or delays directly impact customers or revenue.

What changes for my team when we move to a higher SLA/HA configuration?

Short Answer: Your application code doesn’t change—Workflows, Activities, and Workers behave the same—but your risk profile improves: fewer disruptions, stronger uptime guarantees, and clearer escalation paths with 24/7 support options.

Expanded Explanation:
Temporal’s core promise—durable execution that survives crashes, timeouts, and outages—does not depend on the SLA tier. Even with a lower SLA, Workflows automatically capture state at every step, so they resume from the last recorded event once the Service is back. The HA/SLA tier affects how likely you are to hit an outage at all, and how tightly that uptime is contractually guaranteed.
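The resume-from-the-last-recorded-event idea can be illustrated with a toy, in-memory model in Python. It is not the Temporal SDK API and leaves out everything that makes the real system work (server-side persistence, deterministic replay, timers, retries); the `ToyHistory`, `run_step`, and `order_workflow` names are hypothetical. It only shows why completed steps are not redone after an interruption: each step's result is recorded, and on resume the recorded results are replayed instead of re-executed.

```python
from __future__ import annotations

from typing import Callable, Optional


class ToyHistory:
    """Hypothetical stand-in for a Workflow's event history: step name -> recorded result."""

    def __init__(self) -> None:
        self.events: dict[str, object] = {}


def run_step(history: ToyHistory, name: str, fn: Callable[[], object]) -> object:
    # Step already completed before the interruption: replay its recorded result.
    if name in history.events:
        return history.events[name]
    result = fn()                  # may raise if the simulated outage hits mid-step
    history.events[name] = result  # record only after the step succeeds
    return result


def order_workflow(history: ToyHistory, calls: list[str], fail_at: Optional[str] = None) -> object:
    def step(name: str) -> str:
        calls.append(name)  # log every real execution of a step
        if name == fail_at:
            raise RuntimeError(f"simulated outage during {name}")
        return f"{name}-done"

    run_step(history, "charge", lambda: step("charge"))
    run_step(history, "reserve", lambda: step("reserve"))
    return run_step(history, "ship", lambda: step("ship"))


history, calls = ToyHistory(), []
try:
    order_workflow(history, calls, fail_at="reserve")  # first attempt dies at "reserve"
except RuntimeError:
    pass
result = order_workflow(history, calls)  # resume: "charge" is replayed, not re-run
print(result, calls)
```

The charge step executes once even though the run was interrupted; on resume, only the interrupted step and the ones after it actually execute.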

Operationally, you’ll typically pair higher HA with stronger support: 24/7 coverage, defined response times, and expert services like design reviews and performance tuning. That gives both engineering and SRE teams a clearer runbook: when an incident threatens your SLA, you have a direct line to the people who built the platform.

What You Need:

  • Clear Criticality Map: A list of which namespaces/workflows are “must stay up” versus “can tolerate occasional downtime.”
  • Aligned Support Plan: A support tier that matches your SLA expectations (e.g., 24/7 support and defined response SLAs for incident handling).
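A criticality map doesn't need special tooling; a small table in code or config is enough. Here is a minimal Python sketch, assuming hypothetical namespace names and a simple rule: must-stay-up namespaces go on the 99.99%/HA tier, everything else on 99.9%.

```python
from enum import Enum


class Tier(Enum):
    STANDARD = "99.9%"
    HIGH_AVAILABILITY = "99.99% with HA"


# Hypothetical inventory: namespace -> "must stay up?"
CRITICALITY_MAP = {
    "payments-prod": True,
    "onboarding-prod": True,
    "nightly-reports": False,   # can tolerate occasional downtime
    "internal-tools": False,
}


def recommended_tier(must_stay_up: bool) -> Tier:
    return Tier.HIGH_AVAILABILITY if must_stay_up else Tier.STANDARD


plan = {ns: recommended_tier(critical) for ns, critical in CRITICALITY_MAP.items()}
for ns, tier in plan.items():
    print(f"{ns}: {tier.value}")
```

The resulting plan is a starting point for the conversation with your account team, not a provisioning request.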

How should we decide whether to choose 99.9% or 99.99% for our Temporal Cloud workloads?

Short Answer: Use 99.99% for revenue-critical, user-facing, or compliance-sensitive workflows; 99.9% can be sufficient for internal tooling, best-effort batch, or workloads where brief outages don’t materially harm the business.

Expanded Explanation:
In distributed systems, failures are inevitable: APIs fail, networks flake, and services crash. Temporal keeps those failures from derailing workflow completion, but your SLA choice determines how often the Temporal Service itself is allowed to be down and for how long.

Think in terms of business blast radius:

  • If Temporal Cloud is down for 30 minutes, what happens?
  • Do payments stop? Are users unable to onboard? Are regulatory time windows missed?
  • Or do you simply delay some nightly batch processing?

For high-blast-radius flows—money movement, provisioning network services, updating critical customer records, orchestrating AI workloads on expensive GPUs—favor 99.99% and HA. For lower-blast-radius flows—report generation, non-critical data pipelines—the 99.9% tier may be the more efficient choice.
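To put a rough number on blast radius, you can multiply an estimated cost of downtime per minute by each tier's monthly downtime allowance. This Python sketch assumes outage cost grows linearly with minutes, ignores any SLA credits, and uses made-up cost figures. It gives a worst-case bound (the Service using its full budget), not an expected value.

```python
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # average month, ~43,830 minutes


def worst_case_monthly_exposure(cost_per_minute: float, availability_pct: float) -> float:
    """Cost if the Service consumed its entire monthly downtime allowance.
    Assumes linear cost per minute and ignores SLA credits."""
    allowed_minutes = MINUTES_PER_MONTH * (1 - availability_pct / 100)
    return cost_per_minute * allowed_minutes


# Hypothetical cost-of-downtime estimates (currency units per minute).
estimates = {"payments": 5_000.0, "nightly-batch": 10.0}

for name, cost in estimates.items():
    standard = worst_case_monthly_exposure(cost, 99.9)
    high = worst_case_monthly_exposure(cost, 99.99)
    print(f"{name}: worst case {standard:,.0f} at 99.9% vs {high:,.0f} at 99.99% "
          f"(difference {standard - high:,.0f})")
```

If the difference for a workflow is small next to the price gap between tiers, as it likely is for the batch job here, 99.9% is probably the more efficient choice for it.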

Why It Matters:

  • Risk vs Cost Alignment: You pay for stronger guarantees where they actually protect revenue, SLAs, and brand, not everywhere by default.
  • Architectural Simplicity: Temporal’s durability already removes most of the retry/state-machine complexity; pairing the right SLA tier with that model keeps both your code and your operations simpler.

Quick Recap

Temporal Cloud gives you durable execution as a service: Workflows capture state at every step and can survive crashes, network issues, and service restarts without losing progress. Your SLA choice—99.9% vs 99.99%—controls how much downtime the Temporal Service itself can have and what level of HA and operational commitment backs your workloads. Higher-HA tiers use built-in replication and disaster recovery features to minimize disruptions for critical applications, while your code and Worker model stay the same. To request HA, you work with Temporal’s sales and support team to provision or upgrade your Cloud deployment to the appropriate SLA and region strategy.
