How do I set up SLOs and error budget burn alerts in Dynatrace for our critical services?

For most teams, the hardest part of reliability engineering isn’t agreeing on targets. It’s wiring SLOs and error budgets into day‑to‑day operations so that burn alerts trigger the right response at the right time. Dynatrace makes this practical by tying SLOs directly to real user experience, service health, and topology, and by pairing error‑budget burn alerts with causation‑based AI so that an alert points to a root cause rather than a bare threshold breach.

Below is a step‑by‑step guide to set up SLOs and error budget burn alerts in Dynatrace for your critical services, and turn them into actionable signals that can drive automation, tickets, and governance.


1. Decide what “critical” means and which SLOs you need

Before you click anything in Dynatrace, clarify:

  • Which services are truly critical?
    Typically: customer‑facing APIs, core microservices, payment or checkout flows, authentication, and key internal platforms (e.g., CI/CD, identity).

  • What type of SLO are you defining?

    • Availability / uptime SLO – “Service responds successfully X% of the time”
    • Latency / performance SLO – “p95 response time below Y ms”
    • User experience SLO – “X% of user sessions are ‘satisfied’”
    • Business transaction SLO – “Y% of checkouts complete successfully”
  • How will you measure it?
    With Dynatrace you can base SLOs on:

    • Service metrics (error rate, response time, throughput)
    • Real User Monitoring (RUM) or synthetic metrics
    • Business process or custom metrics (e.g., Business Flow or custom DQL‑backed metrics)

Deciding this upfront ensures the SLO you configure maps to real system behavior and can be enforced with error budget alerts.


2. Ensure full‑stack visibility for your critical services

Error budget burn alerts are only as good as the data behind them. For core services, make sure you have:

  • OneAgent deployed end‑to‑end

    • On all relevant application servers, services, and process groups
    • On Kubernetes/OpenShift clusters that host your microservices
    • On supporting infrastructure (VMs, hosts, cloud services where applicable)
  • Real user and synthetic monitoring configured where needed

    • RUM enabled for key web or mobile applications that depend on the service
    • Synthetic monitors for outside‑in checks of key endpoints and user flows
  • Clear service boundaries and names
    Dynatrace’s automatic discovery and real‑time topology mapping will identify services and dependencies. Give critical services meaningful names and tags (e.g., critical:true, team:checkout, service_tier:tier1) to make them easy SLO targets.

This topology‑first approach ensures your SLOs are defined on entities Dynatrace understands in context, from user impact down to underlying dependencies.


3. Create an SLO in Dynatrace for a critical service

In the Dynatrace UI:

  1. Navigate to SLOs
    • Go to Observe and explore > Service level objectives (wording may vary slightly by version).
  2. Add a new SLO
    • Click Create SLO (or Add SLO).
  3. Choose your SLO type
    • Availability‑based SLO
      • Example: “API availability SLO for /checkout”
      • Base it on successful request rate or error rate for the service.
    • Latency‑based SLO
      • Example: “p95 latency SLO for /auth”
      • Base it on response time metric at p90/p95/p99.
    • Custom / metric‑based SLO
      • Use custom metrics or DQL‑driven metrics (e.g., business transaction success ratio).
  4. Select the metric and filter to the critical service
    • Pick your key metric (e.g., Failure rate, Response time, Availability).
    • Filter by:
      • Service name or process group
      • Tags (e.g., service:checkout-api, critical:true)
      • Endpoint or request attributes if you only want specific operations
  5. Define your target and time window
    • Example targets:
      • 99.9% availability for tier‑1 APIs over 30 days
      • 99% “satisfied sessions” for your primary web app over 7 days
    • Choose a rolling time window that aligns with your operational cadence (commonly 7 or 30 days).
  6. Let Dynatrace calculate the error budget
    • Once you set the SLO target and time window, Dynatrace computes the error budget automatically:
      • Error budget = (100% – SLO target) * total opportunities
      • Example: 99.9% SLO over 30 days ⇒ 0.1% error budget
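To make the error budget arithmetic in step 6 concrete, here is a minimal Python sketch of the same formula. The function names are illustrative only and are not part of any Dynatrace API; the sketch assumes the budget is counted either in events (requests) or in minutes of the time window.

```python
def error_budget_events(target_pct: float, total_events: int) -> float:
    """Failed events allowed in the window: (100% - target) * total opportunities."""
    return (100.0 - target_pct) / 100.0 * total_events


def allowed_downtime_minutes(target_pct: float, window_days: int) -> float:
    """The same budget expressed as minutes of the time window."""
    return (100.0 - target_pct) / 100.0 * window_days * 24 * 60


def budget_remaining_pct(target_pct: float, total_events: int, failed_events: int) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    budget = error_budget_events(target_pct, total_events)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (1.0 - failed_events / budget) * 100.0


# A 99.9% target over 30 days allows roughly 43.2 minutes of full outage.
print(round(allowed_downtime_minutes(99.9, 30), 2))
# With 1,000,000 requests the budget is about 1,000 failures;
# 250 failures so far leaves about 75% of it.
print(round(budget_remaining_pct(99.9, 1_000_000, 250), 2))
```

Expressing the budget in minutes or failed requests tends to be easier for teams to reason about than a bare 0.1%.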

Save the SLO and verify that:

  • The SLO status is calculated correctly.
  • The error budget remaining and burn over time visuals look reasonable.
  • It is scoped only to the intended critical service and traffic.

4. Understand error budget burn behavior

In Dynatrace, your SLO view shows:

  • Current SLO status – whether you are meeting your target.
  • Error budget remaining / consumed – how much of the allowed failure you’ve used.
  • Burn rate over time – how fast the budget is being consumed.

To operationalize this:

  • Define what constitutes “too fast” burn
    For example:

    • “If we burn 20% of monthly budget in 2 hours, that’s a page.”
    • “If we cross 50% of monthly budget in the first week, we slow down releases.”
  • Decide on response levels
    Map burn situations to actions:

    • Low burn: continue as usual, monitor.
    • Medium burn: raise visibility to the team, investigate proactively.
    • High burn: page on‑call, initiate incident response, pause risky changes.

You’ll use these definitions to configure alert conditions and automation.
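One way to turn statements like "20% of monthly budget in 2 hours" into a comparable number is a burn rate: the share of budget consumed divided by the share of the window elapsed. A burn rate of 1 means the budget would last exactly the full window. The sketch below is plain Python under that definition; the `medium` and `high` cut-offs are hypothetical values chosen for illustration, not Dynatrace defaults.

```python
def burn_rate(budget_consumed_fraction: float, elapsed_hours: float, window_hours: float) -> float:
    """Budget consumed relative to time elapsed; 1.0 means 'on pace' for the window."""
    if elapsed_hours <= 0 or window_hours <= 0:
        raise ValueError("elapsed_hours and window_hours must be positive")
    return budget_consumed_fraction / (elapsed_hours / window_hours)


def response_level(rate: float, medium: float = 2.0, high: float = 10.0) -> str:
    """Map a burn rate to the low/medium/high response levels described above."""
    if rate >= high:
        return "high"    # page on-call, pause risky changes
    if rate >= medium:
        return "medium"  # raise visibility, investigate proactively
    return "low"         # continue as usual, monitor


WINDOW_HOURS = 30 * 24  # 30-day window

# 20% of the monthly budget in 2 hours is a burn rate of about 72.
fast = burn_rate(0.20, 2, WINDOW_HOURS)
# 50% of the budget in the first week is a burn rate of about 2.1.
slow = burn_rate(0.50, 7 * 24, WINDOW_HOURS)
print(round(fast, 1), response_level(fast))
print(round(slow, 1), response_level(slow))
```

Normalizing to a burn rate lets the same thresholds apply to SLOs with different windows and targets.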


5. Set up alerts on SLO and error budget burn

Dynatrace lets you set alerts “with one‑click in context” directly from the SLO. The goal is to alert on burn behavior, not just raw metric spikes.

5.1 Configure SLO‑based alerts

  1. Open the SLO you created

    • From the SLO overview, click into your critical service SLO.
  2. Create an alert from the SLO

    • Look for Alert settings, Create alert, or similar option on the SLO details page.
  3. Choose your trigger type. Typical patterns:

    • SLO status threshold
      • Trigger when SLO status falls below a certain percentage (e.g., < 99.5% on a 99.9% target).
    • Error budget remaining threshold
      • Trigger when error budget remaining drops below a defined value (e.g., < 50% remaining).
    • Error‑budget burn rate (where available via metrics/DQL)
      • Trigger when the burn rate exceeds a threshold (e.g., burning 5x faster than expected).
  4. Set thresholds aligned to your burn strategy. Example multi‑tier thresholds for a 30‑day, 99.9% SLO:

    • Warning (email/Teams/Slack)
      • Error budget remaining < 70%
      • Purpose: early awareness; investigate latent issues.
    • Severe (page on‑call)
      • Error budget remaining < 40% OR
        SLO status < 99.8% within the first week of the period.
      • Purpose: active incident response; protect remainder of budget.
    • Critical (change gate / auto‑action)
      • Error budget remaining < 20%
      • Purpose: enact “release freeze” or stricter controls via automation.
  5. Attach contextual information

    • Ensure the alert includes:
      • SLO name and target
      • Current error budget remaining
      • Recent burn trend
      • Link back to the SLO and affected service pages in Dynatrace
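The multi‑tier thresholds from step 4 can be written down as a small decision function, which is a useful way to review them with the team before configuring anything. This is a plain Python sketch of the example policy only; the function name and the `day_of_window` parameter are assumptions made for illustration, and Dynatrace evaluates its own alert conditions independently of this code.

```python
def classify_burn(budget_remaining_pct: float, slo_status_pct: float, day_of_window: int) -> str:
    """Apply the example tiers for a 30-day, 99.9% SLO, most severe first."""
    if budget_remaining_pct < 20:
        return "critical"  # change gate / auto-action
    first_week = day_of_window <= 7
    if budget_remaining_pct < 40 or (slo_status_pct < 99.8 and first_week):
        return "severe"    # page on-call
    if budget_remaining_pct < 70:
        return "warning"   # email/Teams/Slack
    return "ok"


print(classify_burn(85, 99.95, 3))   # ok
print(classify_burn(65, 99.92, 10))  # warning
print(classify_burn(35, 99.90, 12))  # severe
print(classify_burn(15, 99.90, 20))  # critical
```

Checking the most severe tier first ensures a single evaluation never reports a lower severity than the situation warrants.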

Dynatrace Intelligence backs these alerts with causation‑based analysis, so when they fire, they link directly to the underlying problem—avoiding generic “something is wrong” notifications and focusing attention on root causes.


6. Avoid alert fatigue with root‑cause‑driven notifications

Traditional SLO monitoring often generates alert storms when downstream systems fail. Dynatrace avoids this by using topology + causation‑based AI:

  • Topology awareness
    Real‑time service maps let Dynatrace understand all dependencies for your critical services—databases, message queues, APIs, Kubernetes components, and infrastructure.

  • Problem correlation and root cause
    Davis® AI analyzes metrics, logs, traces, and code‑level information to detect problems and determine their actual root cause, not just correlated symptoms.

  • Actionable alerts only
    Instead of alerting on every metric anomaly, Dynatrace:

    • Opens a Problem with a precise root‑cause statement
    • Connects it to impacted SLOs and error‑budget burn
    • Notifies you on the root‑cause problem, not every affected metric

You can further tune this via alerting profiles and maintenance windows so your SLO‑driven alerts match team expectations and business hours without losing critical coverage.


7. Integrate SLO and burn alerts into your workflows

Once you trust your SLOs and error budget alerts, you can shift from observation to preventive and autonomous operation.

7.1 Route alerts into your existing workflows

From Dynatrace’s alerting configuration:

  • Connect to incident‑management tools

    • Create or update incidents in systems like ServiceNow, Jira, Opsgenie, PagerDuty.
    • Include SLO and error budget context in the ticket payload.
  • Send notifications to collaboration channels

    • Push SLO violations or high burn events to Teams, Slack, or email lists.
    • Use different channels for warning vs. critical burn levels.
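As a rough illustration of severity‑based routing, the sketch below builds a notification payload that carries the SLO context and picks destinations by severity. Everything here is hypothetical: the channel identifiers, the payload field names, and the example URL are placeholders, not a Dynatrace or vendor schema. In practice the integration itself would be set up in Dynatrace's alerting configuration.

```python
# Hypothetical routing table; the destination names are placeholders.
ROUTES = {
    "warning": ["slack:#checkout-reliability"],
    "severe": ["pagerduty:checkout-oncall"],
    "critical": ["pagerduty:checkout-oncall", "servicenow:change-freeze"],
}


def build_notification(slo_name: str, target_pct: float, budget_remaining_pct: float,
                       severity: str, slo_url: str) -> dict:
    """Assemble a payload with the SLO context a responder needs to act."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "title": f"[{severity.upper()}] {slo_name}: "
                 f"{budget_remaining_pct:.1f}% error budget remaining",
        "slo_name": slo_name,
        "target_pct": target_pct,
        "budget_remaining_pct": budget_remaining_pct,
        "link": slo_url,
        "destinations": ROUTES[severity],
    }


note = build_notification("checkout-api availability", 99.9, 35.0,
                          "severe", "https://example.invalid/slo/checkout-api")
print(note["title"])
print(note["destinations"])
```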

7.2 Trigger automated remediation and governance with Workflows

Use Dynatrace Workflows to respond automatically to error budget burn events:

  • Automated remediation

    • Scale out a service in Kubernetes when burn accelerates.
    • Trigger a canary rollback or redirect traffic away from a problematic region.
    • Execute custom code (e.g., Lambda/Function) for specialized fixes.
  • Release and change gating

    • Implement quality gates in CI/CD using Dynatrace SLOs:
      • Block releases when the error budget for a critical service is below a threshold.
      • Require manual approval when SLO status is fragile (e.g., just above target).
    • Connect burn alerts to change‑management workflows to slow down or pause high‑risk changes when reliability is under pressure.
  • Forecasting and preventive actions

    • Use Dynatrace’s forecasting capabilities to alert on future SLO breaches:
      • If current trends indicate error budget exhaustion within a short time, trigger an early warning.
      • Start proactive investigations or capacity changes before users feel impact.
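The gating and forecasting ideas above reduce to two small calculations: a gate decision based on remaining budget, and a simple linear projection of when the budget runs out at the current burn rate. The Python sketch below shows both. The thresholds (`block_below`, `approval_below`) are assumed values, and the linear projection is a deliberate simplification that stands in for, and does not reproduce, Dynatrace's forecasting.

```python
import math


def hours_to_exhaustion(budget_remaining_fraction: float, rate: float, window_hours: float) -> float:
    """Linear projection: at burn rate `rate`, a full budget lasts window_hours / rate."""
    if rate <= 0:
        return math.inf  # nothing is being burned, so the budget never runs out
    return budget_remaining_fraction * window_hours / rate


def release_gate(budget_remaining_pct: float,
                 block_below: float = 20.0,
                 approval_below: float = 40.0) -> str:
    """Decide whether a release may proceed, based on remaining error budget."""
    if budget_remaining_pct < block_below:
        return "block"
    if budget_remaining_pct < approval_below:
        return "manual_approval"
    return "allow"


# Half the budget left, burning 5x faster than sustainable, 30-day window:
print(hours_to_exhaustion(0.5, 5.0, 30 * 24))  # 72.0 hours
print(release_gate(30.0))                       # manual_approval
```

A pipeline step could call a function like `release_gate` with the current budget value and fail the build on "block"; how that value is fetched is left out here on purpose.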

This is where SLOs move from reporting to active control: you’re not just measuring reliability, you’re governing it in real time.


8. Scale SLOs across many critical services

In larger environments, you’ll often manage dozens or hundreds of SLOs. To keep this manageable:

  • Standardize SLO templates

    • Define common patterns (e.g., “99.9% availability for tier‑1 APIs”, “99% satisfied sessions for core apps”).
    • Apply them across services using tags like service_tier:tier1, critical:true.
  • Use tagging and management zones

    • Group SLOs by:
      • Team or domain (e.g., team:payments)
      • Environment (prod vs non‑prod)
      • Business domain (e.g., domain:checkout, domain:identity)
    • Use management zones to expose only relevant SLOs and alerts to each team.
  • Central SRE governance with local ownership

    • SRE platform teams define global SLO patterns, alerting policies, and error budget policies.
    • Individual service teams own their SLOs and implement local automation and remediation using Workflows.
  • Use the SLO overview and dashboards for executive visibility

    • Provide leaders with a consolidated view of critical SLO health and error budget consumption.
    • Focus on trends and burn patterns, not just static pass/fail status.
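To illustrate the template idea, the sketch below keeps standard SLO patterns in one place and expands them for any service whose tags match. It is a minimal Python model, not a Dynatrace configuration format: the `SloTemplate` class, its fields, and the `slos_for` helper are hypothetical names, and the tag strings reuse the examples from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTemplate:
    name: str          # suffix appended to the service name
    target_pct: float  # SLO target
    window_days: int   # rolling evaluation window
    selector_tag: str  # services carrying this tag receive the SLO


# Standard patterns owned by the central SRE team (example values from this guide).
TEMPLATES = [
    SloTemplate("availability", 99.9, 30, "service_tier:tier1"),
    SloTemplate("satisfied sessions", 99.0, 7, "critical:true"),
]


def slos_for(service_name: str, tags: set) -> list:
    """Expand every template whose selector tag is present on the service."""
    return [
        {
            "name": f"{service_name} {t.name}",
            "target_pct": t.target_pct,
            "window_days": t.window_days,
        }
        for t in TEMPLATES
        if t.selector_tag in tags
    ]


for slo in slos_for("checkout-api", {"service_tier:tier1", "team:checkout"}):
    print(slo)
```

Keeping templates in one reviewed list means a change to a target is made once and applies consistently to every matching service.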

Even here, the goal is not more dashboards—it’s to route answers about SLO risk into the workflows and automation that matter most.


9. Governance, trust, and human oversight

When you connect SLOs and error budgets to automated actions, governance matters:

  • Trusted data and AI

    • Dynatrace’s Trust Center outlines how data protection, privacy, and Trusted AI principles underpin the platform.
    • Deterministic, causation‑based AI ensures that when a workflow triggers, it’s responding to explainable root cause, not opaque correlation.
  • Human‑in‑the‑loop for high‑impact actions

    • Keep humans in control of major decisions:
      • Require human acknowledgment before production rollbacks or large‑scale traffic shifts.
      • Use automation for safe, reversible actions; escalate to experts for structural changes.
  • Continuous review

    • Regularly review:
      • SLO definitions and targets (do they still reflect business expectations?).
      • Error budget policies (are you too lax or too strict?).
      • Automation behavior (did workflows take the right actions?).

This continuous governance loop transforms SLOs from static SLAs into living contracts between teams, with Dynatrace as the execution engine.


Final decision framework

To set up SLOs and error budget burn alerts in Dynatrace for your critical services and make them operational:

  1. Define what “critical” means and choose SLO metrics that map directly to user and business outcomes.
  2. Ensure full‑stack coverage with OneAgent, RUM, and synthetics so SLOs reflect real behavior.
  3. Create SLOs directly in Dynatrace, let the platform compute error budgets, and monitor burn.
  4. Configure SLO and error budget alerts in context, using thresholds that reflect your risk tolerance.
  5. Rely on Dynatrace Intelligence to cut alert noise and focus on root‑cause‑driven notifications.
  6. Integrate alerts into incident workflows and use Workflows to automate remediation, quality gates, and preventive actions.
  7. Scale and govern SLOs across teams with standardized policies and ongoing human oversight.

Once these pieces are in place, SLOs and error budget burn alerts stop being abstract SRE theory and become concrete levers you can use to protect reliability, control change, and safely advance toward preventive and autonomous operations.
