How do teams use AI during on-call to triage incidents faster without making risky changes?

On-call always collapses two competing pressures into the same minute: “fix this now” and “don’t break anything else.” AI can help you triage incidents much faster, but only if you design the system to keep humans in control of production changes, not bypass them.

This is where the distinction between “AI autocomplete” and “AI Droids in the war room” matters. The former suggests snippets in your IDE. The latter takes delegated tasks across Slack/Teams, terminals, and repos, then returns artifacts you can review: hypotheses, diffs, runbooks, and PRs.

Below, I’ll walk through how teams actually use AI during on-call to reduce MTTR without creating new risks, and how patterns from Factory’s Droids map onto a safe operating model.


Quick Answer: How teams safely use AI on-call

Most high-velocity teams that use AI during incidents follow a simple pattern:

  1. Use AI to see more, faster
    Let Droids collect context across logs, metrics, traces, tickets, and code so engineers don’t spend the first 20 minutes just catching up.

  2. Keep humans in the change loop
    AI proposes hypotheses, diagnostics, and patches. Humans decide what to run, what to roll out, and when to revert.

  3. Enforce permissions and auditability
    The AI only sees what the engineer already has access to, and every query, suggestion, and patch is logged for post-incident review.

In practice, that looks like:

  • Droids in Slack/Teams during an incident thread.
  • Droids in the terminal/IDE for deeper diagnostics.
  • Droids in project trackers to link incident → fix → PR.

The AI accelerates triage and response while your existing controls (RBAC, code review, CI, change management) keep risk bounded.


At-a-glance: Where AI fits in the on-call lifecycle

You can think of AI participation during on-call in four phases:

| Phase | AI Role | Outcome | Risk Guardrail |
|---|---|---|---|
| 1. Detection & Triage | Summarize alerts, correlate signals, surface likely blast radius | Faster understanding of “how bad is it?” | Read-only access to observability + strict permissions |
| 2. Diagnosis | Hypothesis generation, log/trace/code analysis, config diffing | Shorter time to root cause candidates | No direct infra writes; suggestions only |
| 3. Mitigation | Propose runbook steps, safe config toggles, and patch diffs | Faster mitigation with less guesswork | Human approves commands/PRs; CI/CD still gates deploys |
| 4. Post-incident | Timeline reconstruction, impact analysis, documentation | Better learning with less toil | Full audit trail; AI operates on incident artifacts |

Done right, the AI is the fastest reader in the room, not an autonomous production operator.


How Droids show up during incidents

Factory’s core design assumption is simple: on-call engineers don’t have time to teach tools what’s going on. So Droids are built to drop into the existing war room and terminals instantly rather than require a fresh setup.

Droids in the war room (Slack/Teams)

When an incident kicks off:

  1. Triage from the incident channel

    • An engineer posts the initial alert payload or links the OpsGenie/PagerDuty incident.
    • A Droid is mentioned (@droid triage this) in Slack/Teams.
    • The Droid pulls context from the alert, relevant runbooks, recent deployments, and linked services.

    It responds with a compact triage summary:

    • What changed recently (code, config, infra) in affected services.
    • Top candidate failure modes based on logs/metrics.
    • Suggested severity, potential blast radius, and immediate “stop-the-bleeding” actions (rate limiting, feature flag toggles, traffic shifts).
  2. Guided investigation, not blind commands
    Droids don’t ssh into prod on their own. Instead they:

    • Suggest focused log queries in your logging stack.
    • Generate kubectl or CLI commands you can paste into a secure terminal.
    • Highlight suspicious traces, slow queries, or recent schema changes.

    Engineers keep hands on the keyboard. Droids reduce the search space.

  3. Cross-team translation
    During incidents, product and support often pile into the same channel. Droids help:

    • Generate non-technical summaries (“which customers are impacted, and how?”).
    • Translate technical state into service/status page-ready language.
    • Reduce the back-and-forth Q&A that adds 24-hour delays across time zones.

Outcome: On-call can move from chaos to a shared plan in minutes, with Droids doing the clerical work: context gathering, summarizing, correlating.
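
The “what changed recently” part of that triage summary can be sketched as a filter over change events. This is only an illustration, not how Droids are implemented: the `ChangeEvent` type, its fields, and the two-hour lookback window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of a code, config, or infra change."""
    service: str
    kind: str          # "code", "config", or "infra"
    description: str
    at: datetime


def recent_changes(events, affected_services, alert_time, lookback=timedelta(hours=2)):
    """Return changes to affected services inside the lookback window, newest first."""
    window_start = alert_time - lookback
    hits = [
        e for e in events
        if e.service in affected_services and window_start <= e.at <= alert_time
    ]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

Sorting newest first puts the change closest to the alert at the top of the summary, which is usually the first thing an on-call engineer wants to rule in or out.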

Droids where you code (IDE / terminal)

Once you’ve narrowed down the likely component:

  1. Jump from incident to code

    • From Slack/Teams, a Droid can link directly to the suspected repo and file.
    • In VS Code, JetBrains, or Vim, the same Droid already knows the incident context.

    It can:

    • Explain how the failing service works, using your code and documentation.
    • Identify risky recent commits touching the relevant path.
    • Highlight potential N+1 patterns, unbounded retries, or error handling gaps.
  2. Generate and evaluate hypotheses
    The on-call engineer can ask:

    • “Why would this endpoint start timing out under increased load?”
    • “Show code paths where this error message can be emitted.”
    • “Compare current config for service X between staging and prod.”

    The Droid returns:

    • Specific call stacks and code references.
    • Config diffs and probable misconfigurations.
    • Links to prior incidents or tickets with similar failure signatures.
  3. Draft safe patches and tests
    Instead of hand-writing a fix at 3am:

    • The Droid proposes a patch: guardrails, timeouts, fallbacks, or feature flag integration.
    • It generates tests reproducing the failure mode where feasible.
    • It produces a candidate PR description referencing the incident.

    Crucially, it stops at the PR boundary. Humans still run tests locally or via CI and decide whether to merge.
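
One of the questions above, comparing config for a service between staging and prod, reduces to a dictionary diff once both configs are loaded. The sketch below assumes flat key/value configs that are already in memory; the function name and the `<missing>` marker are hypothetical, and real configs may need flattening first.

```python
MISSING = "<missing>"  # display marker for keys that exist on only one side


def diff_config(staging, prod):
    """Compare two flat config dicts and return only the keys whose values differ."""
    diff = {}
    for key in sorted(staging.keys() | prod.keys()):
        s = staging.get(key, MISSING)
        p = prod.get(key, MISSING)
        if s != p:
            diff[key] = {"staging": s, "prod": p}
    return diff


if __name__ == "__main__":
    staging = {"timeout_ms": 500, "retries": 3}
    prod = {"timeout_ms": 5000, "retries": 3, "pool_size": 10}
    for key, sides in diff_config(staging, prod).items():
        print(key, sides)
```

The output is read-only evidence: it tells the engineer where to look, and any correction still goes through the normal config change process.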

Droids in your backlog

For larger fallout—like a deep refactor or structural fix—AI still plays a role after the fire is out:

  • When an incident ticket is created or updated, Droids can:
    • Generate follow-up tasks (e.g., “add SLOs for endpoint X,” “improve retry policy for dependency Y”).
    • Link code-level artifacts (PRs, runbooks, diagrams) to the original incident.
    • Prepare a skeleton postmortem from the incident channel logs and deployment timeline.

Teams don’t forget what they learned under pressure, because the system codifies it.


How AI makes triage faster without increasing risk

Speed without more risk depends on architecture and controls, not just model quality. These are the patterns I see working in production.

1. Strict permissions and environment isolation

On-call engineers need AI that respects existing access controls:

  • Strict permissions enforcement
    Droids only see what the caller can see in source systems—repos, ticketing, logs. No elevated access, no cross-tenant data bleed.

  • Single-tenant, sandboxed environments
    Each organization runs in its own isolated environment (dedicated VPC), with data encrypted in transit (TLS 1.2+) and at rest (AES-256). That matters when your incident involves customer data or regulated workloads.

  • No surprise training on your incidents
    Teams don’t want their worst outage becoming someone else’s training data. Factory’s stance: customer code and incident data are not used as training data without prior written consent.

These controls let you safely pipe real incident data through Droids without creating a compliance or IP headache.
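
The “only sees what the caller can see” rule can be sketched as a filter applied with the engineer’s identity rather than the agent’s. The `Resource` type and the group-based model are assumptions for illustration; a real deployment would defer to the source systems’ own access checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical handle to a repo, log stream, or ticket queue."""
    name: str
    required_group: str


def visible_to_caller(caller_groups, resources):
    """Return only the resources the calling engineer could already read.

    The agent has no groups of its own, so it can never see more than the caller.
    """
    return [r for r in resources if r.required_group in caller_groups]
```

The design choice worth noting is that the agent holds no standing permissions: an empty set of caller groups yields an empty view.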

2. Read-heavy, write-light design during incidents

The fastest way to create new outages is to let tools write to prod directly. A safer pattern:

  • AI reads from observability, tickets, code, and config.
    It builds a high-fidelity picture of the system state.

  • AI proposes artifacts; humans execute.

    • Shell commands to run in your secure terminal.
    • Rollback steps using your deployment tooling.
    • Config toggles using your existing feature flag system.
    • Code patches in PRs.

You can even formalize this: no AI-owned credentials for production changes; only humans hold deploy and infra-write permissions.
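
A minimal sketch of that rule: the AI produces a proposal object, and nothing runs until a named human approves it. `ProposedAction`, `approve`, and `execute` are hypothetical names, and the `runner` callable stands in for whatever the engineer’s own terminal or deployment tooling does.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """An artifact the agent produces; it carries no ability to run itself."""
    command: str
    rationale: str
    approved_by: Optional[str] = None


def approve(action: ProposedAction, engineer: str) -> ProposedAction:
    """Record which human signed off on the proposal."""
    action.approved_by = engineer
    return action


def execute(action: ProposedAction, runner: Callable[[str], object]):
    """Run the command through the human-supplied runner, but only if approved."""
    if action.approved_by is None:
        raise PermissionError("refusing to run an unapproved AI proposal")
    return runner(action.command)
```

Because the runner is passed in by the human’s session, the credentials stay with the engineer and the proposal itself remains plain data that can be logged and reviewed.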

3. Explicit planning and tool usage

During an incident, “agent magic” isn’t helpful. Deterministic planning is.

Factory’s Droids follow explicit plans:

  1. Discover environment and context: what service, what alerts, what recent changes.
  2. Choose minimal tools: logs, traces, git history, config snapshots.
  3. Generate a short, inspectable plan: “First, confirm error rate spike; then check recent deploy; then diff config; then propose mitigations.”
  4. Execute step by step, surfacing intermediate findings.

Because the plan is visible and compact, engineers can interrupt or re-route it when new information shows up. That keeps human operators in charge while still gaining speed.
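
To make “visible and interruptible” concrete, here is a small sketch of a plan as an ordered list of read-only steps with an operator hook between them. The `Step` type and the `should_continue` callback are assumptions for the example, not a description of Factory’s planner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    run: Callable[[], str]   # a read-only check that returns a finding


def run_plan(steps, should_continue=lambda finding: True):
    """Execute steps in order, surfacing each finding; stop when the operator says so."""
    findings = []
    for i, step in enumerate(steps, start=1):
        finding = step.run()
        findings.append(f"{i}. {step.description}: {finding}")
        if not should_continue(finding):
            break
    return findings
```

The plan is just data, so an engineer can read it before it starts, and the callback gives a natural point to stop or re-route after any intermediate finding.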

4. Auditability and post-incident learning

If an AI contributed to resolving an incident, it should be visible in the postmortem.

  • Full audit logs
    Every significant interaction—queries, code suggestions, triage summaries, and generated commands—can be logged and exported to your SIEM. You can answer: “What did the Droid suggest, and what did we run?”

  • Traceability from incident to PR
    Droids can tag PRs and code edits with incident references. Factory Analytics then lets teams see:

    • How many files were touched for a fix.
    • How many PRs the incident spawned.
    • Time from incident creation to first remedial PR.

This turns AI from a black box into a traceable participant, which is essential if you later need to explain to leadership or auditors how decisions were made.
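
An audit entry that answers “what did the Droid suggest, and what did we run?” might look like the JSON line below. The field names are hypothetical, and how the lines reach a SIEM is left out; the point is that suggestion and execution are recorded together under one incident ID.

```python
import json
from datetime import datetime, timezone


def audit_record(incident_id, actor, kind, content, executed_by=None, now=None):
    """Build one JSON line describing an AI suggestion and who, if anyone, ran it."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "ts": ts,
            "incident_id": incident_id,   # correlation key back to the incident
            "actor": actor,               # e.g. "droid" or an engineer's handle
            "kind": kind,                 # e.g. "query", "command", "patch"
            "content": content,
            "executed_by": executed_by,   # None means suggested but never run
        },
        sort_keys=True,
    )
```

Keeping `executed_by` as an explicit field, even when empty, lets a postmortem separate what was proposed from what was actually done.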


Concrete examples of safe AI usage during on-call

Here are a few patterns I’ve seen work repeatedly.

Pattern 1: Rapid RCA candidate generation

Scenario: Latency spikes in a payments API after a routine deployment.

Without AI:

  • On-call spends 15–30 minutes parsing Grafana, Kibana, and recent PRs.
  • Knowledge of obscure code paths lives in one senior engineer’s head.

With Droids:

  • In the incident Slack channel, a Droid:
    • Correlates latency increase with a specific deployment.
    • Flags that the deployment added a synchronous call to a slower dependency.
    • Shows the code diff and the specific function responsible.

Guardrails:

  • Droid suggests a rollback and a toggled fallback path.
  • Engineer runs kubectl rollout undo or triggers the rollback in the existing deployment system.
  • Patch to reintroduce async behavior is pushed as a PR and reviewed.

Result: Faster root cause identification, no unsupervised production change.
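
The correlation step in this pattern can be approximated with a simple time-window heuristic: pick the most recent deployment that finished shortly before the spike. This is a deliberately naive sketch under assumed inputs (a list of `(deploy_id, finished_at)` tuples and a 30-minute window); a real correlation would also weigh which services the deployment touched.

```python
from datetime import datetime, timedelta


def likely_culprit(deploys, spike_start, max_gap=timedelta(minutes=30)):
    """Return the latest deploy that finished before the spike and within max_gap.

    deploys: list of (deploy_id, finished_at) tuples. Returns None if none qualify.
    """
    candidates = [
        (finished_at, deploy_id)
        for deploy_id, finished_at in deploys
        if timedelta(0) <= spike_start - finished_at <= max_gap
    ]
    return max(candidates)[1] if candidates else None
```

The result is a candidate, not a verdict: it narrows where the engineer reads the diff first, and the rollback decision stays with the human.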

Pattern 2: “I’ve never seen this service before” on-call handoff

Scenario: A new engineer is on call for a legacy service with sparse documentation.

With Droids in the war room and IDE:

  • The engineer asks in Slack: “Explain how service X handles authentication and where it calls service Y.”
  • The Droid responds with:
    • A high-level flow description.
    • Code references to key auth middleware and downstream calls.
    • A generated diagram of the request path.

If an incident triggers, that same context is reused in the terminal and IDE. The engineer doesn’t need to re-ask or re-explain.

Guardrails remain the same: the Droid never bypasses git review or deployment gates; it just compresses the time it takes the engineer to become competent.

Pattern 3: Structured postmortems, less toil

Scenario: After a major outage, the team needs a post-incident review document by morning.

With Droids:

  • The Droid ingests:
    • The incident Slack/Teams thread.
    • Alert timelines and deployment events.
    • Linked PRs and runbooks.

It outputs a draft postmortem with:

  • Chronological timeline of events and decisions.
  • Impact summary and key metrics.
  • Root cause hypothesis and confirmed contributing factors.
  • Action items with suggested owners and links back to tickets.

Humans refine, correct, and approve, instead of rebuilding the timeline from scratch. Risk doesn’t increase; institutional learning does.
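
The timeline portion of that draft is mostly a merge of timestamped events from several feeds. A minimal sketch, assuming each feed has already been reduced to `(timestamp, source, text)` tuples (that tuple shape is my assumption, not a Factory format):

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, text) events from several feeds into one sorted list."""
    merged = sorted((event for src in sources for event in src), key=lambda e: e[0])
    return [f"{ts:%H:%M} [{source}] {text}" for ts, source, text in merged]


if __name__ == "__main__":
    chat = [(datetime(2024, 1, 1, 12, 10), "chat", "rollback proposed")]
    alerts = [(datetime(2024, 1, 1, 12, 0), "alert", "p99 latency high")]
    deploys = [(datetime(2024, 1, 1, 11, 50), "deploy", "payments release")]
    for line in build_timeline(chat, alerts, deploys):
        print(line)
```

Tagging each line with its source keeps the draft checkable: a reviewer can trace any entry back to the thread, alert, or deployment record it came from.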


How this changes MTTR in practice

When you instrument this properly (via Factory Analytics or your own OTEL setup), you can track:

  • Time from first alert → first Droid triage response.
    Typically seconds, not minutes.

  • Time from incident start → first plausible root cause candidate.
    Often 30–60% shorter because the AI parallelizes log/code/config analysis.

  • Time from PR creation → approval for incident-related fixes.
    Customers using Review Droids for automated code review have seen reductions of up to 50% here, since reviewers get structured feedback and tests alongside the PR.

  • Reduction in cross-timezone Q&A lag.
    AI-generated flowcharts and impact summaries can cut product/dev back-and-forth by roughly 50% in practice.

Taken together, these shifts lower MTTR not by bypassing safeguards, but by compressing everything around them: context gathering, analysis, and coordination.
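
If you track these intervals yourself, the arithmetic is just differences between milestone timestamps. The milestone names below are hypothetical; map them to whatever your incident tooling actually records.

```python
from datetime import datetime

# Hypothetical milestone names, paired as (start, end) for each interval.
INTERVALS = {
    "alert_to_first_triage": ("first_alert", "first_triage_response"),
    "start_to_root_cause_candidate": ("incident_start", "first_root_cause_candidate"),
    "pr_created_to_approved": ("pr_created", "pr_approved"),
}


def incident_intervals(milestones):
    """Return each interval in seconds, skipping any whose milestones are missing."""
    out = {}
    for name, (start, end) in INTERVALS.items():
        if start in milestones and end in milestones:
            out[name] = (milestones[end] - milestones[start]).total_seconds()
    return out
```

Skipping incomplete intervals rather than guessing keeps the numbers honest when an incident never produced a PR or was closed before a root cause was named.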


Implementation checklist: Using AI during on-call without risky changes

If you’re designing or evaluating an AI system for on-call, use this as a quick checklist:

  1. Surfaces

    • Available in Slack/Teams for war room triage.
    • Available in IDE/terminal for code-level diagnosis.
    • Integrated with your incident/ticket system for traceability.
  2. Access & security

    • Strict permissions: AI only sees what the on-call can see.
    • Single-tenant or strong tenant isolation (own VPC).
    • TLS 1.2+ in flight and AES-256 at rest.
    • Clear policy: incident data not used as training data without consent.
  3. Risk controls

    • No direct production writes by the AI (deploys, DB changes); changes reach production only through explicitly human-triggered pipelines.
    • All code changes proposed as PRs, not force-pushed.
    • All AI-suggested commands visible to humans before execution.
  4. Observability

    • Audit logs exportable to your SIEM.
    • Correlation IDs linking incidents ↔ AI actions ↔ PRs.
    • MTTR and “autonomy ratio” (how much of the workflow Droids handle) measured via analytics.
  5. Agent design

    • Explicit, inspectable plans for incident workflows.
    • Minimal, robust tool schemas for logs, code, config.
    • Error handling and timeouts that degrade gracefully (e.g., partial triage instead of failing the whole task).

If these boxes are checked, you can safely let AI sit in the war room and at your terminal while keeping humans in charge of risk.
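
The last checklist item, degrading to a partial triage instead of failing the whole task, can be sketched with a deadline over parallel checks. The check names and the `partial_triage` function are assumptions for illustration, and the `cancel_futures` argument requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def partial_triage(checks, timeout_s=5.0):
    """Run independent read-only checks in parallel and report what finished in time.

    checks: dict mapping a check name to a zero-argument callable.
    Returns (results, skipped), where skipped lists checks that missed the deadline.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {pool.submit(fn): name for name, fn in checks.items()}
    done, not_done = wait(futures, timeout=timeout_s)

    results = {}
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception as exc:  # one failing check should not sink the others
            results[name] = f"failed: {exc}"

    skipped = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False, cancel_futures=True)  # do not block on stragglers
    return results, skipped
```

Returning the skipped checks explicitly matters as much as the results: the engineer can see which signals are missing from the summary rather than assuming they came back clean.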


Final verdict

Teams use AI during on-call to triage incidents faster by turning Droids into the connective tissue between alerts, code, and collaboration—not into unsupervised operators. The winning pattern is:

  • AI as the fastest analyst in the room: reading logs, code, and tickets across tools.
  • Humans as final operators: approving commands, PRs, and deploys.
  • Enterprise controls as the guardrails: strict permissions, audit logs, and isolated environments.

This is how you lower MTTR, reduce on-call anxiety, and keep your change management and compliance posture intact.

