Durable Workflow Orchestration

How do I build an AI agent workflow that can retry tool calls and keep progress even if a step crashes?

12 min read

Most teams building AI agents run into the same nightmare: a single slow or flaky tool call crashes the run, you lose partial progress, and you’re left piecing together what happened from logs and ad‑hoc trace IDs. That’s fine for a weekend project—not for a production agent coordinating multiple tools and model calls.

In this guide, I’ll walk through how to build an AI agent workflow that:

  • Retries tool calls automatically on failure
  • Checkpoints progress at each step
  • Resumes from the last successful step instead of starting over
  • Survives crashes, timeouts, and deploys without losing context

All using Inngest’s event-driven, durable execution model: inngest.createFunction() + step.run().


Quick Answer:
Use Inngest to model your agent as a durable workflow where each tool call is a step.run() with automatic retries and checkpointing. When a step fails, Inngest retries it with backoff; once a step succeeds, its result is persisted so your agent can resume from that step even if the process crashes or is redeployed.


At-a-Glance Comparison

There are three common ways teams try to get reliability for AI agents:

| Rank | Option | Best For | Primary Strength | Watch Out For |
|------|--------|----------|------------------|---------------|
| 1 | Inngest Durable Agent Workflow | Production multi-step AI agents | Code-level durability with step.run() and Traces | Requires adopting Inngest’s model (functions + steps) |
| 2 | Custom Queue + Workers | Teams with heavy infra investment | Full control over queues/workers | You own retries, idempotency, DLQs, and tooling |
| 3 | Single Lambda / Serverless Function | Simple, short-lived agents | Easy to prototype | Timeouts, no checkpointing, hard-to-debug failures |

Comparison Criteria

We’ll evaluate each approach using three reliability criteria that matter for AI agent workflows:

  • Durability & Checkpointing:
    Can the workflow survive crashes, deploys, and timeouts without losing progress? Can it resume from the last successful tool call?

  • Retries & Idempotency:
    Are retries automatic and safe? Can you avoid duplicate side effects (e.g., sending the same email twice, double-writing to a CRM)?

  • Observability & Recovery:
    When something goes wrong, can you see each tool call’s inputs/outputs, understand what failed, and replay safely—without custom admin tooling?


1. Inngest Durable Agent Workflow (Best overall for production AI agents)

Inngest is the top choice when you need your AI agent to be reliable by default: each step.run() is a durable, retriable unit of work with automatic checkpointing and built-in observability via Traces.

Core Idea: Model each tool call as a Step

Instead of one big “agent” function that calls tools inline, you wrap each call in a Step:

import { inngest } from "./client";

export const agentWorkflow = inngest.createFunction(
  { id: "agent-workflow" },
  { event: "agent/run" },
  async ({ event, step }) => {
    // 1. Plan
    const plan = await step.run("generate-plan", async () => {
      // Call your LLM to generate a plan/tool sequence
      return await llm.generatePlan(event.data.query);
    });

    // 2. Execute tools in sequence
    const toolResults = [];
    for (const toolCall of plan.toolCalls) {
      const result = await step.run(
        `tool-${toolCall.name}`,
        async () => await runTool(toolCall)
      );
      toolResults.push(result);
    }

    // 3. Summarize
    const summary = await step.run("summarize", async () => {
      return await llm.summarize({ query: event.data.query, toolResults });
    });

    return { summary, toolResults };
  }
);

Mechanics → Outcome:

  • Each step.run() is a code-level transaction
  • On failure, Inngest retries the step with backoff
  • On success, Inngest checkpoints the result
  • If your process crashes or you redeploy, the workflow resumes from the last successful step, not from the beginning

What it does well

1. Durability & Checkpointing

Every Step is durable. In practice:

  • If tool-analytics times out on call 7 of 10, only that step is retried
  • Steps 1–6 are preserved; you don’t re-hit those tools or re-run the LLM plan generation
  • If your runtime (edge, serverless, or traditional) restarts, the next run picks up from the failed step

You get this just by using step.run()—no custom state store, no homegrown “resume token” logic.
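Conceptually, the checkpointing model works like a memo table keyed by step ID: results of completed steps are persisted, and a re-run skips straight past them to the first step that hasn’t succeeded. Here’s a toy, in-memory illustration of that idea (not Inngest’s actual implementation, just the mechanic):

```typescript
// Toy model of step checkpointing: completed step results are persisted,
// so a re-run (after a crash) replays from the memo instead of re-executing.
type Checkpoint = Map<string, unknown>;

function makeStepRunner(checkpoint: Checkpoint) {
  return function run<T>(stepId: string, fn: () => T): T {
    if (checkpoint.has(stepId)) {
      // Step already succeeded in a previous attempt: reuse its result.
      return checkpoint.get(stepId) as T;
    }
    const result = fn(); // may throw; nothing is checkpointed on failure
    checkpoint.set(stepId, result); // persist on success
    return result;
  };
}

// First attempt: "plan" succeeds, "tool-search" crashes mid-run.
const checkpoint: Checkpoint = new Map();
let planCalls = 0;
const run = makeStepRunner(checkpoint);
try {
  run("plan", () => { planCalls++; return "the-plan"; });
  run("tool-search", () => { throw new Error("timeout"); });
} catch {}

// Second attempt (after the "crash"): "plan" is NOT re-executed.
const replay = makeStepRunner(checkpoint);
const plan = replay("plan", () => { planCalls++; return "the-plan"; });
```

The real system persists the checkpoint outside your process, which is what lets it survive restarts and deploys.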

2. Automatic Retries with Backoff

Retries are built in at the step level. You can customize behavior per tool:

const result = await step.run(
  "tool-search",
  {
    retry: {
      attempts: 5,
      maxDuration: "10m",
      factor: 2,             // exponential backoff
      minDelay: "2s"
    }
  },
  async () => {
    return await searchTool({ query: event.data.query });
  }
);

Mechanics:

  • Transient failures (timeouts, 500s, network flakiness) are retried
  • Each attempt is tracked in Traces, so you see every retry, not just the final failure
  • Once a step succeeds, its result is checkpointed and memoized for that workflow run, so the step’s code won’t execute again on later retries or resumes
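As a rough mental model of that schedule: with a factor of 2 and a 2s minimum delay, retry delays grow geometrically. This is a simplified sketch of exponential backoff, not Inngest’s exact algorithm (real schedules typically add jitter and cap the total duration):

```typescript
// Simplified exponential backoff: delay(n) = minDelay * factor^n.
// Illustrates the pattern only; the platform's actual schedule may differ.
function backoffDelaysMs(
  retries: number,
  minDelayMs: number,
  factor: number
): number[] {
  return Array.from({ length: retries }, (_, n) => minDelayMs * factor ** n);
}

// 4 retries, 2s minimum delay, factor 2 → waits of 2s, 4s, 8s, 16s.
const delays = backoffDelaysMs(4, 2_000, 2);
```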

3. Idempotency Without Reinventing It

Because Inngest checkpointing ensures a successful step won’t re-run, a lot of idempotency pain vanishes:

  • If you send an email or write to a CRM in a Step, it won’t run again after success—even if the worker crashes later
  • If you need extra safety, you can still add idempotency keys at the tool layer, but you’re not relying on them for basic resilience
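If a tool has side effects you absolutely cannot duplicate, tool-layer idempotency looks like this sketch. The `sendEmail` function and the key scheme are hypothetical, and the store is in-memory; production would use Redis or Postgres with a TTL:

```typescript
// Hypothetical tool-layer idempotency: skip side effects already performed
// for the same (run, step) key, even if the caller retries.
const performed = new Set<string>();
let emailsSent = 0;

function sendEmail(to: string, body: string): void {
  emailsSent++; // stand-in for a real email API call
}

function sendEmailIdempotent(key: string, to: string, body: string): boolean {
  if (performed.has(key)) return false; // already sent; do nothing
  sendEmail(to, body);
  performed.add(key);
  return true;
}

// Same key twice → only one email actually goes out.
sendEmailIdempotent("run_123:notify-user", "a@example.com", "done");
sendEmailIdempotent("run_123:notify-user", "a@example.com", "done");
```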

4. Observability & Recovery with Traces and Replay

Inngest Cloud gives you instant Traces for each agent run:

  • Step-by-step view of the workflow
  • Step inputs/outputs (including prompts and LLM responses for AI workflows)
  • Retry attempts and error messages
  • Duration per step

When something goes wrong, you can:

  • Query: filter runs by tenant, event data, error message, or step
  • Cancel: stop noisy or stuck runs in bulk
  • Replay: rerun failed workflows from the last successful checkpoint—no bespoke admin panel

For AI agents, this is especially valuable: you can inspect every prompt / response pair and see which tool call caused the issue.

5. Flow Control for Multi-Tenant Agents

If your agent runs per tenant or per user, multi-tenant flow control matters:

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    concurrency: {
      key: "event.data.accountId", // per-tenant
      limit: 3                     // max 3 runs per tenant
    }
  },
  { event: "agent/run" },
  async ({ event, step }) => { /* ... */ }
);

Mechanics → Outcome:

  • Concurrency keys prevent noisy neighbors: one tenant’s heavy load won’t starve others
  • Throttling and rate limits let you respect external API quotas without rewriting the agent logic
  • Prioritization lets you make some runs (e.g., user-facing) preempt background ones
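Throttling and prioritization are likewise declared in the function config rather than in agent logic. A sketch of what that might look like (check Inngest’s flow-control docs for the exact option shapes; the `interactive` field is a hypothetical event property):

```typescript
import { inngest } from "./client";

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    // Per-tenant fairness: at most 3 concurrent runs per account.
    concurrency: { key: "event.data.accountId", limit: 3 },
    // Smooth bursts so downstream APIs see at most 10 starts per minute.
    throttle: { limit: 10, period: "1m" },
    // Boost user-facing runs ahead of background ones; the expression
    // returns a priority offset in seconds.
    priority: { run: "event.data.interactive ? 120 : 0" },
  },
  { event: "agent/run" },
  async ({ event, step }) => { /* ... */ }
);
```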

Tradeoffs & Limitations

  • You adopt the Inngest model: functions + events + steps. That’s a shift from “just drop everything in one Lambda.”
  • You still design your agent architecture (planning, tool selection, memory). Inngest doesn’t decide how your agent reasons—it ensures those decisions execute reliably.
  • For extremely simple, single-call agents, this can feel like “more structure than you need” (though you’ll usually grow into needing it).

Decision Trigger

Choose the Inngest durable agent workflow if:

  • Your agent makes multiple tool calls or LLM calls per request
  • You care about not losing partial progress when something crashes
  • You want Retries, Checkpointing, and Traces out of the box and don’t want to rebuild workers, queues, and recovery tooling

2. Custom Queue + Workers (Best for teams that want full infra control)

The classic approach: you wire up a queue (SQS, RabbitMQ, Kafka) plus workers (Lambda, containers, K8s) and encode your agent workflow in jobs and messages.

What it does well

  • Full control over how jobs are scheduled, retried, and routed
  • Can be tuned precisely to your infra (custom backoff, DLQ routing, prefetch, etc.)
  • Works well if you already have a mature queue/worker stack and an SRE team maintaining it

Where it hurts AI agent workflows

1. You’re rebuilding Steps and Checkpointing

To get behavior comparable to step.run():

  • You model each tool call as its own job
  • You maintain state between jobs (DynamoDB/Postgres/etc.)
  • You create a “resume token” so you can restart from the last successful job

You’re essentially building your own durable workflow engine. The work is:

  • Defining job schemas and transitions
  • Handling partial failures and backoffs
  • Coordinating replays and idempotency across multiple services
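To make that scope concrete, here is a minimal sketch of the state machine you would own: hypothetical job records plus a resume lookup. A real version also needs persistence, locking, backoff, and DLQ wiring:

```typescript
// Hand-rolled workflow state for a queue/worker setup: every tool call
// becomes a job row whose transitions you persist yourself. Illustrative only.
type JobStatus = "pending" | "running" | "succeeded" | "failed";

interface Job {
  runId: string;    // the agent run this job belongs to
  stepId: string;   // e.g. "tool-search"
  status: JobStatus;
  attempts: number;
  result?: unknown;
}

// The "resume token" is implicit: the first job that hasn't succeeded yet.
function nextJob(jobs: Job[]): Job | undefined {
  return jobs.find((j) => j.status !== "succeeded");
}

function recordSuccess(job: Job, result: unknown): void {
  job.status = "succeeded";
  job.result = result;
}

const jobs: Job[] = [
  { runId: "run_1", stepId: "plan", status: "succeeded", attempts: 1, result: "p" },
  { runId: "run_1", stepId: "tool-search", status: "failed", attempts: 2 },
  { runId: "run_1", stepId: "summarize", status: "pending", attempts: 0 },
];

// After a crash, the worker resumes at the first non-succeeded job.
const resumeAt = nextJob(jobs)?.stepId;
```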

2. Manual Retries and DLQs

Queues give you basic retries, but:

  • You configure retry counts on the queue and/or worker
  • Failures go to a dead-letter queue (DLQ)
  • Recovery becomes “manually inspect DLQs and replay messages”

For multi-step AI agents, this is brittle:

  • One stuck tool call sends messages to a DLQ that doesn’t know the workflow context
  • Replaying from DLQ re-runs steps that may have already succeeded unless you layer idempotency everywhere

3. Observability is a Build Project

To get a trace of “Agent run 123 called Tools A, B, C with these prompts and got these outputs,” you usually need to:

  • Add trace IDs to messages
  • Ship logs to a central store
  • Build a UI or query pattern to reconstruct runs
  • Integrate Prometheus or Datadog manually

That’s all infrastructure tax, not agent logic.

Decision Trigger

Choose custom queue + workers if:

  • You’re heavily invested in an existing queue/worker stack and have infra capacity to extend it
  • You need very custom behavior that can’t be modeled in a function + steps pattern
  • You’re comfortable owning DLQs, idempotency, and observability for the long term

3. Single Serverless Function (Best for simple, short-lived agents)

A lot of teams start here: a single Lambda/edge/serverless function that:

  • Receives an HTTP request or webhook
  • Calls the LLM and tools inline
  • Returns the final answer

What it does well

  • Easy to prototype
  • Minimal moving parts; deploy once and you’re live
  • Good for simple “one-call” agents (e.g., “call this API and summarize the result”)

Where it breaks for reliable AI agents

1. No Checkpointing

If your function:

  • Times out (long LLM call, slow API)
  • Crashes (runtime error, OOM)
  • Is killed during deploy

…you lose everything. There’s no “resume from step 3 of 7”—you just re-run the whole function, including:

  • Regenerating the plan
  • Re-calling successful tools
  • Risking duplicate side effects (double emails, double writes)

2. Ad-hoc Retries

You can add manual try/catch + setTimeout or re-invoke the function, but:

  • You’re implementing your own backoff logic
  • You still don’t have durability between retries (if the function restarts, your in-memory state is gone)
  • It’s easy to accidentally retry the entire agent instead of a single failed call
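The hand-rolled version tends to look like this sketch. Note that the attempt counter and any partial results live only in this process’s memory, which is exactly the problem:

```typescript
// DIY retry-with-backoff inside a single function. If the process dies
// between attempts, `attempt` and any in-memory state are simply gone.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts: number,
  baseDelayMs: number
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Exponential backoff, tracked only in this process's memory.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}

// Example: a flaky call that succeeds on the third attempt.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error("transient");
  return "ok";
};
```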

3. Limited Observability

You mostly have:

  • Logs (scattered across function invocations)
  • Maybe a tracing integration if you wire it in yourself

You don’t get a first-class view of:

  • Each step’s inputs/outputs
  • Which tool call failed
  • How many retries happened and why

Decision Trigger

Use a single function if:

  • Your agent is essentially a thin wrapper around one tool or one LLM call
  • You’re in early prototype mode and can accept recomputation and occasional partial failures
  • You don’t yet need multi-step, multi-tenant durability

You’ll likely outgrow this as soon as you have more than 2–3 tool calls or multi-tenant usage.


How to Implement a Durable AI Agent Workflow with Inngest

Let’s put it all together in a concrete pattern you can adapt.

1. Define the “run agent” event

Your agent starts from an event—API call, webhook, or schedule:

// Somewhere in your API route:
import { inngest } from "./client";

await inngest.send({
  name: "agent/run",
  data: {
    accountId: "acct_123",
    userId: "user_456",
    query: "Analyze traffic and suggest SEO improvements",
  },
});

Triggers can be:

  • API calls (from your backend or edge functions)
  • Webhooks (e.g., CRM events, GitHub, Stripe)
  • Schedules (e.g., nightly account review agents)
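For the scheduled case, the trigger is a cron expression instead of an event. A sketch of a nightly review agent that fans out one run per account (the cron trigger and `step.sendEvent` are standard Inngest; `listActiveAccounts` is a hypothetical helper):

```typescript
import { inngest } from "./client";

// Hypothetical helper: fetch the accounts to review.
declare function listActiveAccounts(): Promise<string[]>;

// Runs every night at 02:00 UTC; no inbound event payload needed.
export const nightlyReview = inngest.createFunction(
  { id: "nightly-account-review" },
  { cron: "0 2 * * *" },
  async ({ step }) => {
    const accounts = await step.run("list-accounts", async () => {
      return await listActiveAccounts();
    });

    // Fan out: each account gets its own durable agent run.
    for (const accountId of accounts) {
      await step.sendEvent(`enqueue-${accountId}`, {
        name: "agent/run",
        data: { accountId, query: "Nightly account review" },
      });
    }
  }
);
```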

2. Create the durable agent function

import { inngest } from "./client";
import { llm, runTool } from "./agent";

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    concurrency: {
      key: "event.data.accountId",
      limit: 2,
    },
  },
  { event: "agent/run" },
  async ({ event, step }) => {
    const { query, accountId } = event.data;

    // Step 1: Plan
    const plan = await step.run("plan", async () => {
      return await llm.generatePlan({ query, accountId });
    });

    // Step 2: Execute tools with per-tool retries
    const toolResults = [];
    for (const toolCall of plan.toolCalls) {
      const result = await step.run(
        `tool-${toolCall.name}`,
        {
          retry: {
            attempts: 4,
            maxDuration: "15m",
            factor: 2,
            minDelay: "1s",
          },
        },
        async () => {
          return await runTool(toolCall, { accountId });
        }
      );

      toolResults.push({ name: toolCall.name, result });
    }

    // Step 3: Summarize
    const summary = await step.run("summarize", async () => {
      return await llm.summarize({ query, toolResults });
    });

    return { summary, toolResults };
  }
);

Key behaviors:

  • Each tool call is its own Step → retriable, checkpointed
  • If any tool crashes, only that Step is retried
  • If the deploy happens mid-run, Inngest resumes from the last successful Step
  • In Traces, you see: plan → tool-foo → tool-bar → summarize, with all inputs/outputs

3. Run locally with one command

Local dev doesn’t require extra infra:

npx inngest-cli@latest dev

This gives you:

  • Local Inngest Dev Server
  • A UI to inspect runs and steps
  • Automatic reload when you edit functions

Your code remains “just business logic” with step.run() sprinkled in where durability matters.

4. Operate in production

With Inngest Cloud:

  • Deploy your functions from the environments you already use (edge, serverless, traditional)
  • Use Traces to debug and understand agent behavior in real time
  • Use Replay and Bulk Cancellation to recover from incidents (e.g., misconfigured tool, bad model version) without writing custom admin UIs
  • Export metrics to Prometheus/Datadog if you need them, and rely on built-in alerting/retention

All with production guardrails like SOC 2 Type II, SSO/SAML, optional HIPAA BAA, and support for 100K+ executions per second.


Final Verdict

If you want an AI agent that can retry tool calls safely and keep progress even when steps crash, you need durability expressed in code, not scattered across queues, DLQs, and log dashboards.

  • A single serverless function is fast to start but brittle for multi-step agents.
  • A custom queue + worker stack can be made reliable, but you’ll rebuild years of reliability work: retries, checkpointing, observability, replay.
  • Inngest gives you code-level durability via step.run(), instant Traces for every prompt and tool call, and first-class recovery tools so you can query, cancel, or replay without re-implementing infrastructure.

Model your agent as an Inngest Function, wrap each tool call in a Step, and let the platform handle retries and checkpointing so you can focus on the agent’s behavior—not on keeping it alive.


Next Step

Get Started