Durable Workflow Orchestration

How do I build an AI agent workflow that can retry tool calls and keep progress even if a step crashes?

12 min read

Most teams building AI agents run into the same nightmare: a single slow or flaky tool call crashes the run, you lose partial progress, and you’re left piecing together what happened from logs and ad‑hoc trace IDs. That’s fine for a weekend project—not for a production agent coordinating multiple tools and model calls.

In this guide, I’ll walk through how to build an AI agent workflow that:

  • Retries tool calls automatically on failure
  • Checkpoints progress at each step
  • Resumes from the last successful step instead of starting over
  • Survives crashes, timeouts, and deploys without losing context

All using Inngest’s event-driven, durable execution model: inngest.createFunction() + step.run().


Quick Answer:
Use Inngest to model your agent as a durable workflow where each tool call is a step.run() with automatic retries and checkpointing. When a step fails, Inngest retries it with backoff; once a step succeeds, its result is persisted so your agent can resume from that step even if the process crashes or is redeployed.


At-a-Glance Comparison

There are three common ways teams try to get reliability for AI agents:

| Rank | Option | Best For | Primary Strength | Watch Out For |
|------|--------|----------|------------------|---------------|
| 1 | Inngest Durable Agent Workflow | Production multi-step AI agents | Code-level durability with step.run() and Traces | Requires adopting Inngest’s model (functions + steps) |
| 2 | Custom Queue + Workers | Teams with heavy infra investment | Full control over queues/workers | You own retries, idempotency, DLQs, and tooling |
| 3 | Single Lambda / Serverless Function | Simple, short-lived agents | Easy to prototype | Timeouts, no checkpointing, hard-to-debug failures |

Comparison Criteria

We’ll evaluate each approach using three reliability criteria that matter for AI agent workflows:

  • Durability & Checkpointing:
    Can the workflow survive crashes, deploys, and timeouts without losing progress? Can it resume from the last successful tool call?

  • Retries & Idempotency:
    Are retries automatic and safe? Can you avoid duplicate side effects (e.g., sending the same email twice, double-writing to a CRM)?

  • Observability & Recovery:
    When something goes wrong, can you see each tool call’s inputs/outputs, understand what failed, and replay safely—without custom admin tooling?


1. Inngest Durable Agent Workflow (Best overall for production AI agents)

Inngest is the top choice when you need your AI agent to be reliable by default: each step.run() is a durable, retriable unit of work with automatic checkpointing and built-in observability via Traces.

Core Idea: Model each tool call as a Step

Instead of one big “agent” function that calls tools inline, you wrap each call in a Step:

import { inngest } from "./client";

export const agentWorkflow = inngest.createFunction(
  { id: "agent-workflow" },
  { event: "agent/run" },
  async ({ event, step }) => {
    // 1. Plan
    const plan = await step.run("generate-plan", async () => {
      // Call your LLM to generate a plan/tool sequence
      return await llm.generatePlan(event.data.query);
    });

    // 2. Execute tools in sequence
    const toolResults = [];
    for (const toolCall of plan.toolCalls) {
      const result = await step.run(
        `tool-${toolCall.name}`,
        async () => await runTool(toolCall)
      );
      toolResults.push(result);
    }

    // 3. Summarize
    const summary = await step.run("summarize", async () => {
      return await llm.summarize({ query: event.data.query, toolResults });
    });

    return { summary, toolResults };
  }
);

Mechanics → Outcome:

  • Each step.run() is a code-level transaction
  • On failure, Inngest retries the step with backoff
  • On success, Inngest checkpoints the result
  • If your process crashes or you redeploy, the workflow resumes from the last successful step, not from the beginning

What it does well

1. Durability & Checkpointing

Every Step is durable. In practice:

  • If tool-analytics times out on call 7 of 10, only that step is retried
  • Steps 1–6 are preserved; you don’t re-hit those tools or re-run the LLM plan generation
  • If your runtime (edge, serverless, or traditional) restarts, the next run picks up from the failed step

You get this just by using step.run()—no custom state store, no homegrown “resume token” logic.
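Conceptually, the checkpointing model works like a memo table keyed by step ID: results of completed steps are persisted, and a re-run skips straight past them to the first step that hasn’t succeeded. Here’s a toy, in-memory illustration of that idea (not Inngest’s actual implementation, just the mechanic):

```typescript
// Toy model of step checkpointing: completed step results are persisted,
// so a re-run (after a crash) replays from the memo instead of re-executing.
type Checkpoint = Map<string, unknown>;

function makeStepRunner(checkpoint: Checkpoint) {
  return function run<T>(stepId: string, fn: () => T): T {
    if (checkpoint.has(stepId)) {
      // Step already succeeded in a previous attempt: reuse its result.
      return checkpoint.get(stepId) as T;
    }
    const result = fn(); // may throw; nothing is checkpointed on failure
    checkpoint.set(stepId, result); // persist on success
    return result;
  };
}

// First attempt: "plan" succeeds, "tool-search" crashes mid-run.
const checkpoint: Checkpoint = new Map();
let planCalls = 0;
const run = makeStepRunner(checkpoint);
try {
  run("plan", () => { planCalls++; return "the-plan"; });
  run("tool-search", () => { throw new Error("timeout"); });
} catch {}

// Second attempt (after the "crash"): "plan" is NOT re-executed.
const replay = makeStepRunner(checkpoint);
const plan = replay("plan", () => { planCalls++; return "the-plan"; });
```

The real system persists the checkpoint outside your process, which is what lets it survive restarts and deploys.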

2. Automatic Retries with Backoff

Retries are built in at the step level. You can customize behavior per tool:

const result = await step.run(
  "tool-search",
  {
    retry: {
      attempts: 5,
      maxDuration: "10m",
      factor: 2,             // exponential backoff
      minDelay: "2s"
    }
  },
  async () => {
    return await searchTool({ query: event.data.query });
  }
);

Mechanics:

  • Transient failures (timeouts, 500s, network flakiness) are retried
  • Each attempt is tracked in Traces, so you see every retry, not just the final failure
  • Once a step succeeds, its result is checkpointed and memoized for that workflow run, so the step’s code won’t execute again on later retries or resumes
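As a rough mental model of that schedule: with a factor of 2 and a 2s minimum delay, retry delays grow geometrically. This is a simplified sketch of exponential backoff, not Inngest’s exact algorithm (real schedules typically add jitter and cap the total duration):

```typescript
// Simplified exponential backoff: delay(n) = minDelay * factor^n.
// Illustrates the pattern only; the platform's actual schedule may differ.
function backoffDelaysMs(
  retries: number,
  minDelayMs: number,
  factor: number
): number[] {
  return Array.from({ length: retries }, (_, n) => minDelayMs * factor ** n);
}

// 4 retries, 2s minimum delay, factor 2 → waits of 2s, 4s, 8s, 16s.
const delays = backoffDelaysMs(4, 2_000, 2);
```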

3. Idempotency Without Reinventing It

Because Inngest checkpointing ensures a successful step won’t re-run, a lot of idempotency pain vanishes:

  • If you send an email or write to a CRM in a Step, it won’t run again after success—even if the worker crashes later
  • If you need extra safety, you can still add idempotency keys at the tool layer, but you’re not relying on them for basic resilience
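If a tool has side effects you absolutely cannot duplicate, tool-layer idempotency looks like this sketch. The `sendEmail` function and the key scheme are hypothetical, and the store is in-memory; production would use Redis or Postgres with a TTL:

```typescript
// Hypothetical tool-layer idempotency: skip side effects already performed
// for the same (run, step) key, even if the caller retries.
const performed = new Set<string>();
let emailsSent = 0;

function sendEmail(to: string, body: string): void {
  emailsSent++; // stand-in for a real email API call
}

function sendEmailIdempotent(key: string, to: string, body: string): boolean {
  if (performed.has(key)) return false; // already sent; do nothing
  sendEmail(to, body);
  performed.add(key);
  return true;
}

// Same key twice → only one email actually goes out.
sendEmailIdempotent("run_123:notify-user", "a@example.com", "done");
sendEmailIdempotent("run_123:notify-user", "a@example.com", "done");
```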

4. Observability & Recovery with Traces and Replay

Inngest Cloud gives you instant Traces for each agent run:

  • Step-by-step view of the workflow
  • Step inputs/outputs (including prompts and LLM responses for AI workflows)
  • Retry attempts and error messages
  • Duration per step

When something goes wrong, you can:

  • Query: filter runs by tenant, event data, error message, or step
  • Cancel: stop noisy or stuck runs in bulk
  • Replay: rerun failed workflows from the last successful checkpoint—no bespoke admin panel

For AI agents, this is especially valuable: you can inspect every prompt / response pair and see which tool call caused the issue.

5. Flow Control for Multi-Tenant Agents

If your agent runs per tenant or per user, multi-tenant flow control matters:

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    concurrency: {
      key: "event.data.accountId", // per-tenant
      limit: 3                     // max 3 runs per tenant
    }
  },
  { event: "agent/run" },
  async ({ event, step }) => { /* ... */ }
);

Mechanics → Outcome:

  • Concurrency keys prevent noisy neighbors: one tenant’s heavy load won’t starve others
  • Throttling and rate limits let you respect external API quotas without rewriting the agent logic
  • Prioritization lets you make some runs (e.g., user-facing) preempt background ones
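Throttling and prioritization are likewise declared in the function config rather than in agent logic. A sketch of what that might look like (check Inngest’s flow-control docs for the exact option shapes; the `interactive` field is a hypothetical event property):

```typescript
import { inngest } from "./client";

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    // Per-tenant fairness: at most 3 concurrent runs per account.
    concurrency: { key: "event.data.accountId", limit: 3 },
    // Smooth bursts so downstream APIs see at most 10 starts per minute.
    throttle: { limit: 10, period: "1m" },
    // Boost user-facing runs ahead of background ones; the expression
    // returns a priority offset in seconds.
    priority: { run: "event.data.interactive ? 120 : 0" },
  },
  { event: "agent/run" },
  async ({ event, step }) => { /* ... */ }
);
```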

Tradeoffs & Limitations

  • You adopt the Inngest model: functions + events + steps. That’s a shift from “just drop everything in one Lambda.”
  • You still design your agent architecture (planning, tool selection, memory). Inngest doesn’t decide how your agent reasons—it ensures those decisions execute reliably.
  • For extremely simple, single-call agents, this can feel like “more structure than you need” (though you’ll usually grow into needing it).

Decision Trigger

Choose the Inngest durable agent workflow if:

  • Your agent makes multiple tool calls or LLM calls per request
  • You care about not losing partial progress when something crashes
  • You want Retries, Checkpointing, and Traces out of the box and don’t want to rebuild workers, queues, and recovery tooling

2. Custom Queue + Workers (Best for teams that want full infra control)

The classic approach: you wire up a queue (SQS, RabbitMQ, Kafka) plus workers (Lambda, containers, K8s) and encode your agent workflow in jobs and messages.

What it does well

  • Full control over how jobs are scheduled, retried, and routed
  • Can be tuned precisely to your infra (custom backoff, DLQ routing, prefetch, etc.)
  • Works well if you already have a mature queue/worker stack and an SRE team maintaining it

Where it hurts AI agent workflows

1. You’re rebuilding Steps and Checkpointing

To get behavior comparable to step.run():

  • You model each tool call as its own job
  • You maintain state between jobs (DynamoDB/Postgres/etc.)
  • You create a “resume token” so you can restart from the last successful job

You’re essentially building your own durable workflow engine. The work is:

  • Defining job schemas and transitions
  • Handling partial failures and backoffs
  • Coordinating replays and idempotency across multiple services
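To make that scope concrete, here is a minimal sketch of the state machine you would own: hypothetical job records plus a resume lookup. A real version also needs persistence, locking, backoff, and DLQ wiring:

```typescript
// Hand-rolled workflow state for a queue/worker setup: every tool call
// becomes a job row whose transitions you persist yourself. Illustrative only.
type JobStatus = "pending" | "running" | "succeeded" | "failed";

interface Job {
  runId: string;    // the agent run this job belongs to
  stepId: string;   // e.g. "tool-search"
  status: JobStatus;
  attempts: number;
  result?: unknown;
}

// The "resume token" is implicit: the first job that hasn't succeeded yet.
function nextJob(jobs: Job[]): Job | undefined {
  return jobs.find((j) => j.status !== "succeeded");
}

function recordSuccess(job: Job, result: unknown): void {
  job.status = "succeeded";
  job.result = result;
}

const jobs: Job[] = [
  { runId: "run_1", stepId: "plan", status: "succeeded", attempts: 1, result: "p" },
  { runId: "run_1", stepId: "tool-search", status: "failed", attempts: 2 },
  { runId: "run_1", stepId: "summarize", status: "pending", attempts: 0 },
];

// After a crash, the worker resumes at the first non-succeeded job.
const resumeAt = nextJob(jobs)?.stepId;
```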

2. Manual Retries and DLQs

Queues give you basic retries, but:

  • You configure retry counts on the queue and/or worker
  • Failures go to a dead-letter queue (DLQ)
  • Recovery becomes “manually inspect DLQs and replay messages”

For multi-step AI agents, this is brittle:

  • One stuck tool call sends messages to a DLQ that doesn’t know the workflow context
  • Replaying from DLQ re-runs steps that may have already succeeded unless you layer idempotency everywhere

3. Observability is a Build Project

To get a trace of “Agent run 123 called Tools A, B, C with these prompts and got these outputs,” you usually need to:

  • Add trace IDs to messages
  • Ship logs to a central store
  • Build a UI or query pattern to reconstruct runs
  • Integrate Prometheus or Datadog manually

That’s all infrastructure tax, not agent logic.

Decision Trigger

Choose custom queue + workers if:

  • You’re heavily invested in an existing queue/worker stack and have infra capacity to extend it
  • You need very custom behavior that can’t be modeled in a function + steps pattern
  • You’re comfortable owning DLQs, idempotency, and observability for the long term

3. Single Serverless Function (Best for simple, short-lived agents)

A lot of teams start here: a single Lambda/edge/serverless function that:

  • Receives an HTTP request or webhook
  • Calls the LLM and tools inline
  • Returns the final answer

What it does well

  • Easy to prototype
  • Minimal moving parts; deploy once and you’re live
  • Good for simple “one-call” agents (e.g., “call this API and summarize the result”)

Where it breaks for reliable AI agents

1. No Checkpointing

If your function:

  • Times out (long LLM call, slow API)
  • Crashes (runtime error, OOM)
  • Is killed during deploy

…you lose everything. There’s no “resume from step 3 of 7”—you just re-run the whole function, including:

  • Regenerating the plan
  • Re-calling successful tools
  • Risking duplicate side effects (double emails, double writes)

2. Ad-hoc Retries

You can add manual try/catch + setTimeout or re-invoke the function, but:

  • You’re implementing your own backoff logic
  • You still don’t have durability between retries (if the function restarts, your in-memory state is gone)
  • It’s easy to accidentally retry the entire agent instead of a single failed call
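The hand-rolled version tends to look like this sketch. Note that the attempt counter and any partial results live only in this process’s memory, which is exactly the problem:

```typescript
// DIY retry-with-backoff inside a single function. If the process dies
// between attempts, `attempt` and any in-memory state are simply gone.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts: number,
  baseDelayMs: number
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Exponential backoff, tracked only in this process's memory.
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}

// Example: a flaky call that succeeds on the third attempt.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error("transient");
  return "ok";
};
```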

3. Limited Observability

You mostly have:

  • Logs (scattered across function invocations)
  • Maybe a tracing integration if you wire it in yourself

You don’t get a first-class view of:

  • Each step’s inputs/outputs
  • Which tool call failed
  • How many retries happened and why

Decision Trigger

Use a single function if:

  • Your agent is essentially a thin wrapper around one tool or one LLM call
  • You’re in early prototype mode and can accept recomputation and occasional partial failures
  • You don’t yet need multi-step, multi-tenant durability

You’ll likely outgrow this as soon as you have more than 2–3 tool calls or multi-tenant usage.


How to Implement a Durable AI Agent Workflow with Inngest

Let’s put it all together in a concrete pattern you can adapt.

1. Define the “run agent” event

Your agent starts from an event—API call, webhook, or schedule:

// Somewhere in your API route:
import { inngest } from "./client";

await inngest.send({
  name: "agent/run",
  data: {
    accountId: "acct_123",
    userId: "user_456",
    query: "Analyze traffic and suggest SEO improvements",
  },
});

Triggers can be:

  • API calls (from your backend or edge functions)
  • Webhooks (e.g., CRM events, GitHub, Stripe)
  • Schedules (e.g., nightly account review agents)
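For the scheduled case, the trigger is a cron expression instead of an event. A sketch of a nightly review agent that fans out one run per account (the cron trigger and `step.sendEvent` are standard Inngest; `listActiveAccounts` is a hypothetical helper):

```typescript
import { inngest } from "./client";

// Hypothetical helper: fetch the accounts to review.
declare function listActiveAccounts(): Promise<string[]>;

// Runs every night at 02:00 UTC; no inbound event payload needed.
export const nightlyReview = inngest.createFunction(
  { id: "nightly-account-review" },
  { cron: "0 2 * * *" },
  async ({ step }) => {
    const accounts = await step.run("list-accounts", async () => {
      return await listActiveAccounts();
    });

    // Fan out: each account gets its own durable agent run.
    for (const accountId of accounts) {
      await step.sendEvent(`enqueue-${accountId}`, {
        name: "agent/run",
        data: { accountId, query: "Nightly account review" },
      });
    }
  }
);
```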

2. Create the durable agent function

import { inngest } from "./client";
import { llm, runTool } from "./agent";

export const agentWorkflow = inngest.createFunction(
  {
    id: "agent-workflow",
    concurrency: {
      key: "event.data.accountId",
      limit: 2,
    },
  },
  { event: "agent/run" },
  async ({ event, step }) => {
    const { query, accountId } = event.data;

    // Step 1: Plan
    const plan = await step.run("plan", async () => {
      return await llm.generatePlan({ query, accountId });
    });

    // Step 2: Execute tools with per-tool retries
    const toolResults = [];
    for (const toolCall of plan.toolCalls) {
      const result = await step.run(
        `tool-${toolCall.name}`,
        {
          retry: {
            attempts: 4,
            maxDuration: "15m",
            factor: 2,
            minDelay: "1s",
          },
        },
        async () => {
          return await runTool(toolCall, { accountId });
        }
      );

      toolResults.push({ name: toolCall.name, result });
    }

    // Step 3: Summarize
    const summary = await step.run("summarize", async () => {
      return await llm.summarize({ query, toolResults });
    });

    return { summary, toolResults };
  }
);

Key behaviors:

  • Each tool call is its own Step → retriable, checkpointed
  • If any tool crashes, only that Step is retried
  • If the deploy happens mid-run, Inngest resumes from the last successful Step
  • In Traces, you see: plan → tool-foo → tool-bar → summarize, with all inputs/outputs

3. Run locally with one command

Local dev doesn’t require extra infra:

npx inngest-cli@latest dev

This gives you:

  • Local Inngest Dev Server
  • A UI to inspect runs and steps
  • Automatic reload when you edit functions

Your code remains “just business logic” with step.run() sprinkled in where durability matters.

4. Operate in production

With Inngest Cloud:

  • Deploy your functions from the environments you already use (edge, serverless, traditional)
  • Use Traces to debug and understand agent behavior in real time
  • Use Replay and Bulk Cancellation to recover from incidents (e.g., misconfigured tool, bad model version) without writing custom admin UIs
  • Export metrics to Prometheus/Datadog if you need them, and rely on built-in alerting/retention

All with production guardrails like SOC 2 Type II, SSO/SAML, optional HIPAA BAA, and support for 100K+ executions per second.


Final Verdict

If you want an AI agent that can retry tool calls safely and keep progress even when steps crash, you need durability expressed in code, not scattered across queues, DLQs, and log dashboards.

  • A single serverless function is fast to start but brittle for multi-step agents.
  • A custom queue + worker stack can be made reliable, but you’ll rebuild years of reliability work: retries, checkpointing, observability, replay.
  • Inngest gives you code-level durability via step.run(), instant Traces for every prompt and tool call, and first-class recovery tools so you can query, cancel, or replay without re-implementing infrastructure.

Model your agent as an Inngest Function, wrap each tool call in a Step, and let the platform handle retries and checkpointing so you can focus on the agent’s behavior—not on keeping it alive.


Next Step

Get Started