
Serverless workflow orchestration: Step Functions vs third-party tools for teams that don’t want AWS lock-in
Most teams land on AWS Step Functions for one reason: they’re already deep in Lambda, SQS, and EventBridge, and they want something to glue it all together. But if you’re worried about AWS lock-in, or you already run workloads across edge, serverless, and Kubernetes, the tradeoffs get sharp very quickly.
This guide walks through how Step Functions compares to third‑party workflow tools (with Inngest as a concrete reference point) for serverless workflow orchestration, specifically for teams that want cloud flexibility without rebuilding queue infrastructure.
Quick Answer: The best overall choice for serverless workflow orchestration without hard AWS lock-in is Inngest. If your priority is “stay inside AWS and use managed services only,” AWS Step Functions is often a stronger fit. For teams who want visual-first flow modeling and don’t mind heavier infra, consider a general-purpose workflow engine (e.g., Temporal‑style systems).
At-a-Glance Comparison
| Rank | Option | Best For | Primary Strength | Watch Out For |
|---|---|---|---|---|
| 1 | Inngest | Teams avoiding cloud lock-in, multi-tenant SaaS, AI/agent workflows | Code-level durability with no workers/queues to manage | Adds an external control plane vs “all-in AWS” |
| 2 | AWS Step Functions | All-in AWS teams orchestrating Lambda-centric workloads | Deep AWS integration and fully managed scaling | Strong AWS lock-in, JSON DSL, per-state costs |
| 3 | General-purpose workflow engine (e.g., Temporal-style) | Platform teams standardizing orchestration across many apps | Extreme flexibility and self-host control | You own clusters, workers, scaling, and ops overhead |
Comparison Criteria
We evaluated each option against these criteria, which map to the real pain I’ve seen in production:
- Lock-in vs portability: Can you move or extend workloads across providers (Vercel/Netlify/Fly.io/Kubernetes/on‑prem) without rewriting your orchestration layer?
- Developer ergonomics & durability in code: Can you express retries, idempotency, and checkpointing as native code primitives, or are they hidden in YAML/JSON and config screens?
- Operational burden & observability: How much “queue stack” must you maintain (workers, DLQs, rate limits), and how easily can you trace, debug, and replay workflows in production?
Detailed Breakdown
1. Inngest (Best overall for lock-in‑free, code-first durability)
Inngest ranks as the top choice because it gives you Step Functions‑grade durability with far less infrastructure—and without tying the orchestration plane to a single cloud.
At its core, you write workflows with inngest.createFunction() and step.run() in your own codebase. Each step is automatically durable: retried on failure, checkpointed on success, and resumed from the last good step instead of starting from scratch.
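To make "resumed from the last good step" concrete, here is a minimal in-memory sketch of step checkpointing. It is a toy model under stated assumptions, not how Inngest is implemented: `StepRunner`, `Checkpoints`, and `runWithRetries` are hypothetical names, and a real platform would persist checkpoints outside the process.

```typescript
// Illustrative sketch only: a tiny in-memory model of step checkpointing.
// StepRunner, Checkpoints, and runWithRetries are hypothetical names,
// not Inngest APIs.
type Checkpoints = Map<string, unknown>;

class StepRunner {
  private readonly checkpoints: Checkpoints;

  constructor(checkpoints: Checkpoints) {
    this.checkpoints = checkpoints;
  }

  // Run `fn` once per step id; later attempts reuse the saved result.
  run<T>(id: string, fn: () => T): T {
    if (this.checkpoints.has(id)) {
      return this.checkpoints.get(id) as T;
    }
    const result = fn(); // may throw; nothing is saved in that case
    this.checkpoints.set(id, result);
    return result;
  }
}

// A workflow whose second step fails on its first attempt.
const calls = { load: 0, sync: 0 };

function workflow(step: StepRunner): string {
  const user = step.run("load-user", () => {
    calls.load += 1;
    return { id: "u_1", email: "a@example.com" };
  });
  return step.run("sync-to-crm", () => {
    calls.sync += 1;
    if (calls.sync === 1) throw new Error("CRM unavailable");
    return `synced:${user.id}`;
  });
}

// Re-run the whole workflow against the same checkpoint store.
function runWithRetries(store: Checkpoints, maxAttempts: number): string {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return workflow(new StepRunner(store));
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

const outcome = runWithRetries(new Map(), 3);
```

In this run, the first step executes once and the failing second step executes twice: the retry skips completed work instead of starting from scratch.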
What it does well
- Code-level durability, not JSON state machines
With Inngest, your workflow is your code:

    import { inngest } from "./client";

    // fetchUserFromDB, upsertUserInCRM, and sendWelcomeEmail are application
    // helpers assumed to be defined elsewhere in your codebase.
    export const syncUser = inngest.createFunction(
      { id: "sync-user" },
      { event: "user/created" },
      async ({ event, step }) => {
        // Each step.run() result is checkpointed before the next step starts.
        const user = await step.run("load-user", async () => {
          return fetchUserFromDB(event.data.userId);
        });

        const crmUser = await step.run("sync-to-crm", async () => {
          return upsertUserInCRM(user);
        });

        await step.run("notify", async () => {
          await sendWelcomeEmail(crmUser.email);
        });
      }
    );

Each `step.run()` is a durable unit of work: it retries automatically on failure and won’t re-run once it succeeds, so completed work is not repeated when a later step fails. You get this without hand-rolling checkpoint logic. One caveat: a step that fails partway through can still repeat its side effects on retry, so calls to external systems inside a step should be safe to retry.
- Infraless by design
You don’t run workers, queues, or schedulers:
- No SQS/SNS/Step Functions stack to wire.
- No DLQ pipelines to maintain.
- No custom retry semantics scattered across services.
Local dev is a single command:

    npx --ignore-scripts=false inngest-cli dev

Then you deploy your app wherever you want: edge, serverless (e.g., Vercel, Netlify, AWS Lambda), or traditional servers/Kubernetes, while Inngest provides the durable control plane.
- Agnostic execution and triggers
Inngest is built to run “wherever your code lives”:
- Environments: edge, serverless, traditional VMs or containers, Kubernetes.
- Triggers: API calls, webhooks, schedules, and arbitrary events.
That means you can start inside AWS and later move parts of your system to another provider without rewriting your orchestration logic.
- Flow control for multi-tenant systems
As a former multi‑tenant SaaS engineer, this is where I see teams re‑invent the same components: per‑tenant rate limits, concurrency caps, and prioritization.
Inngest bakes this in as Flow Control:
- Multi‑tenant concurrency keys to prevent noisy neighbors.
- Throttling and prioritization without rewriting business logic.
- Batching when you want to smooth load for downstream APIs.
Instead of designing a queue topology every time a new tenant lands a big data import, you describe constraints once and let the platform enforce them.
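As a rough illustration of what a per-tenant concurrency cap does, the sketch below admits jobs from a shared queue while holding each tenant to a limit. It shows the concept only, not Inngest's Flow Control implementation: `Job`, `admit`, and the limits used are hypothetical.

```typescript
// Illustrative sketch only: per-tenant concurrency caps over a shared queue.
// Job, admit, and the limits below are hypothetical, not a platform API.
interface Job {
  id: string;
  tenantId: string;
}

// Pick the jobs that may start now, given what is already running.
// Jobs from a tenant at its cap are skipped rather than dropped, so a
// large backlog from one tenant cannot block the others.
function admit(
  queue: Job[],
  running: Job[],
  perTenantLimit: number,
  globalLimit: number,
): Job[] {
  const inFlight = new Map<string, number>();
  for (const job of running) {
    inFlight.set(job.tenantId, (inFlight.get(job.tenantId) ?? 0) + 1);
  }

  const admitted: Job[] = [];
  let total = running.length;
  for (const job of queue) {
    if (total >= globalLimit) break;
    const count = inFlight.get(job.tenantId) ?? 0;
    if (count >= perTenantLimit) continue; // tenant at cap: leave queued
    inFlight.set(job.tenantId, count + 1);
    admitted.push(job);
    total += 1;
  }
  return admitted;
}

// Tenant "a" floods the queue; tenant "b" has one job at the back.
const pending: Job[] = [
  { id: "a1", tenantId: "a" },
  { id: "a2", tenantId: "a" },
  { id: "a3", tenantId: "a" },
  { id: "b1", tenantId: "b" },
];
const started = admit(pending, [], 2, 10).map((job) => job.id);
```

With a per-tenant limit of 2, tenant "a" gets two slots and tenant "b" still starts immediately, which is the noisy-neighbor protection described above.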
- Observable by default: Traces, structured logs, Replay
Every run has step-level Traces:
- Inputs and outputs per `step.run()`.
- Structured logs, not “print and grep.”
- Real-time traces, including “every prompt / response pair” for AI flows.
- Query, cancel, or replay runs directly in the UI.
When something fails, you don’t assemble context from CloudWatch, X-Ray, and random log streams—you open a Trace, inspect each step, and either replay a single run or bulk‑replay a segment of traffic.
In Inngest Cloud, you also get:
- Metrics and alerting.
- Trace retention by plan.
- Optional exports to Prometheus and Datadog.
- Enterprise-grade without building it yourself
Inngest is trusted in production at Replit, SoundCloud, Cohere, TripAdvisor, Resend, GitBook, and more. For teams with stricter security and compliance requirements, you get:
- SOC 2 Type II.
- End‑to‑end encryption middleware.
- SSO & SAML.
- HIPAA BAA availability.
- Vendor-stated scale of 100K+ executions per second, with low-latency execution.
Tradeoffs & Limitations
- External control plane vs “everything in AWS”
If your security posture or org structure requires all control planes to live inside AWS, bringing in a third‑party orchestration platform is an additional vendor and trust surface to manage.
Practically, you’ll still run your own code in your environment, but you’ll rely on Inngest’s cloud to coordinate runs, Traces, and control operations (cancel/replay).
- DX is code-first, not diagram-first
Inngest favors code and Traces over drag‑and‑drop visual designers. If your team is primarily non‑coding stakeholders who want to rewire flows visually, a visual‑native engine may feel more familiar (although most “visual” tools still require code for real complexity).
Decision Trigger
Choose Inngest if you want durable serverless workflows that:
- Run across edge/serverless/traditional environments.
- Express retries, checkpointing, and idempotency in code (`step.run()`).
- Avoid AWS lock-in while still integrating deeply with AWS when you need it.
- Give you built‑in multi‑tenant Flow Control and production‑grade Traces/Replay without building your own internal tooling.
2. AWS Step Functions (Best for all‑in AWS teams)
AWS Step Functions is the strongest fit when your priority is to stay fully within AWS and orchestrate Lambda‑centric or AWS‑service‑centric workflows with a managed service.
You describe your flow in Amazon States Language (a JSON DSL), wire it to Lambdas and other AWS services, and you’re done—no separate control plane to run.
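For comparison with the code-first approach, here is a minimal sketch of an Amazon States Language definition for a similar three-step flow. It is written as a TypeScript object so it can be serialized to JSON. The Lambda ARNs, account ID, and state names are placeholders invented for this example.

```typescript
// Illustrative sketch: a minimal Amazon States Language definition, written
// as a TypeScript object so it can be serialized to JSON for deployment.
// The Lambda ARNs, account ID, and state names are placeholders.
const syncUserStateMachine = {
  Comment: "Load a user, sync to a CRM, then send a welcome email",
  StartAt: "LoadUser",
  States: {
    LoadUser: {
      Type: "Task",
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:load-user",
      Next: "SyncToCrm",
    },
    SyncToCrm: {
      Type: "Task",
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:sync-to-crm",
      // The retry policy lives in the state definition, not in the Lambda code.
      Retry: [
        {
          ErrorEquals: ["States.TaskFailed"],
          IntervalSeconds: 2,
          MaxAttempts: 3,
          BackoffRate: 2.0,
        },
      ],
      Next: "Notify",
    },
    Notify: {
      Type: "Task",
      Resource: "arn:aws:lambda:us-east-1:123456789012:function:notify",
      End: true,
    },
  },
};

// The JSON document that would be supplied as the state machine definition.
const definitionJson = JSON.stringify(syncUserStateMachine, null, 2);
```

Note how the ordering (`Next`, `End`) and the retry policy sit in the definition while the business logic sits in three separate functions. That split is the source of the drift discussed under the tradeoffs.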
What it does well
- Deep AWS integration
Step Functions integrates natively with:
- Lambda, ECS, Batch, DynamoDB, SQS/SNS, EventBridge, and more.
- Service integrations like calling AWS APIs directly from the state machine.
- IAM for access control across all those services.
If your entire stack is AWS and you don’t expect that to change, having orchestration live right next to your compute and storage can be operationally simple.
- Managed scaling and reliability
AWS handles:
- Horizontal scaling of state machine executions.
- Durable state storage.
- Retries and backoff policies at the state level.
- Integrations with CloudWatch Logs, CloudWatch Metrics, and X-Ray.
You don’t think about provisioning workers or tuning worker pools; you pay per state transition with Standard workflows, or per request and duration with Express workflows.
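The state-level retry and backoff policies mentioned above follow simple arithmetic: the first retry waits `IntervalSeconds`, and each later retry multiplies the previous wait by `BackoffRate`. The sketch below computes that schedule. `retryDelays` is a hypothetical helper, and it ignores optional settings such as a maximum delay or jitter.

```typescript
// Illustrative sketch: the wait before each retry for a state-level Retry
// policy with the given IntervalSeconds, MaxAttempts, and BackoffRate.
// retryDelays is a hypothetical helper; delay caps and jitter are ignored.
function retryDelays(
  intervalSeconds: number,
  maxAttempts: number,
  backoffRate: number,
): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < maxAttempts; retry++) {
    // Retry 0 waits intervalSeconds; each later retry multiplies by backoffRate.
    delays.push(intervalSeconds * Math.pow(backoffRate, retry));
  }
  return delays;
}

const delays = retryDelays(2, 3, 2); // [2, 4, 8]
const totalWaitSeconds = delays.reduce((sum, d) => sum + d, 0); // 14
```

A policy of 2 seconds, 3 attempts, and a rate of 2 adds up to 14 seconds of waiting before the state finally fails, which is worth knowing when you size timeouts around it.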
- A visual console for high-level flows
The Step Functions console lets you see execution graphs, state transitions, and error paths in a visual view. For high‑level system diagrams, this can be useful, especially when onboarding new engineers.
Tradeoffs & Limitations
- Strong AWS lock-in
This is the big one for teams reading an article like this.
- Workflows are defined in an AWS‑specific JSON DSL.
- Observability is tied to CloudWatch, X‑Ray, and AWS tooling.
- Moving to another cloud or to hybrid/edge setups usually means a full rewrite of both your orchestration layer and often your execution environment (e.g., moving off Lambda).
If you eventually want to run parts of your stack on Vercel, Fly.io, or Kubernetes outside AWS, Step Functions doesn’t come with you.
- JSON DSL vs native code
Writing and maintaining Amazon States Language is:
- Verbose and harder to refactor than real code.
- Awkward to review in PRs once flows become non‑trivial.
- Easy to drift from the business logic living in your Lambdas or containers.
You end up splitting durability logic between state machine definitions and application code, which complicates debugging and versioning.
- Per-state pricing and noisy neighbors
With Standard workflows, you pay per state transition. In high-volume systems, this can:
- Incentivize you to pack logic into fewer, bigger states (reducing visibility and composability).
- Create surprising bills when certain tenants or workflows explode in usage.
Handling multi‑tenant throttling and per‑tenant concurrency is also non‑trivial: you often fall back to separate queues, per‑tenant “sharding,” or Lambda‑side logic to protect shared downstream services.
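To see why per-transition pricing pushes teams toward fewer, bigger states, a back-of-the-envelope calculation helps. The rate below is an assumption used only for the arithmetic; check current AWS pricing for your region and workflow type. `monthlyTransitionCost` and the traffic numbers are hypothetical.

```typescript
// Illustrative arithmetic only. The rate below is an assumption for the
// example; check current AWS pricing for your region and workflow type.
interface WorkflowProfile {
  executionsPerMonth: number;
  transitionsPerExecution: number;
}

function monthlyTransitionCost(
  profile: WorkflowProfile,
  pricePerThousandTransitions: number,
): number {
  const transitions =
    profile.executionsPerMonth * profile.transitionsPerExecution;
  return (transitions / 1000) * pricePerThousandTransitions;
}

const assumedRate = 0.025; // USD per 1,000 transitions (assumed)

// Same traffic, modeled with fine-grained states vs. packed states.
const fineGrainedCost = monthlyTransitionCost(
  { executionsPerMonth: 10_000_000, transitionsPerExecution: 12 },
  assumedRate,
);
const packedCost = monthlyTransitionCost(
  { executionsPerMonth: 10_000_000, transitionsPerExecution: 4 },
  assumedRate,
);
```

Under these assumptions the 12-transition design costs about three times as much as the 4-transition one for the same traffic, which is the incentive to trade visibility for a smaller bill.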
- Observability glued from multiple AWS tools
Step Functions’ visual execution charts help, but:
- Full context still involves CloudWatch Logs, metrics, and X‑Ray traces.
- You’ll often stitch an incident together from multiple services’ logs.
- Bulk replaying failed workflows or repairing partial state typically requires custom scripts and ad‑hoc tooling.
This is where I’ve seen teams spiral into “log-grepping across systems” after multi‑step failures.
Decision Trigger
Choose AWS Step Functions if you:
- Are committed to AWS as your primary and long‑term cloud.
- Want a fully managed, AWS‑native orchestrator for Lambda and AWS services.
- Prefer keeping orchestration, logs, metrics, and IAM all inside one ecosystem, even if it makes future portability more expensive.
3. General-purpose workflow engine (Best for platform teams owning infra)
By “general-purpose workflow engine,” I mean self‑hosted or heavy‑weight systems in the Temporal/Cadence/Camunda category—engines you run yourself on Kubernetes or VMs to coordinate arbitrarily complex flows.
These stand out when you need extreme control and are willing to own the infrastructure.
What it does well
- Maximum flexibility and language options
These engines often support:
- Multiple SDKs: Go, Java, TypeScript, Python, etc.
- Long‑running workflows that can pause and resume over months.
- Complex saga patterns, compensation, and custom retry logic.
- Separating orchestration logic from business logic across many services.
If you think in terms of orchestration as a “shared platform” for dozens of teams, this can be the right abstraction.
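The saga and compensation patterns mentioned above can be sketched in a few lines. This is a generic, synchronous illustration that is not tied to any engine's SDK: `SagaStep`, `runSaga`, and the step names are hypothetical, and a real engine would persist progress between steps.

```typescript
// Illustrative sketch of the saga pattern; not tied to any engine's SDK.
// SagaStep, runSaga, and the step names are hypothetical.
interface SagaStep {
  name: string;
  action: () => void; // forward operation, may throw
  compensate: () => void; // undo for a completed action
}

// Run steps in order. If one fails, undo the completed steps in reverse
// order and report failure.
function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`undone:${done.name}`);
      }
      return false;
    }
  }
  return true;
}

const sagaLog: string[] = [];
const sagaSucceeded = runSaga(
  [
    { name: "reserve-inventory", action: () => {}, compensate: () => {} },
    { name: "charge-card", action: () => {}, compensate: () => {} },
    {
      name: "book-shipping",
      action: () => {
        throw new Error("carrier down");
      },
      compensate: () => {},
    },
  ],
  sagaLog,
);
```

Here the third step fails, so the card charge and the inventory reservation are undone in reverse order. That is the cleanup you would otherwise reconstruct by hand after a partial failure.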
- Self-host control plane
You run the control plane yourself:
- Full control over scaling, topology, and deployment.
- Data residency and compliance handled in your own infra.
- Potential to run inside any cloud or on‑prem.
For some regulated environments, this is a requirement.
Tradeoffs & Limitations
- You inherit the queue stack
Using a self‑managed engine means:
- Running clusters (often Kubernetes) plus worker fleets.
- Tuning capacity, autoscaling policies, and shard allocations.
- Operating internal queues, DLQs, and recovery processes.
- Managing upgrades and migrations of the workflow engine itself.
From experience, this becomes a platform team in its own right—great when staffed, painful when it’s “some infra engineer’s side job.”
- Higher adoption cost for product teams
For app teams:
- Onboarding is slower; the platform is powerful but complex.
- Local dev setup often requires Docker/kind/minikube stacks.
- Observability depends on how you wire logging, metrics, and tracing—in other words, you’re building your own Traces/Replay surfaces.
You’ll get power users on platform/infrastructure, but not every feature team wants to think at that level of abstraction.
- Visuals and tooling vary widely
Some engines include visual consoles; others require third-party tools. Either way, you’ll often have to:
- Integrate logging and traces into your existing stack (Datadog, Prometheus, etc.).
- Build custom admin tools for canceling, replaying, and inspecting runs.
Decision Trigger
Choose a general-purpose workflow engine if:
- You have a dedicated platform team and want orchestration as a shared internal product.
- You must self‑host the control plane for compliance or strategic reasons.
- You’re willing to absorb significant operational complexity in exchange for full control.
Step Functions vs third‑party tools: how to decide
If you strip away logos and clouds, the choice boils down to three questions:
- Are you comfortable locking orchestration to AWS for the next 3–5 years?
  - Yes, we’re all-in AWS: Step Functions is a solid, managed default.
  - No, we want cloud flexibility / edge / multi-cloud: Prefer Inngest or a self-hosted engine.
- Do you want durability expressed in JSON/YAML config or in code?
  - Config & diagrams feel natural: Step Functions or visual engines can work, but expect some drift between diagrams and real business logic.
  - We want each unit of work as a named step in code: Inngest’s `step.run()` model makes retries and checkpointing explicit and reviewable in PRs.
- How much infrastructure are you willing to run?
  - Minimum infra, managed platform: Inngest and Step Functions both avoid running your own orchestration clusters, but Inngest keeps your app portable across clouds.
  - We’re okay owning clusters and workers: A general-purpose engine gives you maximal control at the cost of more operational toil.
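One possible way to encode the three questions above is a simple tally: each answer adds a point to the options it favors. This is only a sketch of the reasoning, not a scoring method the comparison depends on. `Answers`, `scoreOptions`, and the equal weighting are assumptions made for illustration.

```typescript
// Illustrative sketch: tally which options each answer favors.
// Answers, scoreOptions, and the equal weighting are assumptions.
type Option =
  | "Inngest"
  | "AWS Step Functions"
  | "General-purpose engine";

interface Answers {
  allInOnAws: boolean; // question 1
  durabilityInCode: boolean; // question 2
  willRunClusters: boolean; // question 3
}

function scoreOptions(answers: Answers): Record<Option, number> {
  const score: Record<Option, number> = {
    "Inngest": 0,
    "AWS Step Functions": 0,
    "General-purpose engine": 0,
  };

  // Question 1: lock-in tolerance.
  if (answers.allInOnAws) {
    score["AWS Step Functions"] += 1;
  } else {
    score["Inngest"] += 1;
    score["General-purpose engine"] += 1;
  }

  // Question 2: durability in code vs. config and diagrams.
  if (answers.durabilityInCode) {
    score["Inngest"] += 1;
  } else {
    score["AWS Step Functions"] += 1;
    score["General-purpose engine"] += 1;
  }

  // Question 3: appetite for running infrastructure.
  if (answers.willRunClusters) {
    score["General-purpose engine"] += 1;
  } else {
    score["Inngest"] += 1;
    score["AWS Step Functions"] += 1;
  }

  return score;
}

// A team that wants portability, code-level steps, and minimal infra.
const portableTeam = scoreOptions({
  allInOnAws: false,
  durabilityInCode: true,
  willRunClusters: false,
});
```

Real decisions weigh these questions unequally (a hard self-hosting requirement overrides everything else, for example), so treat the tally as a conversation starter rather than a verdict.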
From my own incident history, the most painful failures weren’t “Lambda timed out” but “a multi‑step sync partially failed, left dirty state, and we had to reconstruct what happened by stitching logs from three services.” That’s why I now favor:
- Code-level steps (`step.run()`).
- Automatic checkpointing and retries.
- A first‑class Traces UI with replay and bulk cancellation.
Step Functions can approximate parts of this inside AWS; general-purpose engines can give you raw power if you’re willing to build the tooling. Inngest is the option that bakes those durability and recovery primitives directly into the DX, without tying you to a single cloud or making you run workers.
Final Verdict
- Pick Inngest if you want serverless workflow orchestration that stays portable beyond AWS, keeps durability in your own code, and removes the need to build and maintain the queue stack (workers, queues, DLQs, custom replay tools). It’s a fit for modern multi‑tenant SaaS, AI agents, and event‑driven systems that span edge, serverless, and Kubernetes.
- Pick AWS Step Functions if you’re committed to AWS as your long‑term home and want a managed, service‑integrated orchestrator with no extra vendors. Accept that you’re encoding orchestration in an AWS‑specific DSL and that moving clouds later will be expensive.
- Pick a general-purpose workflow engine if you have a platform team, must self‑host the control plane, and are prepared to own the ongoing operational complexity in exchange for deep control.
If you’re in the camp of “I never want to rebuild durable queues, retries, and replay again,” the Inngest model—code-level Steps, instant Traces, and built-in Flow Control—is closer to where modern teams are heading.