AWS Lambda Durable Functions arrived on December 2, 2025 with one clear promise: make long‑running, multi‑step workflows feel native in Lambda. You can checkpoint progress, wait minutes or months (up to a year), and resume without paying for idle compute. If your team has been juggling Step Functions for simple sequences, or hand‑rolling retry/state logic, AWS Lambda Durable Functions gives you a simpler path with far fewer moving parts.
What actually shipped—and what it means
Here’s the short, actionable recap of the launch details that matter when you’re deciding what to build next:
- Durable Functions are available for Lambda with native operations to define steps, checkpoint state, and pause/resume execution.
- Executions can suspend for up to one year while waiting on external events, human approvals, or downstream systems—without incurring compute charges during the wait.
- General availability began in US East (Ohio), with support for Node.js 22/24 and Python 3.13/3.14. You configure via the Console, API/CLI, CloudFormation, AWS SAM, or the CDK.
- They’re meant for multi‑step apps and AI workflows where you previously wrote custom orchestration code or leaned on Step Functions for relatively simple sequences.
Bottom line: it’s Lambda’s event model, but now it can hold a thought over time.
Why this matters more than another launch blog
There’s a practical tension teams have lived with for years. Step Functions are industrial‑strength, observability‑friendly, and battle‑tested—but they can feel heavy for small to medium workflows. Custom orchestration code inside Lambda is flexible—but it’s fragile, hard to reason about, and easy to break under retries or partial failures. Durable Functions splits the difference: Lambda ergonomics with built‑in state and waits.
That translates to faster feature delivery. A product manager wants “verify user, collect documents, run checks, then approve later when a human signs off.” You ship in a day, not a week. You don’t stand up queues and tables and scheduled pollers to babysit long‑lived work items. You define steps, checkpoint progress, and let the platform hold the state safely.
Is AWS Lambda Durable Functions a Step Functions replacement?
Not usually. Think of Durable Functions as a great fit for application‑level sequences where developers own the code and want minimal ceremony. Use Step Functions when you need:
- Complex branching with visual control (choice, parallel, map, dynamic fan‑out/fan‑in) and native service integrations across dozens of AWS services.
- Cross‑team workflows where a visual state machine—and audit trail—improves reliability, operations, and compliance.
- High‑volume orchestration with service quotas you already tuned and operationalized.
Use Durable Functions when you want:
- Simple to medium workflows embedded directly in Lambda code.
- Long waits (seconds to months) for human or third‑party callbacks without paying for idle compute.
- Fast iteration by the same team that owns the business logic, tests, and deployment pipeline.
The net: Durable Functions will take over much of the work where Step Functions is used only for simple sequences. For big, auditable orchestrations, Step Functions stays your go‑to.
What’s new for AI workflows?
The serverless+AI story just got sturdier. Durable Functions are ideal for agent loops that need to pause for tool results, human verification, or scheduled follow‑ups. Pair this with Amazon S3 Vectors, which became generally available on December 2, 2025 with support for up to two billion vectors per index and typical query latencies near 100 ms for frequently accessed data. That lets you keep vector data cheap and elastic while your functions sleep and wake on demand.
If you’re weighing storage for embeddings and retrieval, we published a deeper decision framework in Amazon S3 Vectors: The Buy‑or‑Build Decision Guide. Combine that guidance with Durable Functions and you can stand up resilient RAG or agent workflows without over‑architecting.
Hands‑on: a 7‑step rollout plan your team can run in one sprint
Here’s how we’ve piloted Durable Functions with product teams without derailing roadmaps:
- Pick one workflow where you currently poll a queue or keep “in‑progress” rows in a table. Target something with 3–7 steps and at least one external callback or human approval.
- Model the steps as discrete functions. Decide which steps can retry safely (idempotent) and which must be guarded with a uniqueness key.
- Define waits and signals. For human approvals, wire a signed link that posts a signal to resume. For third‑party callbacks, use an API Gateway path or EventBridge rule that targets the specific durable execution.
- Implement checkpointing. Persist only what you need to resume deterministically: prior responses, correlation IDs, and any token needed to continue external work.
- Plumb observability. Emit a structured log per step with executionId, stepName, attempt, and correlationId. Create a one‑page dashboard: inflight count, average wait time, failure rate by step.
- Chaos test your waits. Kill the network, drop callbacks, send them twice, send them late. Verify the durable execution recovers, de‑dupes, and either times out or lands in a resolvable state.
- Roll out behind a flag. Shadow‑run the durable path for a subset of users or test tenants. Compare cycle times and failure modes to your legacy orchestration.
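To make the checkpointing step in the plan above concrete, here is a toy, in‑memory sketch of checkpoint‑and‑replay. It is not the AWS durable execution SDK: `CheckpointedRun`, `step`, the `store` dict, and the two example step functions are hypothetical names invented for illustration. It only shows the general idea that a completed step's result is recorded, so a resumed run reuses it instead of executing the step again.

```python
from typing import Any, Callable, Dict


class CheckpointedRun:
    """Toy model of checkpoint-and-replay; NOT the AWS durable execution SDK."""

    def __init__(self, store: Dict[str, Any]) -> None:
        # `store` stands in for whatever durable storage the platform provides.
        self.store = store

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        # A step that already completed returns its recorded result
        # instead of running again.
        if name in self.store:
            return self.store[name]
        result = fn()
        self.store[name] = result  # checkpoint before moving on
        return result


# Counters so we can observe how often each step body really executes.
calls = {"verify": 0, "checks": 0}


def verify_user() -> dict:
    calls["verify"] += 1
    return {"user_id": "u-123", "verified": True}


def run_checks() -> dict:
    calls["checks"] += 1
    return {"risk": "low"}


def workflow(run: CheckpointedRun) -> dict:
    user = run.step("verify_user", verify_user)
    checks = run.step("run_checks", run_checks)
    return {"user": user["user_id"], "risk": checks["risk"]}


store: Dict[str, Any] = {}
first = workflow(CheckpointedRun(store))   # initial invocation
second = workflow(CheckpointedRun(store))  # simulated resume after a wait
```

Both runs return the same result, and each step body executes once. Keeping the checkpointed values small (IDs, prior responses, tokens) is what keeps the replay deterministic and cheap.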
People also ask
How long can a durable function run?
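An execution can stay suspended for up to one year while it waits on an external event, human approval, or downstream system, with no compute charges during the wait. That’s a good fit for approvals, billing holds, or vendor SLAs that stretch beyond days.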
Which runtimes are supported?
At launch: Node.js 22 and 24; Python 3.13 and 3.14. If you’re eyeing Node.js 24 anyway, our zero‑drama upgrade notes are here: Node.js 24 LTS: The Zero‑Drama Upgrade Plan.
Is Durable Functions cheaper than Step Functions?
It depends on your pattern. With Durable Functions you pay no compute charges while an execution waits, and you avoid running always‑on orchestrators you might have built yourself. Step Functions, meanwhile, can be more cost‑efficient for very large fan‑out workflows because of its native integrations and pricing model. Price both with realistic volumes.
Does this make retries and idempotency “go away”?
No. Durable Functions handle checkpoints and resumption, but you still own idempotency at every call boundary (payments, emails, external APIs). Treat every step as “at least once” and design accordingly.
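One way to act on the “at least once” advice is to derive a stable idempotency key per execution and step, and have the downstream system de‑duplicate on it. The sketch below assumes a key built from a hypothetical `execution_id` and `step_name`; `FakePaymentProvider` is a stand‑in for a real provider that honors idempotency keys, not any actual API.

```python
import hashlib
from typing import Dict


def idempotency_key(execution_id: str, step_name: str) -> str:
    # Same execution + same step always yields the same key,
    # so a retried step presents an identical token downstream.
    raw = f"{execution_id}:{step_name}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakePaymentProvider:
    """Stand-in for a downstream system that de-duplicates on a key."""

    def __init__(self) -> None:
        self._receipts: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, key: str, amount_cents: int) -> str:
        # A repeated key returns the original receipt without charging again.
        if key in self._receipts:
            return self._receipts[key]
        self.charges_made += 1
        receipt = f"rcpt-{self.charges_made}"
        self._receipts[key] = receipt
        return receipt


provider = FakePaymentProvider()
key = idempotency_key("exec-42", "charge_payment")
receipt_first = provider.charge(key, 1999)
receipt_retry = provider.charge(key, 1999)  # simulated retry of the same step
```

The retry gets the original receipt back and the customer is charged once. A different step or a different execution produces a different key, so legitimate new charges still go through.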
Architecture patterns you’ll actually build in Q1
1) Human‑in‑the‑loop KYC
Steps: capture docs, run automated checks, pause for manual review, resume to provision account, send welcome sequence. The manual review might land days later; you won’t pay Lambda compute while waiting. Each step emits a correlationId so customer support can trace the journey in one view.
2) AI agent with vector retrieval
Steps: receive task, retrieve context from an S3 vector index, call your model/tool, store outputs, optionally pause for a human verify step, resume to post results. Durable Functions give you reliable long waits; S3 Vectors keeps your embeddings cheap at scale. If you’re still deciding on storage, revisit our S3 Vectors decision guide linked above.
3) Order orchestration across vendors
Steps: validate cart, reserve inventory, charge payment, create shipments across multiple carriers, pause for carrier callback, reconcile and notify. Carriers often call back hours later; the workflow remains stable without idle compute burn.
Operational realities: what can bite you
Cold starts still matter. Durable Functions don’t eliminate cold starts. If the first step is latency‑sensitive, keep that handler warm using provisioned concurrency or design a fast pre‑check path.
Keep your steps stateless. Checkpoint the minimum state needed to resume. Don’t stash large payloads—persist large artifacts to S3 and reference by key.
Idempotency tokens are non‑negotiable. Upstream retries happen. Every external call (email, payment, ticket creation) should include a token your downstream system treats as unique.
Observability by construction. Bake a consistent schema into every step log. We use executionId, stepName, attempt, userId/tenantId, and elapsedMs. Alert on “stuck” waits past a business‑defined SLA.
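A consistent step log can be as small as one JSON line per attempt. The following is a minimal sketch using only the standard library; the wrapper name `run_logged_step`, the `emit` parameter, and the extra `status` field are assumptions for illustration, while the other field names follow the schema listed above.

```python
import json
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_logged_step(
    execution_id: str,
    step_name: str,
    attempt: int,
    tenant_id: str,
    fn: Callable[[], T],
    emit: Callable[[str], None] = print,
) -> T:
    """Run one step and emit a single structured log line for it."""
    started = time.monotonic()
    status = "error"  # overwritten only if fn() returns normally
    try:
        result = fn()
        status = "ok"
        return result
    finally:
        # Emitted on success and on failure, always with the same keys.
        elapsed_ms = int((time.monotonic() - started) * 1000)
        emit(json.dumps({
            "executionId": execution_id,
            "stepName": step_name,
            "attempt": attempt,
            "tenantId": tenant_id,
            "elapsedMs": elapsed_ms,
            "status": status,
        }, sort_keys=True))
```

Because every line carries the same keys, a dashboard query for failure rate by step or for slow steps becomes a simple filter on `stepName`, `status`, and `elapsedMs`.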
Region and runtime guardrails. Launch started in Ohio with specific Node/Python versions. If your production region or runtime differs, plan a phased adoption or wait for your region to land; don’t rip out mature Step Functions workflows to chase novelty.
Cost strategy: pair durable workflows with smart commitments
Your workflow runs touch more than Lambda. If you’re hitting relational or key‑value databases along the path, new Database Savings Plans offer up to 35% savings across engines, families, sizes, and Regions on a one‑year commitment. That flexibility lets you modernize (say, RDS to Aurora or DynamoDB) without losing the discount. For a practical buying framework, start with our guide: AWS Database Savings Plans: The Practical Playbook.
For serverless‑native AI retrieval, Amazon S3 Vectors’ general availability brought higher scale and lower cost for embeddings. Durable Functions plus S3 Vectors is a sweet spot for cost control and performance in agentic apps.
Let’s get practical: migration and greenfield checklists
Greenfield (new workflow)
- Model 3–7 steps with clear inputs/outputs and failure modes.
- Define timeouts and business SLAs per step; add a global “max elapsed time.”
- Decide your wait strategy: callback URL, EventBridge, or human approval UI.
- Create a single tracing context—propagate it across steps, logs, and outbound calls.
- Store artifacts in S3 and reference by key; never pass megabyte payloads between steps.
Brownfield (replacing ad‑hoc orchestration)
- Inventory current queues/tables and cron jobs that “babysit” state; target the noisiest path first.
- Add idempotency keys everywhere you touch the outside world.
- Shadow‑run the durable path for low‑risk tenants or a beta cohort.
- Keep Step Functions where you rely on its visual audit trail or fan‑out power.
- Stage a rollback: on error spikes, route traffic back to the legacy path via a feature flag.
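The shadow‑run and rollback items above both come down to a routing decision per tenant. Here is a minimal sketch of that decision, assuming a percentage rollout plus a kill switch; `in_durable_cohort`, `choose_path`, and the flag names are hypothetical, and in practice the two values would come from your feature‑flag system.

```python
import hashlib


def in_durable_cohort(tenant_id: str, rollout_percent: int) -> bool:
    # Hash the tenant ID into a stable bucket from 0 to 99 so a tenant
    # stays on the same path across requests (distribution is approximate).
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < rollout_percent


def choose_path(tenant_id: str, rollout_percent: int, kill_switch: bool) -> str:
    # The kill switch wins over everything: flipping it sends all
    # traffic back to the legacy orchestration.
    if kill_switch:
        return "legacy"
    return "durable" if in_durable_cohort(tenant_id, rollout_percent) else "legacy"
```

Hashing the tenant ID, rather than picking randomly per request, keeps each tenant's executions on one path, which makes cycle‑time and failure comparisons between the two paths meaningful.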
Runtimes, versions, and the boring but important stuff
If you’re on Node.js 18 or 20, this is a good trigger to plan your Node.js 24 move so you can adopt Durable Functions without running mixed runtimes. We maintain a pragmatic upgrade plan here: Node.js 24 LTS: The Zero‑Drama Upgrade Plan. Python teams running 3.11 should plan test coverage for 3.13 or 3.14; retune cold‑start‑sensitive handlers and rebuild layers with the new runtimes.
On IAM, start with least privilege for durable execution APIs and your callback endpoints. If you’re generating policies programmatically, keep a deny list of destructive actions that are never allowed from a resumed step.
When not to use Durable Functions
There are times you should hold off:
- You need dozens of native AWS service integrations and complex visual branches—Step Functions is better.
- Your compliance team depends on Step Functions’ state machine history and visual execution trail for audits.
- Your region/runtime isn’t supported yet. Don’t hop Regions just to try the feature if it complicates data residency or latency.
- You need sub‑second start‑to‑finish latency. Durable Functions is about reliability over time, not shaving milliseconds.
What about Lambda Managed Instances?
AWS also introduced a way to run Lambda on EC2 under the hood while keeping the serverless model. That’s intriguing for specialized hardware or cost tuning under EC2 pricing. For most teams, adopt Durable Functions first; you’ll feel the impact immediately. Explore Managed Instances later if you need specific EC2 capabilities in your Lambda fleet.
Security and governance notes
Treat callback endpoints like privileged APIs. Use one‑time tokens, short expirations, and strict audience claims. For human approvals, sign links that bind user identity and workflow instance; log the action with IP address, user agent, and timestamp. Add safeguards in code that refuse any step that could destroy resources or double‑charge customers, even if an attacker obtains a callback token.
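As one possible shape for a signed approval link, the sketch below builds an HMAC‑signed token that binds the approver and the workflow instance and carries an expiry. The function names and claim keys (`exec`, `sub`, `exp`) are assumptions for illustration. It covers signing and expiry only; enforcing one‑time use would additionally need a server‑side record of redeemed tokens.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def sign_approval(secret: bytes, execution_id: str, approver: str, expires_at: int) -> str:
    # The payload binds the approver identity and the workflow instance.
    payload = json.dumps(
        {"exec": execution_id, "sub": approver, "exp": expires_at}, sort_keys=True
    ).encode("utf-8")
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(sig)}"


def verify_approval(
    secret: bytes, token: str, execution_id: str, now: Optional[int] = None
) -> Optional[str]:
    """Return the approver if the token is valid for this execution, else None."""
    now = int(time.time()) if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    try:
        claims = json.loads(payload)
    except ValueError:
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    if claims.get("exec") != execution_id or not isinstance(exp, int) or exp <= now:
        return None
    sub = claims.get("sub")
    return sub if isinstance(sub, str) else None
```

The signature is checked before the claims are parsed, so a tampered payload is rejected without being interpreted. A token minted for one execution does not verify against another, which limits the damage if a link leaks.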
Finally, wire guardrails for “long waits.” A zombie execution is an ops smell. Alert at T+X hours past expected completion, surface stuck steps in a daily Slack digest, and auto‑expire workflows that exceed business validity windows.
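The guardrail above can be expressed as a small classifier run on a schedule over waiting executions. This is a sketch under assumptions: the `WaitingExecution` record and its fields are hypothetical, and how you list waiting executions depends on your own tracking, which is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class WaitingExecution:
    execution_id: str
    step_name: str
    waiting_since: datetime
    expected_within: timedelta  # business SLA for this wait
    valid_for: timedelta        # business validity window


def classify(e: WaitingExecution, now: datetime) -> str:
    waited = now - e.waiting_since
    if waited > e.valid_for:
        return "expire"  # past the validity window: auto-expire
    if waited > e.expected_within:
        return "alert"   # past the SLA: surface in alerts or the digest
    return "ok"


def digest(executions: List[WaitingExecution], now: datetime) -> List[Tuple[str, str, str]]:
    # Only executions that need attention make it into the digest.
    rows = []
    for e in executions:
        state = classify(e, now)
        if state != "ok":
            rows.append((e.execution_id, e.step_name, state))
    return rows


now = datetime(2025, 12, 10, 12, 0, tzinfo=timezone.utc)
sla, window = timedelta(hours=4), timedelta(hours=72)
waiting = [
    WaitingExecution("exec-1", "manual_review", now - timedelta(hours=1), sla, window),
    WaitingExecution("exec-2", "manual_review", now - timedelta(hours=6), sla, window),
    WaitingExecution("exec-3", "carrier_callback", now - timedelta(hours=100), sla, window),
]
report = digest(waiting, now)
```

Separating the SLA from the validity window gives you two distinct actions: a human nudge when a wait runs late, and an automatic expiry when the work no longer makes business sense.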
What to do next
- Run the 7‑step pilot above on one real workflow within two weeks.
- Choose a vector store strategy if you’re building agents; start with our S3 Vectors decision guide.
- Trim database spend while you modernize orchestration—review our Database Savings Plans playbook to model commitments.
- Align runtimes with the launch set; if you’re a Node shop, use our Node.js 24 LTS upgrade plan.
- If you want a partner to help design or pilot the first durable flow, see our services and reach out via ByBowu contacts.
Zooming out
Durable Functions isn’t fluff. It collapses a pile of boilerplate—queues, polling, tables, bespoke schedulers—into a first‑class Lambda experience. That means fewer failure modes, less glue code, simpler deployments, and faster delivery. Keep Step Functions where it shines; shift small and medium sequences to Durable Functions for speed and focus. Do that with a strong observability scheme and an adult relationship with idempotency, and your team will ship calmer, more reliable workflows—this month, not next quarter.