You usually reach for Step Functions when a workflow outgrows a single Lambda invocation, either in duration or in complexity. Lambda durable functions, announced at re:Invent 2025, are another path. You write a Lambda handler and use a durable execution SDK that checkpoints after each durable operation and replays your code across multiple invocations, so a single execution can span anywhere from a few minutes to one year.
Essentially it's a Lambda programming model + SDK (JavaScript/TypeScript and Python) that provides a DurableContext with durable operations (steps, waits, callbacks, parallel/map, child contexts). Each durable execution persists a checkpoint log and resumes by replaying from the top and skipping completed operations.
You enable durable execution when creating the Lambda function (you can’t retrofit an existing one), by setting DurableConfig (timeout + retention) via console/CLI/API/IaC; then you deploy code that uses the SDK wrapper/decorator. Then you invoke the function (typically asynchronously) using a qualified version/alias, monitor progress via CloudWatch metrics and EventBridge status-change events, inspect/stop executions via durable execution management commands, and complete callbacks via dedicated Lambda APIs.
Lambda Durable Execution Lifecycle and Replay
A durable execution is a logical run of your workflow that may span many Lambda invocations. The SDK checkpoints after each durable operation, and when the function resumes it replays your handler from the beginning while substituting stored results for already-completed operations. Basically, you write your orchestration code so that anything non-deterministic or side-effecting happens inside durable operations (especially step).
The Four Phases of Lambda Durable Functions
Start: A durable execution begins when you invoke the function with durable execution enabled.
Checkpoint: Each durable operation (e.g., step, wait, callback) causes a checkpoint record to be persisted before the execution moves on.
Suspend and resume: Waits/callbacks intentionally end the current Lambda invocation and schedule/trigger a later resume. Lambda invokes the function again later and the SDK replays to the suspension point.
Complete: When your handler returns, the execution completes. Monitoring surfaces succeed/failed/timed out outcomes.
Replay is not a “resume from line N”
Replay means your code runs again from the top, but durable operations short-circuit by returning stored results. Determinism therefore matters for everything outside those operations: if you read Date.now(), generate random IDs, or read environment values that change between runs outside a durable operation, the replay can take a different path than the original run and hit nondeterminism errors.
Here's a code example using TypeScript:
```typescript
import { withDurableExecution, DurableContext } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(async (event: any, ctx: DurableContext) => {
  // Pure orchestration logic here (replay-safe).
  const input = event.input;

  const validated = await ctx.step("validate", async () => {
    // Do side effects and nondeterministic reads inside steps.
    return { ok: true, input };
  });

  if (!validated.ok) return { status: "rejected" };

  await ctx.wait("cooldown", { seconds: 30 });
  return { status: "approved" };
});
```
Now your function is a deterministic program that can be re-run, and durable ops are the boundaries where results are persisted and re-used.
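To make the checkpoint-and-replay mechanics concrete, here is a toy, synchronous model of them. It is only a sketch and not the AWS SDK: makeContext, invoke, the SUSPEND sentinel, and the log format are all made up for illustration, and the second invoke call stands in for Lambda resuming the function after the wait elapses.

```javascript
// Toy model of checkpoint-and-replay. This is NOT the AWS SDK; all names are hypothetical.
const SUSPEND = Symbol("suspend");

function makeContext(log) {
  let cursor = 0; // next checkpoint-log position for this invocation

  // Return the stored entry for this operation, or null if it has not run yet.
  function replayed(name) {
    const entry = log[cursor];
    if (!entry) return null;
    if (entry.name !== name) {
      throw new Error(`Nondeterminism: log has "${entry.name}", code asked for "${name}"`);
    }
    cursor += 1;
    return entry;
  }

  return {
    step(name, fn) {
      const entry = replayed(name);
      if (entry) return entry.result; // completed earlier: skip the body
      const result = fn();
      log.push({ name, result }); // checkpoint before moving on
      cursor += 1;
      return result;
    },
    wait(name) {
      if (replayed(name)) return; // the wait already happened
      log.push({ name, result: null });
      throw SUSPEND; // end this invocation; a later one resumes
    },
  };
}

// One invocation: run the handler from the top against the shared log.
function invoke(handler, event, log) {
  try {
    return { status: "completed", result: handler(event, makeContext(log)) };
  } catch (err) {
    if (err === SUSPEND) return { status: "suspended" };
    throw err;
  }
}

let validateRuns = 0;
function workflow(event, ctx) {
  const validated = ctx.step("validate", () => {
    validateRuns += 1; // counts real executions of the step body
    return { ok: event.amount > 0 };
  });
  if (!validated.ok) return { status: "rejected" };
  ctx.wait("cooldown");
  return { status: "approved" };
}

const log = [];
const first = invoke(workflow, { amount: 10 }, log); // runs "validate", suspends at "cooldown"
const second = invoke(workflow, { amount: 10 }, log); // replays from the top, skips both
console.log(first.status, second.status, validateRuns);
// prints: suspended completed 1
```

The handler ran twice, but the body of "validate" ran once. If the second run asked for a different operation name at the same log position, this toy would throw a nondeterminism error, which is roughly the failure mode you are designing against.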
Lambda Durable Operations: steps, waits, callbacks
Durable functions give you three core primitives: steps (checkpointed work with retries), waits (pause without compute billing), and callbacks (pause until an external system responds). You access them via DurableContext methods. Design-wise, you should put all side effects in steps, and use waits/callbacks to “sleep” without tying up a Lambda invocation.
Steps
A step runs code and records its result to the checkpoint log. On replay, completed steps return the stored result instead of re-running. Steps can be retried, which implies they can run more than once unless you configure stricter semantics.
Waits
A wait checkpoints, ends the current invocation, and schedules resumption later. You use waits for backoff, human approval timeouts, polling intervals, etc. The key here is that you stop paying for Lambda compute while waiting.
Callbacks
Callbacks give you a callback ID you can hand to an external system. Your durable execution suspends until that system calls the Lambda callback completion APIs.
The callback ID is the primary identifier in the callback completion APIs (URI path parameter). Treat it as sensitive data and scope who can use it.
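A small in-memory sketch shows why the callback ID deserves that care: whoever holds it can complete the callback. CallbackRegistry and its methods are hypothetical stand-ins, not the Lambda API, and rejecting a second completion is an assumption of this toy rather than documented service behavior.

```javascript
// Toy in-memory callback registry. Hypothetical names; this is not the Lambda API.
class CallbackRegistry {
  constructor() {
    this.callbacks = new Map(); // callbackId -> { status, payload }
    this.counter = 0;
  }

  // Called by the workflow when it suspends on a callback.
  create(executionName) {
    this.counter += 1;
    // Real callback IDs are opaque tokens; this readable format is for the demo only.
    const callbackId = `${executionName}:cb-${this.counter}`;
    this.callbacks.set(callbackId, { status: "PENDING", payload: null });
    return callbackId;
  }

  // Called by the external system. Anyone holding the ID can do this.
  complete(callbackId, status, payload) {
    const entry = this.callbacks.get(callbackId);
    if (!entry) throw new Error(`Unknown callback ID: ${callbackId}`);
    if (entry.status !== "PENDING") throw new Error(`Callback already ${entry.status}`);
    entry.status = status;
    entry.payload = payload;
    return entry;
  }

  statusOf(callbackId) {
    const entry = this.callbacks.get(callbackId);
    return entry ? entry.status : "UNKNOWN";
  }
}

const registry = new CallbackRegistry();
const callbackId = registry.create("loan-123");
registry.complete(callbackId, "SUCCEEDED", { decision: "APPROVE" });
console.log(registry.statusOf(callbackId));
// prints: SUCCEEDED
```

In a real setup you would keep the ID out of logs and URLs that many people can read, and restrict which principals may call the completion APIs.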
Rules of Thumb
If the code does I/O or side effects, put it in step and make it idempotent.
If the code needs to wait for some period of time, use wait and don’t poll inside a single Lambda invocation.
If the code needs to wait for some external call, use callback + completion APIs.
Invocation modes and idempotency keys
Durable functions are designed to be invoked asynchronously, so the caller doesn’t block while the workflow runs for minutes, hours, or days. To prevent duplicate starts, supply an execution name when you invoke the function and treat it as an idempotency key. If an execution with that name is already running or completed, the service rejects the duplicate start with a DurableExecutionAlreadyExists error.
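One way to get a usable execution name is to derive it from a stable field of the triggering event. The sketch below assumes events carry a source and an id, and it assumes a conservative character set and a 64-character cap for names; check the real constraints before relying on either. ExecutionStore is a toy stand-in for the service-side duplicate check, not a real client.

```javascript
// Hypothetical helper: derive a stable execution name from the triggering event.
function executionNameFor(event) {
  const raw = `${event.source}-${event.id}`;
  // Assumption: letters, digits, hyphens and underscores only, capped at 64 characters.
  return raw.replace(/[^A-Za-z0-9_-]/g, "_").slice(0, 64);
}

// Toy stand-in for the duplicate-start check described above.
class ExecutionStore {
  constructor() {
    this.names = new Set();
  }
  start(name) {
    if (this.names.has(name)) {
      const err = new Error(`Execution "${name}" already exists`);
      err.name = "DurableExecutionAlreadyExists";
      throw err;
    }
    this.names.add(name);
    return { executionName: name, status: "RUNNING" };
  }
}

const store = new ExecutionStore();
const event = { source: "orders.api", id: "evt/42" };
const name = executionNameFor(event);
store.start(name);

let duplicateError = null;
try {
  store.start(executionNameFor(event)); // the same event delivered twice
} catch (err) {
  duplicateError = err.name;
}
console.log(name, duplicateError);
// prints: orders_api-evt_42 DurableExecutionAlreadyExists
```

The important property is that the name depends only on the event's identity, so a redelivered event maps to the same name and the second start is rejected instead of running the workflow twice.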
Step Idempotency vs Start Idempotency
Start idempotency prevents duplicate workflow starts. You set it via the execution name. Step idempotency prevents duplicate side effects inside the workflow, and you set it by using step execution mode plus external idempotency keys.
If duplicates come from the event source or caller retries, enforce execution name. That way repeated attempts to trigger the same workflow for the same reason (e.g. in response to the same event) will fail correctly.
If duplicates come from retries or timeouts within a step, use AT_MOST_ONCE_PER_RETRY and external idempotency tokens. This will avoid having your steps accidentally trigger the same behavior twice.
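Here is a minimal sketch of the external idempotency token idea. PaymentService is a made-up in-memory service that deduplicates on a key, and idempotencyKeyFor is a hypothetical helper; the assumption is that your real downstream API accepts a similar key.

```javascript
// Toy downstream service that honours idempotency keys (hypothetical, in-memory).
class PaymentService {
  constructor() {
    this.charges = new Map(); // idempotencyKey -> charge
    this.requests = 0;
  }
  charge(idempotencyKey, amount) {
    this.requests += 1;
    if (!this.charges.has(idempotencyKey)) {
      this.charges.set(idempotencyKey, { chargeId: `ch_${this.charges.size + 1}`, amount });
    }
    return this.charges.get(idempotencyKey); // same answer for a repeated key
  }
}

// Build the key only from stable inputs: no timestamps, no random values.
function idempotencyKeyFor(executionName, stepName) {
  return `${executionName}/${stepName}`;
}

const payments = new PaymentService();
const key = idempotencyKeyFor("loan-123", "charge-fee");
const chargeFee = () => payments.charge(key, 25); // what the step body would do

const firstAttempt = chargeFee(); // e.g. succeeds downstream, but the step times out
const retryAttempt = chargeFee(); // the retry sends the same key
console.log(payments.requests, payments.charges.size, firstAttempt.chargeId === retryAttempt.chargeId);
// prints: 2 1 true
```

Two requests reached the service, but only one charge exists. Because the key is derived from the execution name and step name, it stays the same across retries and replays.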
If you can’t define a stable idempotency key, Lambda Durable functions will just end up being very complex, and you should consider Step Functions with explicit task tokens, or a saga pattern for event-driven architectures.
Lambda Durable Functions vs Step Functions
AWS Lambda durable functions are a Lambda execution mode plus a Durable Execution SDK that adds durable primitives (steps, waits, callbacks, etc.) to a single Lambda handler. Your code is replayed from the start on resume, and completed durable operations are skipped using stored checkpoints.
AWS Step Functions is a separate orchestration service where you define a state machine (ASL JSON) and run executions. You wire tasks to AWS services (including Lambda) and get built-in workflow history + visualization.
Execution model and guarantees
In Step Functions, your source of truth is the state machine graph (ASL). Each state transition is managed by Step Functions, which records progress and drives task scheduling. Pricing and limits are centered around state transitions / executions.
Standard workflows guarantee exactly-once workflow execution and can run up to one year. Express workflows can run up to five minutes; asynchronous Express workflows guarantee at-least-once execution, while synchronous Express workflows guarantee at-most-once.
For Lambda Durable Functions, your source of truth is the Lambda handler code, but durable operations create checkpoints. On resume, the SDK replays the handler from the beginning and returns stored results for previously completed durable operations. This creates an important constraint: your code must be deterministic across replays (especially any logic outside durable operations). The SDK explicitly warns about replay behavior and determinism.
Waits and human-in-the-loop
Step Functions uses patterns like Wait states or callback (“task token”) patterns. In the callback pattern, Step Functions hands you a taskToken, and an external actor later resumes the workflow by calling APIs like SendTaskSuccess (or the failure variant).
Lambda durable functions, on the other hand, use durable operations like wait() and callback primitives like createCallback() / waitForCallback(). When waiting, the invocation terminates and later resumes, with no on-demand compute billed during the wait. External systems resume the execution via the Lambda APIs SendDurableExecutionCallbackSuccess / SendDurableExecutionCallbackFailure.
Integrations
Step Functions has native integrations that let you orchestrate many AWS services without creating Lambda functions just to call a service. You also get visual workflow debugging/history (though in my opinion the UI is pretty ugly).
With Lambda Durable Functions you can integrate with anything via AWS SDK calls inside steps (or use context.invoke() to call other Lambda functions as durable operations), which is more natural if you're more used to code. However, you’re building the control flow in code and you won’t get Step Functions’ state-machine-level service integration patterns that would let you just call a service without writing the code for that call.
Cost model
In Step Functions Standard workflows you pay per state transition (retries count too). The free tier includes 4,000 transitions/month (does not expire), and in us-east-1 you pay $0.000025 per transition. Step Functions Express workflows make you pay per request + duration (rounded to 100 ms) + memory used (billed in 64 MB chunks). In us-east-1 you'll pay $1.00 per million requests and $0.00001667 per GB-second (plus tiering at higher GB-hours).
With Lambda Durable Functions you still pay normal Lambda requests + duration (including sub-invocations due to resume/replay), plus durable charges:
Durable operations: $8.00 per million operations.
Data written by durable operations (GB): $0.25 per GB.
Data retained (GB-month, prorated): $0.15 per GB-month.
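To compare the two models on your own numbers, a back-of-the-envelope calculator helps. The sketch below uses only the us-east-1 prices quoted above, ignores the Step Functions free tier, and leaves out Lambda requests and duration on both sides. The workload figures (5 transitions or 6 durable operations per execution, 2 GB written, 1 GB-month retained) are invented for illustration.

```javascript
// Prices quoted in the text (us-east-1). Treat them as inputs and re-check the pricing pages.
const PRICES = {
  sfnStandardPerTransition: 0.000025,
  durablePerOperation: 8.0 / 1e6,
  durablePerGbWritten: 0.25,
  durablePerGbMonthRetained: 0.15,
};

// Step Functions Standard: pay per state transition (free tier ignored here).
function stepFunctionsStandardCost(transitions) {
  return transitions * PRICES.sfnStandardPerTransition;
}

// Durable charges only; Lambda requests and duration come on top.
function durableChargesCost({ operations, gbWritten, gbMonthsRetained }) {
  return (
    operations * PRICES.durablePerOperation +
    gbWritten * PRICES.durablePerGbWritten +
    gbMonthsRetained * PRICES.durablePerGbMonthRetained
  );
}

// Hypothetical workload: one million executions per month.
const executions = 1e6;
const sfnCost = stepFunctionsStandardCost(executions * 5);
const durableCost = durableChargesCost({
  operations: executions * 6,
  gbWritten: 2,
  gbMonthsRetained: 1,
});
console.log(sfnCost.toFixed(2), durableCost.toFixed(2));
// prints: 125.00 48.65
```

Neither number is the full bill. Step Functions tasks that call Lambda still pay for those invocations, and durable executions pay for every replayed invocation, so plug in your own compute estimate before drawing conclusions.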
Decision guidance
Use Step Functions when you need:
A workflow as a first-class managed artifact (ASL + visualization + execution history).
Broad AWS-service orchestration with less “glue code” (especially across many services).
Standard’s semantics and tooling for long-running orchestration (up to 1 year) with exactly-once workflow execution.
Use Lambda durable functions when you need:
A code-first workflow where the orchestration logic lives in the Lambda handler and you want to use durable primitives (step, wait, waitForCallback, parallel, etc.) with checkpoint+replay.
To wait without paying for idle compute while staying inside the Lambda programming model, and you’re willing to design for determinism/replay.
Therapy after dealing with Amazon States Language (ASL).
Building a Human Approval Workflow in Step Functions and Lambda Durable Functions
Let's see an example. Say our app needs to process a loan application with the following process:
Risk scoring (fast, deterministic-ish compute + external call)
If risk is low, approve automatically
Else, request human review and pause until the reviewer approves/rejects
Finalize decision (write to DB / emit event)
This example focuses on the “pause + resume via external callback” pattern because it shows the biggest mechanical difference between Step Functions and Lambda Durable Functions.
Step Functions implementation (Standard workflow + task token callback)
1) State machine definition (ASL JSON)
```json
{
  "Comment": "Loan approval with human-in-the-loop review (Standard)",
  "StartAt": "RiskScore",
  "States": {
    "RiskScore": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${RiskScoreLambdaArn}",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "Next": "RiskDecision"
    },
    "RiskDecision": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.riskScore",
          "NumericLessThanEquals": 30,
          "Next": "AutoApprove"
        }
      ],
      "Default": "RequestHumanApproval"
    },
    "AutoApprove": {
      "Type": "Pass",
      "Parameters": {
        "decision": "APPROVE",
        "reason": "Auto-approved by risk threshold",
        "application.$": "$"
      },
      "Next": "FinalizeDecision"
    },
    "RequestHumanApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "TimeoutSeconds": 86400,
      "Parameters": {
        "FunctionName": "${RequestApprovalLambdaArn}",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "application.$": "$"
        }
      },
      "ResultPath": "$.review",
      "Next": "FinalizeDecision"
    },
    "FinalizeDecision": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${FinalizeLambdaArn}",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "End": true
    }
  }
}
```
2) riskScore Lambda
```javascript
// riskScore.js
export const handler = async (event) => {
  // event: { applicationId, applicant, amount, ... }
  // Keep this deterministic-ish; any nondeterminism should be inside the Lambda call itself (Step Functions won't replay your code).
  const riskScore = Math.min(100, Math.max(0, (event.amount ?? 0) / 1000)); // placeholder logic
  return {
    ...event,
    riskScore,
    scoredAt: new Date().toISOString()
  };
};
```
3) requestApproval Lambda (sends token to reviewer channel)
This Lambda receives taskToken and must deliver it to your approver UI/system. Step Functions will wait until something calls SendTaskSuccess or SendTaskFailure for that token.
```javascript
// requestApproval.js
// This function is invoked by Step Functions with a taskToken.
// It should NOT call SendTaskSuccess itself unless you're auto-approving.
// It should send the token to your approval system (email, Slack, ticket, etc.)
export const handler = async (event) => {
  const { taskToken, application } = event;

  // In production: publish to SNS/SQS/EventBridge, create a ticket, notify Slack, etc.
  console.log("Approval requested", {
    applicationId: application.applicationId,
    taskTokenPreview: taskToken.slice(0, 16) + "..."
  });

  // Step Functions "waitForTaskToken" pattern keeps the state open until callback.
  // Return quickly (this Lambda call is just the submission step).
  return {
    status: "PENDING_REVIEW",
    applicationId: application.applicationId
  };
};
```
4) “Approver callback” Lambda (calls SendTaskSuccess / SendTaskFailure)
Mechanics:
It receives { taskToken, decision, notes } (e.g., from API Gateway).
On approve, call SendTaskSuccess with output JSON.
On reject, call SendTaskFailure.
```javascript
// approverCallback.js
import { SFNClient, SendTaskSuccessCommand, SendTaskFailureCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

export const handler = async (event) => {
  const { taskToken, decision, notes } = event;
  if (!taskToken) throw new Error("Missing taskToken");

  if (decision === "APPROVE") {
    const output = JSON.stringify({ decision: "APPROVE", notes: notes ?? null });
    await sfn.send(new SendTaskSuccessCommand({ taskToken, output }));
    return { ok: true };
  }

  // Reject path: you can send structured error info.
  await sfn.send(
    new SendTaskFailureCommand({
      taskToken,
      error: "RejectedByReviewer",
      cause: notes ?? "No notes"
    })
  );
  return { ok: true };
};
```
Lambda durable functions implementation (single handler + SDK callback)
0) What you need in code
You wrap your handler with the durable wrapper and use DurableContext operations.
```shell
npm install @aws/durable-execution-sdk-js
```
1) Durable workflow handler (steps + waitForCallback)
This version:
Calls context.step() for risk scoring and finalization
Uses context.waitForCallback() to create a callback ID, run the submitter function, and then block until callback completes (or times out)
```javascript
// durableLoanWorkflow.js
import { withDurableExecution } from "@aws/durable-execution-sdk-js";

/**
 * event: { applicationId, amount, applicant, ... }
 */
export const handler = withDurableExecution(async (event, context) => {
  // Step 1: risk score (checkpointed)
  const scored = await context.step("risk-score", async () => {
    const riskScore = Math.min(100, Math.max(0, (event.amount ?? 0) / 1000)); // placeholder
    return { ...event, riskScore, scoredAt: new Date().toISOString() };
  });

  let decision;
  if (scored.riskScore <= 30) {
    decision = { decision: "APPROVE", notes: "Auto-approved by threshold" };
  } else {
    // Step 2: human approval via callback
    // waitForCallback creates callbackId, runs submitter (send request), then waits.
    decision = await context.waitForCallback(
      "human-approval",
      async (callbackId) => {
        // In production: publish callbackId + context to SNS/SQS/EventBridge, create ticket, etc.
        console.log("Approval requested", {
          applicationId: scored.applicationId,
          callbackIdPreview: callbackId.slice(0, 16) + "..."
        });
      },
      { timeout: { hours: 24 } }
    );
  }

  // Step 3: finalize (checkpointed)
  const finalized = await context.step("finalize-decision", async () => {
    // Write to DynamoDB, emit EventBridge event, etc.
    return {
      applicationId: scored.applicationId,
      riskScore: scored.riskScore,
      ...decision,
      finalizedAt: new Date().toISOString()
    };
  });

  return finalized;
});
```
Note that this is a single Lambda handler that will be replayed on resume, skipping completed durable operations using checkpoint data. The callback token (“callbackId”) is completed via Lambda callback APIs (SendDurableExecutionCallbackSuccess / Failure).
2) Approver callback function (calls Lambda callback APIs)
Durable callback completion APIs:
Success: SendDurableExecutionCallbackSuccess takes a CallbackId (URI) and a binary Result payload.
Failure: SendDurableExecutionCallbackFailure takes a CallbackId and error fields like ErrorType, ErrorMessage, etc.
```javascript
// durableApproverCallback.js
import {
  LambdaClient,
  SendDurableExecutionCallbackSuccessCommand,
  SendDurableExecutionCallbackFailureCommand
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

export const handler = async (event) => {
  const { callbackId, decision, notes } = event;
  if (!callbackId) throw new Error("Missing callbackId");

  if (decision === "APPROVE") {
    const resultObj = { decision: "APPROVE", notes: notes ?? null };
    // API expects binary payload; SDK accepts Uint8Array/Buffer for "Result"
    const resultBytes = Buffer.from(JSON.stringify(resultObj), "utf-8");
    await lambda.send(
      new SendDurableExecutionCallbackSuccessCommand({
        CallbackId: callbackId,
        Result: resultBytes
      })
    );
    return { ok: true };
  }

  await lambda.send(
    new SendDurableExecutionCallbackFailureCommand({
      CallbackId: callbackId,
      ErrorType: "RejectedByReviewer",
      ErrorMessage: notes ?? "No notes"
      // ErrorData / StackTrace optional
    })
  );
  return { ok: true };
};
```
Pricing for Lambda Durable Functions
These values are current as of 2026-02-21, and are for us-east-1. Always check the pricing page.
Lambda durable functions have three pricing components:
Lambda compute + requests (same as normal Lambda): You pay for requests and duration, at for example $0.0000133334 per GB-second for ARM, plus $0.20 per million requests. The free tier includes 1M requests/month and 400,000 GB-seconds/month.
Durable operations: Each durable operation (execution start, steps, waits, etc.) is metered; the SDK doc shows how operations count by operation type (e.g., Execution Started is 1 op; Step is 1 + retries; WaitForCallback is 3 + retries). You pay $8.00 per million operations.
Data written (GB) and data retained (GB-month): Durable operations persist checkpoints; you pay for the data written and for retained storage over time (prorated GB-month). You pay $0.25 per GB written and $0.15 per GB-month of data retained.
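As a rough illustration of how operations add up, the sketch below counts them for the loan workflow from earlier, using only the three weights quoted above (Execution Started, Step, WaitForCallback). countOperations and the plan format are hypothetical, and other operation types are deliberately left out, so check the SDK documentation for their weights.

```javascript
// Weights as quoted above. Other operation types are omitted in this sketch.
function countOperations(plan) {
  let total = 1; // Execution Started
  for (const op of plan) {
    const retries = op.retries ?? 0;
    if (op.type === "step") {
      total += 1 + retries;
    } else if (op.type === "waitForCallback") {
      total += 3 + retries;
    } else {
      throw new Error(`Unknown operation type: ${op.type}`);
    }
  }
  return total;
}

// Loan workflow, human-review path: risk-score, human-approval, finalize-decision.
const humanReviewPath = [
  { type: "step" },
  { type: "waitForCallback" },
  { type: "step" },
];
// Auto-approve path: risk-score and finalize-decision only.
const autoApprovePath = [{ type: "step" }, { type: "step" }];

const humanOps = countOperations(humanReviewPath);
const autoOps = countOperations(autoApprovePath);
console.log(humanOps, autoOps);
// prints: 6 3
```

At $8.00 per million operations, a million human-review executions would come to about $48 in operation charges, before data written, data retained, and regular Lambda compute.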
Conclusion
Durable functions are a strong fit when you need a long-running workflow but want to stay in Lambda’s programming model.
When to use durable functions
| Situation | Use Lambda durable functions? | Prefer instead | Why |
|---|---|---|---|
| One Lambda needs to pause for hours/days (waiting on a human/external system) | Yes | None | Callbacks/waits checkpoint and resume across invocations. |
| Workflow spans many AWS services and you want orchestration outside code | Maybe | Step Functions | Service-native orchestration/control plane, integrations. |
| Event source mapping (SQS/Streams) with >15 minute processing time | Not with direct invocation | Intermediary + durable | Event source mappings invoke the function synchronously, so a durable execution started that way still times out at 15 minutes; have an intermediary function start the durable execution asynchronously. |
| You can’t keep code deterministic across replay (frequent hotfixes, nondeterministic reads) | Probably not | Step Functions or saga | Replay requires determinism. Use versions/aliases. |
| Workflow risks hitting 3,000 ops / 100 MB written per execution | No | Re-architect steps or use Step Functions | Those are hard constraints per execution. |
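For the last row, a quick estimate can tell you early whether a fan-out design fits in one execution. This is a rough sketch: the function and parameter names are hypothetical, it assumes every operation writes one checkpoint of the same size, and it reads "100 MB" as 100 * 1024 * 1024 bytes, which may not match how the service measures it.

```javascript
// Limits quoted in the table above. Assumption: "100 MB" means 100 * 1024 * 1024 bytes.
const LIMITS = { maxOperations: 3000, maxBytesWritten: 100 * 1024 * 1024 };

// Rough estimate for a fan-out workflow that runs opsPerItem operations per item.
function estimateExecution({ items, opsPerItem, bytesPerCheckpoint }) {
  const operations = 1 + items * opsPerItem; // 1 for the execution start
  const bytesWritten = operations * bytesPerCheckpoint;
  return {
    operations,
    bytesWritten,
    fitsOperations: operations <= LIMITS.maxOperations,
    fitsBytes: bytesWritten <= LIMITS.maxBytesWritten,
  };
}

// Largest item count per execution that stays under the operation limit.
function maxItemsPerExecution(opsPerItem) {
  return Math.floor((LIMITS.maxOperations - 1) / opsPerItem);
}

const estimate = estimateExecution({ items: 1000, opsPerItem: 4, bytesPerCheckpoint: 2048 });
console.log(estimate.operations, estimate.fitsOperations, estimate.fitsBytes, maxItemsPerExecution(4));
// prints: 4001 false true 749
```

In this made-up case the data volume is fine but the operation count is not, so you would batch the items into several executions or do more work per step.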
