You usually reach for Step Functions when a workflow outgrows a single Lambda invocation, either in duration or in complexity. Lambda durable functions, announced at re:Invent 2025, are another path. You write a Lambda handler and use a durable execution SDK that checkpoints after each durable operation and replays your code across multiple invocations, so a single execution can last anywhere from a few minutes up to one year.

Essentially it's a Lambda programming model + SDK (JavaScript/TypeScript and Python) that provides a DurableContext with durable operations (steps, waits, callbacks, parallel/map, child contexts). Each durable execution persists a checkpoint log and resumes by replaying from the top and skipping completed operations.

You enable durable execution when creating the Lambda function (you can’t retrofit an existing one), by setting DurableConfig (timeout + retention) via console/CLI/API/IaC; then you deploy code that uses the SDK wrapper/decorator. Then you invoke the function (typically asynchronously) using a qualified version/alias, monitor progress via CloudWatch metrics and EventBridge status-change events, inspect/stop executions via durable execution management commands, and complete callbacks via dedicated Lambda APIs.

Lambda Durable Execution Lifecycle and Replay

A durable execution is a logical run of your workflow that may span many Lambda invocations. The SDK checkpoints after each durable operation, and when the function resumes it replays your handler from the beginning while substituting stored results for already-completed operations. In practice, that means writing your orchestration code so that anything non-deterministic or side-effecting happens inside durable operations (especially step).

The Four Phases of Lambda Durable Functions

  1. Start: A durable execution begins when you invoke the function with durable execution enabled.

  2. Checkpoint: Each durable operation (e.g., step, wait, callback) causes a checkpoint record to be persisted before the execution moves on.

  3. Suspend and resume: Waits/callbacks intentionally end the current Lambda invocation and schedule/trigger a later resume. Lambda invokes the function again later and the SDK replays to the suspension point.

  4. Complete: When your handler returns, the execution completes. Monitoring surfaces succeed/failed/timed out outcomes.

Replay is not a “resume from line N”

Replay means your code runs again from the top, but durable operations short-circuit by returning stored results. Determinism matters most in the code outside those operations: if you read Date.now(), generate random IDs, or read environment values that change between runs outside a durable operation, the handler can take a different path during replay and hit nondeterminism errors.

Here's a code example using TypeScript:

import { withDurableExecution, DurableContext } from "@aws/durable-execution-sdk-js";

export const handler = withDurableExecution(async (event: any, ctx: DurableContext) => {
  // Pure orchestration logic here (replay-safe).
  const input = event.input;

  const validated = await ctx.step("validate", async () => {
    // Do side effects and nondeterministic reads inside steps.
    return { ok: true, input };
  });

  if (!validated.ok) return { status: "rejected" };

  await ctx.wait("cooldown", { seconds: 30 });

  return { status: "approved" };
});

Now your function is a deterministic program that can be re-run, and durable ops are the boundaries where results are persisted and re-used.
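
To make the replay model concrete, here is a toy, synchronous simulation of checkpoint and replay. It does not use the AWS SDK: makeContext, runOnce, and Suspend are hypothetical names invented for this sketch, and the real SDK is asynchronous and keeps its checkpoint log in the service rather than in an array.

```javascript
// Toy model of checkpoint + replay. NOT the AWS SDK: all names are illustrative.
class Suspend extends Error {}

function makeContext(log) {
  let position = 0; // index of the next durable operation in this run
  return {
    // Run fn once and checkpoint its result; on replay, return the stored result.
    step(name, fn) {
      const index = position++;
      if (index < log.length) return log[index].result; // already completed
      const result = fn();
      log.push({ name, result }); // checkpoint
      return result;
    },
    // First pass: checkpoint and end the invocation. On replay: fall through.
    wait(name) {
      const index = position++;
      if (index < log.length) return;
      log.push({ name, result: null });
      throw new Suspend(name);
    },
  };
}

// One "invocation": run the handler from the top until it returns or suspends.
function runOnce(handler, event, log) {
  try {
    return { done: true, output: handler(event, makeContext(log)) };
  } catch (err) {
    if (err instanceof Suspend) return { done: false };
    throw err;
  }
}

let sideEffects = 0;
function handler(event, ctx) {
  // The random value is generated inside a step, so replay sees the same ID.
  const id = ctx.step("make-id", () => {
    sideEffects += 1;
    return `${event.prefix}-${Math.random().toString(36).slice(2, 8)}`;
  });
  ctx.wait("cooldown");
  return { id };
}

const log = [];
const first = runOnce(handler, { prefix: "loan" }, log);  // suspends at the wait
const second = runOnce(handler, { prefix: "loan" }, log); // replays, then finishes
console.log(first.done, second.done, sideEffects, log.length);
```

The handler body runs twice, but the step callback runs only once: the second run gets the ID from the log. If the random ID were generated outside the step, each run would see a different value, which is the kind of divergence replay cannot tolerate.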

Lambda Durable Operations: steps, waits, callbacks

Durable functions give you three core primitives: steps (checkpointed work with retries), waits (pause without compute billing), and callbacks (pause until an external system responds). You access them via DurableContext methods. Design-wise, you should put all side effects in steps, and use waits/callbacks to “sleep” without tying up a Lambda invocation.

Steps

A step runs code and records its result to the checkpoint log. On replay, completed steps return the stored result instead of re-running. Steps can be retried, which implies they can run more than once unless you configure stricter semantics.

Waits

A wait checkpoints, ends the current invocation, and schedules resumption later. You use waits for backoff, human approval timeouts, polling intervals, etc. The key here is that you stop paying for Lambda compute while waiting.

Callbacks

Callbacks give you a callback ID you can hand to an external system. Your durable execution suspends until that system calls the Lambda callback completion APIs.

The callback ID is the primary identifier in the callback completion APIs (URI path parameter). Treat it as sensitive data and scope who can use it.

Rules of Thumb

  • If the code does I/O or side effects, put it in step and make it idempotent.

  • If the code needs to wait for some period of time, use wait and don’t poll inside a single Lambda invocation.

  • If the code needs to wait for some external call, use callback + completion APIs.

Invocation modes and idempotency keys

Durable functions are designed to be invoked asynchronously, so the caller doesn’t block while the workflow runs for minutes, hours, or days. To prevent duplicate starts, supply an execution name when you invoke the function and treat it as an idempotency key. If an execution with that name is already running or has completed, the service rejects the duplicate start with a DurableExecutionAlreadyExists error.

Step Idempotency vs Start Idempotency

Start idempotency prevents duplicate workflow starts. You set it via the execution name. Step idempotency prevents duplicate side effects inside the workflow, and you set it by using step execution mode plus external idempotency keys.

If duplicates come from the event source or caller retries, set an execution name. That way repeated attempts to trigger the same workflow for the same reason (e.g. in response to the same event) are rejected instead of starting a second run.
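
As a rough sketch of start idempotency, an execution name can be derived from the fields that identify the triggering event, so that a redelivered event maps to the same name. The helper executionNameFor, the event shape, the allowed character set, and the 64-character cap are all assumptions made for illustration; check the service's actual naming constraints.

```javascript
// Hypothetical helper: derive a stable execution name from an event.
// The character whitelist and the 64-character cap are assumptions.
function executionNameFor(event) {
  const raw = `${event.source}-${event.type}-${event.id}`;
  return raw.replace(/[^A-Za-z0-9_-]/g, "_").slice(0, 64);
}

// The same event always yields the same name, so a retry cannot start a second run.
const nameA = executionNameFor({ source: "loans", type: "submitted", id: "app/123" });
const nameB = executionNameFor({ source: "loans", type: "submitted", id: "app/123" });
console.log(nameA, nameA === nameB);
```

Sanitizing can map two distinct IDs to the same name (for example "app/123" and "app_123"), so in a real system you would pick identifiers that are already safe, or hash them.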

If duplicates come from retries or timeouts within a step, use AT_MOST_ONCE_PER_RETRY and external idempotency tokens. This will avoid having your steps accidentally trigger the same behavior twice.
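
The external idempotency token half of this can be sketched with an in-memory stand-in for a downstream API. makePaymentService and the key format are hypothetical; the point is only that a retried step sends the same key and the downstream system deduplicates on it.

```javascript
// Stand-in for a downstream API that honors idempotency keys (hypothetical).
function makePaymentService() {
  const seen = new Map();
  return {
    charge(idempotencyKey, amount) {
      // A repeated key returns the original receipt instead of charging again.
      if (seen.has(idempotencyKey)) return seen.get(idempotencyKey);
      const receipt = { chargeId: `ch_${seen.size + 1}`, amount };
      seen.set(idempotencyKey, receipt);
      return receipt;
    },
    chargeCount: () => seen.size,
  };
}

const payments = makePaymentService();
// Build the key only from stable inputs: here, execution name plus step name.
const key = "loan-123:charge-fee";
const firstTry = payments.charge(key, 25); // imagine the response is lost in transit
const retry = payments.charge(key, 25);    // the retried step sends the same key
console.log(firstTry.chargeId === retry.chargeId, payments.chargeCount());
```

The key must not include anything that changes between attempts (timestamps, random values), otherwise each retry looks like a new request.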

If you can’t define a stable idempotency key, Lambda Durable functions will just end up being very complex, and you should consider Step Functions with explicit task tokens, or a saga pattern for event-driven architectures.

Lambda Durable Functions vs Step Functions

AWS Lambda durable functions are a Lambda execution mode plus a Durable Execution SDK that adds durable primitives (steps, waits, callbacks, etc.) to a single Lambda handler. Your code is replayed from the start on resume, and completed durable operations are skipped using stored checkpoints.

AWS Step Functions is a separate orchestration service where you define a state machine (ASL JSON) and run executions. You wire tasks to AWS services (including Lambda) and get built-in workflow history + visualization.

Execution model and guarantees

In Step Functions, your source of truth is the state machine graph (ASL). Each state transition is managed by Step Functions, which records progress and drives task scheduling. Pricing and limits are centered around state transitions / executions.

Standard workflows guarantee exactly-once workflow execution, and can run up to one year. Express workflows can run up to five minutes and guarantee at-least-once execution when started asynchronously (at-most-once when started synchronously).

For Lambda Durable Functions, your source of truth is the Lambda handler code, but durable operations create checkpoints. On resume, the SDK replays the handler from the beginning and returns stored results for previously completed durable operations. This creates an important constraint: your code must be deterministic across replays (especially any logic outside durable operations). The SDK explicitly warns about replay behavior and determinism.

Waits and human-in-the-loop

Step Functions uses patterns like Wait states or callback (“task token”) patterns. In the callback pattern, Step Functions hands you a taskToken, and an external actor later resumes the workflow by calling APIs like SendTaskSuccess (or the failure variant).

Lambda durable functions, on the other hand, use durable operations like wait() and callback primitives like createCallback() / waitForCallback(). While waiting, the invocation terminates and later resumes, and no on-demand compute is billed during the wait. External systems resume the execution via the Lambda APIs SendDurableExecutionCallbackSuccess / SendDurableExecutionCallbackFailure.

Integrations

Step Functions has native integrations that let you orchestrate many AWS services without creating Lambda functions just to call a service. You also get visual workflow debugging/history (though in my opinion the UI is pretty ugly).

With Lambda Durable Functions you can integrate with anything via AWS SDK calls inside steps (or use context.invoke() to call other Lambda functions as durable operations), which is more natural if you're more used to code. However, you’re building the control flow in code and you won’t get Step Functions’ state-machine-level service integration patterns that would let you just call a service without writing the code for that call.

Cost model

With Step Functions Standard workflows you pay per state transition (retries count too). The free tier includes 4,000 transitions/month (it does not expire), and in us-east-1 each transition costs $0.000025. With Step Functions Express workflows you pay per request plus duration (rounded up to 100 ms), priced by memory used (billed in 64 MB chunks). In us-east-1 that is $1.00 per million requests and $0.00001667 per GB-second (with lower tiers at higher GB-hours).

With Lambda Durable Functions you still pay normal Lambda requests + duration (including sub-invocations due to resume/replay), plus durable charges:

  • Durable operations: $8.00 per million operations.

  • Data written by durable operations (GB): $0.25 per GB.

  • Data retained (GB-month, prorated): $0.15 per GB-month.
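
As a back-of-the-envelope comparison, the sketch below applies the two per-unit prices quoted above to the same number of billable units. The figure of 6 units per execution is purely illustrative, and state transitions and durable operations do not map one-to-one, so treat the output as an order-of-magnitude check rather than a quote. It ignores free tiers, Lambda compute, data written, and retention.

```javascript
// Prices quoted above for us-east-1; re-check the pricing pages before relying on them.
const STANDARD_PER_TRANSITION = 0.000025;
const DURABLE_PER_OPERATION = 8 / 1_000_000;

// Orchestration charge only: excludes Lambda compute, data written, and retention.
function orchestrationCost(executions, unitsPerExecution, unitPrice) {
  return executions * unitsPerExecution * unitPrice;
}

const executions = 1_000_000; // illustrative monthly volume
const standardCost = orchestrationCost(executions, 6, STANDARD_PER_TRANSITION);
const durableCost = orchestrationCost(executions, 6, DURABLE_PER_OPERATION);
console.log(standardCost.toFixed(2), durableCost.toFixed(2));
```

With these assumptions the Standard workflow's transitions come to about $150 and the durable operations to about $48, before the durable function's extra Lambda invocations and checkpoint storage are added back in.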

Decision guidance

Use Step Functions when you need:

  • A workflow as a first-class managed artifact (ASL + visualization + execution history).

  • Broad AWS-service orchestration with less “glue code” (especially across many services).

  • Standard’s semantics and tooling for long-running orchestration (up to 1 year) with exactly-once workflow execution.

Use Lambda durable functions when you need:

  • A code-first workflow where the orchestration logic lives in the Lambda handler and you want to use durable primitives (step, wait, waitForCallback, parallel, etc.) with checkpoint+replay.

  • To wait without paying for idle compute while staying inside the Lambda programming model, and you’re willing to design for determinism/replay.

  • Therapy after dealing with Amazon States Language (ASL).

Building a Human Approval Workflow in Step Functions and Lambda Durable Functions

Let's see an example. Say our app needs to process a loan application with the following process:

  1. Risk scoring (fast, deterministic-ish compute + external call)

  2. If risk is low, approve automatically

  3. Else, request human review and pause until the reviewer approves/rejects

  4. Finalize decision (write to DB / emit event)

This example focuses on the “pause + resume via external callback” pattern because it shows the biggest mechanical difference between Step Functions and Lambda Durable Functions.

Step Functions implementation (Standard workflow + task token callback)

1) State machine definition (ASL JSON)

{
  "Comment": "Loan approval with human-in-the-loop review (Standard)",
  "StartAt": "RiskScore",
  "States": {
    "RiskScore": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${RiskScoreLambdaArn}",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "Next": "RiskDecision"
    },
    "RiskDecision": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.riskScore",
          "NumericLessThanEquals": 30,
          "Next": "AutoApprove"
        }
      ],
      "Default": "RequestHumanApproval"
    },
    "AutoApprove": {
      "Type": "Pass",
      "Parameters": {
        "decision": "APPROVE",
        "reason": "Auto-approved by risk threshold",
        "application.$": "$"
      },
      "Next": "FinalizeDecision"
    },
    "RequestHumanApproval": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "TimeoutSeconds": 86400,
      "Parameters": {
        "FunctionName": "${RequestApprovalLambdaArn}",
        "Payload": {
          "taskToken.$": "$$.Task.Token",
          "application.$": "$"
        }
      },
      "ResultPath": "$.review",
      "Next": "FinalizeDecision"
    },
    "FinalizeDecision": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "${FinalizeLambdaArn}",
        "Payload.$": "$"
      },
      "OutputPath": "$.Payload",
      "End": true
    }
  }
}

2) riskScore Lambda

// riskScore.js
export const handler = async (event) => {
  // event: { applicationId, applicant, amount, ... }
  // Step Functions does not replay your code, so nondeterminism inside this
  // Lambda (such as the timestamp below) is safe.
  const riskScore = Math.min(100, Math.max(0, (event.amount ?? 0) / 1000)); // placeholder logic

  return {
    ...event,
    riskScore,
    scoredAt: new Date().toISOString()
  };
};

3) requestApproval Lambda (sends token to reviewer channel)

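This Lambda receives taskToken and must deliver it to your approver UI/system. Step Functions then waits until something calls SendTaskSuccess or SendTaskFailure for that token. With this pattern the Lambda's own return value is discarded: the task's result is the JSON passed to SendTaskSuccess.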

// requestApproval.js
// This function is invoked by Step Functions with a taskToken.
// It should NOT call SendTaskSuccess itself unless you're auto-approving.
// It should send the token to your approval system (email, Slack, ticket, etc.)

export const handler = async (event) => {
  const { taskToken, application } = event;

  // In production: publish to SNS/SQS/EventBridge, create a ticket, notify Slack, etc.
  console.log("Approval requested", {
    applicationId: application.applicationId,
    taskTokenPreview: taskToken.slice(0, 16) + "..."
  });

  // Step Functions "waitForTaskToken" pattern keeps the state open until callback.
  // Return quickly (this Lambda call is just the submission step).
  return {
    status: "PENDING_REVIEW",
    applicationId: application.applicationId
  };
};

4) “Approver callback” Lambda (calls SendTaskSuccess / SendTaskFailure)

Mechanics:

  • It receives { taskToken, decision, notes } (e.g., from API Gateway).

  • On approve, call SendTaskSuccess with output JSON.

  • On reject, call SendTaskFailure.

// approverCallback.js
import { SFNClient, SendTaskSuccessCommand, SendTaskFailureCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});

export const handler = async (event) => {
  const { taskToken, decision, notes } = event;

  if (!taskToken) throw new Error("Missing taskToken");

  if (decision === "APPROVE") {
    const output = JSON.stringify({ decision: "APPROVE", notes: notes ?? null });
    await sfn.send(new SendTaskSuccessCommand({ taskToken, output }));
    return { ok: true };
  }

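  // Reject path: you can send structured error info. SendTaskFailure fails the
  // waiting task state, so add a Catch in the state machine if rejections
  // should still reach FinalizeDecision.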
  await sfn.send(
    new SendTaskFailureCommand({
      taskToken,
      error: "RejectedByReviewer",
      cause: notes ?? "No notes"
    })
  );
  return { ok: true };
};
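
The handler above destructures the event directly and treats anything other than "APPROVE" as a rejection. If the request arrives through an API Gateway proxy integration, the fields sit inside a JSON string in event.body instead. A minimal sketch of a normalizing helper follows; parseApprovalRequest and the "REJECT" value are hypothetical names for this example, and base64-encoded bodies are not handled.

```javascript
// Hypothetical helper: accept either a plain object (direct invoke) or an
// API Gateway proxy event whose JSON payload is a string in event.body.
function parseApprovalRequest(event) {
  const body = typeof event.body === "string" ? JSON.parse(event.body) : event;
  const { taskToken, decision, notes = null } = body;
  if (typeof taskToken !== "string" || taskToken.length === 0) {
    throw new Error("Missing taskToken");
  }
  // Reject unknown values explicitly so a typo is not treated as a rejection.
  if (decision !== "APPROVE" && decision !== "REJECT") {
    throw new Error(`Unknown decision: ${decision}`);
  }
  return { taskToken, decision, notes };
}

const parsed = parseApprovalRequest({
  body: JSON.stringify({ taskToken: "tok-abc", decision: "APPROVE" }),
});
console.log(parsed);
```

The same shape works for the durable variant by swapping taskToken for callbackId.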

Lambda durable functions implementation (single handler + SDK callback)

0) What you need in code

You wrap your handler with the durable wrapper and use DurableContext operations.

npm install @aws/durable-execution-sdk-js

1) Durable workflow handler (steps + waitForCallback)

This version:

  • Calls context.step() for risk scoring and finalization

  • Uses context.waitForCallback() to create a callback ID, run the submitter function, and then suspend the execution until the callback completes (or times out)

// durableLoanWorkflow.js
import { withDurableExecution } from "@aws/durable-execution-sdk-js";

/**
 * event: { applicationId, amount, applicant, ... }
 */
export const handler = withDurableExecution(async (event, context) => {
  // Step 1: risk score (checkpointed)
  const scored = await context.step("risk-score", async () => {
    const riskScore = Math.min(100, Math.max(0, (event.amount ?? 0) / 1000)); // placeholder
    return { ...event, riskScore, scoredAt: new Date().toISOString() };
  });

  let decision;
  if (scored.riskScore <= 30) {
    decision = { decision: "APPROVE", notes: "Auto-approved by threshold" };
  } else {
    // Step 2: human approval via callback
    // waitForCallback creates callbackId, runs submitter (send request), then waits.
    decision = await context.waitForCallback(
      "human-approval",
      async (callbackId) => {
        // In production: publish callbackId + context to SNS/SQS/EventBridge, create ticket, etc.
        console.log("Approval requested", {
          applicationId: scored.applicationId,
          callbackIdPreview: callbackId.slice(0, 16) + "..."
        });
      },
      { timeout: { hours: 24 } }
    );
  }

  // Step 3: finalize (checkpointed)
  const finalized = await context.step("finalize-decision", async () => {
    // Write to DynamoDB, emit EventBridge event, etc.
    return {
      applicationId: scored.applicationId,
      riskScore: scored.riskScore,
      ...decision,
      finalizedAt: new Date().toISOString()
    };
  });

  return finalized;
});

Note that this is a single Lambda handler that will be replayed on resume, skipping completed durable operations using checkpoint data. The callback token (“callbackId”) is completed via Lambda callback APIs (SendDurableExecutionCallbackSuccess / Failure).

2) Approver callback function (calls Lambda callback APIs)

Durable callback completion APIs:

  • Success: SendDurableExecutionCallbackSuccess takes a CallbackId (URI path parameter) and a binary Result payload.

  • Failure: SendDurableExecutionCallbackFailure takes a CallbackId and error fields like ErrorType, ErrorMessage, etc.

// durableApproverCallback.js
import {
  LambdaClient,
  SendDurableExecutionCallbackSuccessCommand,
  SendDurableExecutionCallbackFailureCommand
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({});

export const handler = async (event) => {
  const { callbackId, decision, notes } = event;
  if (!callbackId) throw new Error("Missing callbackId");

  if (decision === "APPROVE") {
    const resultObj = { decision: "APPROVE", notes: notes ?? null };
    // API expects binary payload; SDK accepts Uint8Array/Buffer for "Result"
    const resultBytes = Buffer.from(JSON.stringify(resultObj), "utf-8");

    await lambda.send(
      new SendDurableExecutionCallbackSuccessCommand({
        CallbackId: callbackId,
        Result: resultBytes
      })
    );

    return { ok: true };
  }

  await lambda.send(
    new SendDurableExecutionCallbackFailureCommand({
      CallbackId: callbackId,
      ErrorType: "RejectedByReviewer",
      ErrorMessage: notes ?? "No notes"
      // ErrorData / StackTrace optional
    })
  );

  return { ok: true };
};

Pricing for Lambda Durable Functions

These values are current as of 2026-02-21, and are for us-east-1. Always check the pricing page.

Lambda durable functions have three pricing components:

  1. Lambda compute + requests (same as normal Lambda): You pay for requests and duration, for example $0.0000133334 per GB-second on ARM plus $0.20 per million requests. The free tier includes 1M requests/month and 400,000 GB-seconds/month.

  2. Durable operations: Each durable operation (execution start, steps, waits, etc.) is metered; the SDK doc shows how operations count by operation type (e.g., Execution Started is 1 op; Step is 1 + retries; WaitForCallback is 3 + retries). You pay $8.00 per million operations.

  3. Data written (GB) and data retained (GB-month): Durable operations persist checkpoints; you pay for the data written and for retained storage over time (prorated GB-month). You pay $0.25 per GB written and $0.15 per GB-month of data retained.
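
To see how operation counts turn into a bill, the sketch below counts durable operations for the loan workflow shown earlier, using the per-type counts quoted above. The monthly volume and the 20% review rate are made-up inputs, opsForLoanExecution is a hypothetical helper, and only the operations component is estimated (no compute, data written, or retention).

```javascript
// Operation counts as quoted above (start = 1, step = 1 + retries,
// waitForCallback = 3 + retries) at $8.00 per million operations.
const PRICE_PER_OPERATION = 8 / 1_000_000;

// Durable operations for one run of the loan workflow.
function opsForLoanExecution({ needsReview, stepRetries = 0 }) {
  const start = 1;
  const riskScore = 1 + stepRetries;         // "risk-score" step
  const humanApproval = needsReview ? 3 : 0; // "human-approval" waitForCallback
  const finalize = 1 + stepRetries;          // "finalize-decision" step
  return start + riskScore + humanApproval + finalize;
}

// Hypothetical month: 100,000 applications, 20% sent to a human reviewer.
const autoOps = 80_000 * opsForLoanExecution({ needsReview: false });
const reviewOps = 20_000 * opsForLoanExecution({ needsReview: true });
const operationCharge = (autoOps + reviewOps) * PRICE_PER_OPERATION;
console.log(autoOps + reviewOps, operationCharge.toFixed(2));
```

Under these assumptions the month comes to 360,000 operations, or about $2.88 in operation charges. Retries matter: every retried step adds one more operation per attempt.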

Conclusion

Durable functions are a strong fit when you need a long-running workflow but want to stay in Lambda’s programming model.

When to use durable functions

| Situation | Use Lambda durable functions? | Prefer instead | Why |
| --- | --- | --- | --- |
| One Lambda needs to pause for hours/days (waiting on a human/external system) | Yes | - | Callbacks/waits checkpoint and resume across invocations. |
| Workflow spans many AWS services and you want orchestration outside code | Maybe | Step Functions | Service-native orchestration/control plane, integrations. |
| Event source mapping (SQS/Streams) with >15 minute processing time | Not with direct invocation | Intermediary + durable | Event source mapping target must stay within 15 minutes; durable functions time out at 15 minutes. |
| You can’t keep code deterministic across replay (frequent hotfixes, nondeterministic reads) | Probably not | Step Functions or saga | Replay requires determinism. Use versions/aliases. |
| Workflow risks hitting 3,000 ops / 100 MB written per execution | No | Re-architect steps or use Step Functions | Those are hard constraints per execution. |
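
One way to re-architect around the per-execution operation limit is to split a large batch across several executions. The sketch below is a rough illustration that takes the 3,000-operation figure from the table above at face value; chunkForOpBudget and its parameters are hypothetical, and it does not account for the data-written limit.

```javascript
const MAX_OPS_PER_EXECUTION = 3000; // limit quoted in the table above

// Split items into chunks so that each chunk's execution stays within the budget.
// opsPerItem: durable operations spent per item; overheadOps: fixed per-execution ops.
function chunkForOpBudget(items, opsPerItem, overheadOps = 1) {
  const perChunk = Math.floor((MAX_OPS_PER_EXECUTION - overheadOps) / opsPerItem);
  if (perChunk < 1) throw new Error("A single item exceeds the operation budget");
  const chunks = [];
  for (let i = 0; i < items.length; i += perChunk) {
    chunks.push(items.slice(i, i + perChunk));
  }
  return chunks;
}

const items = Array.from({ length: 5000 }, (_, i) => i);
const chunks = chunkForOpBudget(items, 2); // two durable operations per item
console.log(chunks.length, chunks[0].length);
```

Each chunk would then be started as its own durable execution, ideally with an execution name derived from the batch and chunk index so that retries do not start duplicates.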
