Complex, multi-step workflow with AWS Step Functions
Using AWS Step Functions to build a complex workflow to process images, including advanced strategies to optimize the Step Functions workflow
Imagine you're building a social network app. Users will be able to upload images to an S3 bucket, and you need to first analyze them to detect inappropriate content. If the image is safe (does not contain inappropriate content), it will be resized and stored in another S3 bucket. If the image is unsafe (contains inappropriate content), the user that uploaded it will be notified by email.
We're going to use the following AWS services:
Amazon S3: Storing the uploaded images and the processed images.
AWS Lambda: Serverless processing.
AWS Step Functions: A serverless orchestration service that lets you coordinate AWS Lambda functions and other AWS services. Step Functions is based on state machines and tasks. A state machine is a workflow, and each step in a workflow is a state. A task is a state that represents a single unit of work performed by another AWS service.
Amazon Rekognition: A service that analyzes images and identifies objects, people, text, etc. It can detect inappropriate content (which is our use case for this scenario), and it can also do highly accurate facial analysis, face comparison, and face search. Rekognition itself is not really the focus of this issue.
Amazon SES: An email-sending service by AWS. You could also use Amazon SNS.
Example of Amazon S3 triggering an AWS Step Functions workflow
Creating a Multi-Step Workflow With AWS Step Functions
Create an S3 bucket to store the user-uploaded images.
Enable S3 Event Notifications to Amazon EventBridge on that bucket.
Create a Lambda function to analyze images using Amazon Rekognition's image moderation API.
Create a Step Functions State Machine to coordinate the image analysis and processing.
In the State Machine, configure a Task to trigger the Lambda function to analyze the image using Rekognition.
Configure a Choice State to analyze the output of the Lambda function, which can be safe (the image did not contain inappropriate content) or unsafe (the image did contain inappropriate content).
Configure another Task that will be executed if the image is unsafe, which will trigger a Lambda function that will call SES to notify the user that their image was unsafe.
Configure another Task that will be executed if the image is safe, which will trigger a Lambda function that will resize the image and store it in another S3 bucket.
Create an Amazon EventBridge Rule that matches the "Object Created" events from the S3 bucket and triggers the State Machine every time an image is uploaded.
Test the workflow with sample safe and unsafe images.
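The steps above can be sketched as a state machine definition in Amazon States Language, built here as a Python dictionary so it can be printed as JSON and pasted into your IaC tool of choice. This is a minimal sketch, not a definitive implementation: the function ARNs, the state names and the bucket name are hypothetical placeholders, and it assumes the analysis Lambda returns an object with a boolean field called safe.

```python
import json

# Hypothetical placeholder ARNs; replace them with the ARNs of your own functions.
ANALYZE_FN = "arn:aws:lambda:us-east-1:111122223333:function:analyze-image"
RESIZE_FN = "arn:aws:lambda:us-east-1:111122223333:function:resize-image"
NOTIFY_FN = "arn:aws:lambda:us-east-1:111122223333:function:notify-user"


def build_definition():
    """Return the workflow as an Amazon States Language document (a dict)."""
    return {
        "Comment": "Moderate an uploaded image, then resize it or notify the user",
        "StartAt": "AnalyzeImage",
        "States": {
            # Task: call the Lambda that asks Rekognition about the image.
            # ResultPath keeps the original input and adds the result next to it.
            "AnalyzeImage": {
                "Type": "Task",
                "Resource": ANALYZE_FN,
                "ResultPath": "$.moderation",
                "Next": "IsImageSafe",
            },
            # Choice: branch on the (assumed) boolean returned by the Lambda.
            "IsImageSafe": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.moderation.safe",
                        "BooleanEquals": True,
                        "Next": "ResizeImage",
                    }
                ],
                # Anything that is not explicitly safe goes to the notification.
                "Default": "NotifyUser",
            },
            "ResizeImage": {"Type": "Task", "Resource": RESIZE_FN, "End": True},
            "NotifyUser": {"Type": "Task", "Resource": NOTIFY_FN, "End": True},
        },
    }


def upload_event_pattern(bucket_name):
    """EventBridge event pattern matching new objects in one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
    print(json.dumps(upload_event_pattern("example-uploads"), indent=2))
```

The event pattern goes on the EventBridge Rule, with the state machine as the rule's target. The whole S3 event then becomes the input of the first state.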
Understanding AWS Step Functions Workflows
What we're doing here is called orchestrating (micro)services. Every task (analyze the image, send an email, resize the image) is a service, and they need to interact in a certain order, with a certain logic. There are actually 3 ways to achieve this:
Each service calls the next one: This means you're adding on every service the responsibility of knowing who goes next in the workflow. You're coupling one service to the next one (and actually to the previous one as well, for handling rollbacks), and you're coupling every service to this specific workflow. More than that, you're adding an additional responsibility to every service. Our example is not that complex, but in real, complex workflows this will slow you down a lot.
Orchestrated services: Every service has a single responsibility (e.g. resize the image), and some external (centralized) controller stores and executes all the coordination logic, calling every service in the right order and passing around the responses. You are here. Step Functions is our Orchestrator in this case. Our example is really simple, but the main advantage of orchestrated services is that you're centralizing the definition of the workflow and making it easier to implement really complex stuff.
Choreographed services: Every service has a single responsibility (e.g. resize the image). It's subscribed to an event, and publishes an event when done. Every service only knows "If X happens, I do Y and post Z", without any knowledge of who causes X or who's listening for Z. The workflow logic is split across all services, but not tied to each service's implementation. Instead, the workflow logic emerges from watching all the services and understanding the end result of posting a message. The main advantages are that you don't depend on a centralized orchestrator and you're vendor agnostic (but that's not as important as we tend to think). They're not really harder to design than orchestrated services. The real disadvantage is that they're harder to keep track of, because no single place describes the whole workflow. Here's an example of choreographed services.
Advanced Strategies and Best Practices for AWS Step Functions
Operational Excellence
Use Infrastructure as Code: Write your workflow as code. You can use CloudFormation, CDK, SAM, Terraform or any other tool. Creating the workflow for the first time is easier to do manually, but keeping track of later changes is ridiculously hard.
Use CI/CD: Don't manually update the code. That's messy enough in monoliths, but when working with multiple services that are called in a complex order, it gets outright impossible to manage. Use a CI/CD pipeline.
Automate testing: You're already defining your workflow as code, and deploying it automatically. Write a few end-to-end tests with sample images, so you at least cover the happy paths.
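Logging and Monitoring: Send the Step Functions execution logs to CloudWatch Logs, and monitor the execution metrics in CloudWatch (failed and timed-out executions, for example). This matters even more for Express workflows, which don't keep an execution history in Step Functions and rely on CloudWatch Logs instead. Also, enable AWS X-Ray tracing to follow a request across the workflow and the Lambda functions.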
Security
Use IAM roles for Lambda: As always, minimum permissions. Use IAM roles for your Lambda functions. Don't use the same role for all functions though: The function that just calls Rekognition doesn't need to write to S3!!
Use IAM roles for Step Functions: Step Functions should also have minimum permissions. You can achieve that with IAM roles for Step Functions.
Restrict who can upload images: Restrict access to the S3 bucket for uploaded images using presigned URLs.
Restrict who can read and write images: Limit what IAM roles can read from the uploaded images bucket and write to the processed images bucket. Hint: These should be your Lambdas' roles.
If using CloudFront, use OAC: Origin Access Control lets you have CloudFront read from a private S3 bucket. That way, users can only access the images through CloudFront.
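To make "one role per function" concrete, here's a minimal sketch of the IAM policy documents each Lambda role could carry, built as Python dictionaries. The bucket names are hypothetical, and the exact actions assume the analysis function calls Rekognition's DetectModerationLabels with a reference to the S3 object (Rekognition reads the object with the caller's permissions, so that role needs read access to the uploads bucket).

```python
import json

# Hypothetical bucket names, for illustration only.
UPLOADS_BUCKET = "example-uploads"
PROCESSED_BUCKET = "example-processed"


def allow(actions, resources):
    """Build a single Allow statement."""
    return {"Effect": "Allow", "Action": list(actions), "Resource": list(resources)}


def policy(*statements):
    """Wrap statements in an IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


uploads_objects = f"arn:aws:s3:::{UPLOADS_BUCKET}/*"
processed_objects = f"arn:aws:s3:::{PROCESSED_BUCKET}/*"

# Analysis function: call Rekognition and read the uploaded image. No writes.
analyze_policy = policy(
    allow(["rekognition:DetectModerationLabels"], ["*"]),
    allow(["s3:GetObject"], [uploads_objects]),
)

# Resize function: read the original, write the resized copy.
resize_policy = policy(
    allow(["s3:GetObject"], [uploads_objects]),
    allow(["s3:PutObject"], [processed_objects]),
)

# Notification function: send email only. Narrow the resource to your
# verified SES identity ARN instead of "*" when you know it.
notify_policy = policy(allow(["ses:SendEmail"], ["*"]))

if __name__ == "__main__":
    for name, doc in [
        ("analyze", analyze_policy),
        ("resize", resize_policy),
        ("notify", notify_policy),
    ]:
        print(name, json.dumps(doc, indent=2))
```

The Step Functions role follows the same idea: it only needs lambda:InvokeFunction on these three functions, plus whatever logging permissions you enable.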
Reliability
Implement error handling and retry logic: Step Functions can handle errors and retries, for example for Lambda functions. Don't just design for the happy path, ensure that the workflow can recover from failures.
Pick an Async Express workflow: There are sync and async Express workflows. Sync workflows are executed at most once, Async workflows are executed at least once. You're dealing with async events from a caller you don't control (S3 Events), so you can't make the caller wait and retry for you. Async is the correct choice here. For reference, Standard workflows (see the Cost Optimization section) are executed exactly once.
Make your steps idempotent: Idempotency means the final result is the same whether you call the function once or N times. Since async workflows are called at least once, you want to make sure calling the same Lambda twice for the same image doesn't result in data corruption.
Performance Efficiency
Optimize the Lambda functions: Check out 20 Advanced Tips for AWS Lambda.
Use Amazon CloudFront to serve the images: CloudFront is a CDN. Basically, it stores the images in a cache near the user (there's lots of locations around the world), and serves requests from there. Faster and cheaper!
Consider compressing images before uploading: This one's a clear tradeoff. On one hand, uploading will be faster. On the other hand, you'll need to uncompress the images to process them (and pay for that extra processing time). Uncompressing can be added easily as an additional Task in your Step Functions State Machine. Faster uploads for a better user experience, at a higher cost. Use this if you expect users to upload from slow networks such as 4G and the rest of the app works really well. If the rest of the app is slow, start optimizing there. If users typically upload from a 300 Mbps wifi connection, they won't even notice the improvement.
Send the S3 object reference, not the image: We're talking about a big image (which is why we want to compress it!). That's a huge payload for Step Functions, which limits the input and output of each state to 256 KB. Instead of sending the image itself, send the S3 object reference (the ARN, or the bucket name and key) and let each step read the image from S3.
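Here's a minimal sketch of the analysis Lambda working from a reference instead of the image bytes. It assumes the state input is the S3 "Object Created" event delivered by EventBridge, and it hands Rekognition the bucket and key so the image never travels through Step Functions. The function names, the 80% confidence threshold and the shape of the return value are assumptions for illustration. The Rekognition client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
def image_reference(event):
    """Extract the bucket and key from an EventBridge S3 'Object Created' event."""
    detail = event["detail"]
    return {"Bucket": detail["bucket"]["name"], "Name": detail["object"]["key"]}


def analyze(event, rekognition, min_confidence=80.0):
    """Ask Rekognition to moderate the image, passing only its S3 location."""
    response = rekognition.detect_moderation_labels(
        Image={"S3Object": image_reference(event)},
        MinConfidence=min_confidence,
    )
    labels = [label["Name"] for label in response["ModerationLabels"]]
    # Small result: a flag and a list of label names, well under the payload limit.
    return {"safe": not labels, "labels": labels}


_rekognition = None  # created once and reused across warm invocations


def handler(event, context):
    """Lambda entry point. boto3 is imported lazily, only inside Lambda."""
    global _rekognition
    if _rekognition is None:
        import boto3

        _rekognition = boto3.client("rekognition")
    return analyze(event, _rekognition)
```

The resize step can follow the same pattern: read the bucket and key from its input, download the object, and return only the key of the resized copy.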
Cost Optimization
Use an Express Workflow: There are Standard and Express workflows in Step Functions. Standard is for long-running workflows (up to one year) and is billed per state transition. Express is for high-throughput, short workflows and is billed per request and duration, which is usually cheaper for a workload like this one. The maximum runtime for Express workflows is 5 minutes (enough for our example). If you need more, you can nest an Express workflow inside a Standard one.
Transition infrequently accessed objects to S3 Standard-IA: In the scenario section I mentioned a social network. How often are old images accessed in a social network? You can set a lifecycle rule to transition objects to S3 Standard-Infrequent Access (Standard-IA), where storage is cheaper and reads are more expensive. If you did the math right, you get a lower average cost. The math: As a rule of thumb, if objects are accessed less than once a month, it's cheaper. And if you can't find any obvious patterns, you can use S3 Intelligent-Tiering.
Set up provisioned concurrency for Lambdas: If you know you're going to have a steady minimum of concurrent executions, setting up provisioned concurrency can save you some money, because its price per unit of execution time is lower. It only pays off if that capacity is actually used, since you pay for it whether it's invoked or not.
Get a Savings Plan: Savings Plans are upfront commitments (with optional upfront pay) for compute resources. You'd typically link them to EC2 or Fargate, but they apply to Lambdas as well!
Consider going serverful: If your workflow has a baseline of constant traffic and processing can wait a few minutes, consider replacing some Lambda functions with servers (EC2 with auto scaling, or ECS, possibly with Fargate).
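The lifecycle rule mentioned above can be sketched as the configuration document S3 expects. The rule ID, the default of 30 days and the helper name are assumptions for illustration. S3 doesn't allow transitions to Standard-IA before an object is 30 days old, so the helper rejects smaller values.

```python
def infrequent_access_lifecycle(days=30, prefix=""):
    """Build a lifecycle configuration that moves objects to Standard-IA."""
    if days < 30:
        # S3 rejects transitions to STANDARD_IA earlier than 30 days.
        raise ValueError("objects must be at least 30 days old to move to Standard-IA")
    return {
        "Rules": [
            {
                "ID": f"to-standard-ia-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to every object
                "Transitions": [{"Days": days, "StorageClass": "STANDARD_IA"}],
            }
        ]
    }


if __name__ == "__main__":
    config = infrequent_access_lifecycle(days=60, prefix="resized/")
    print(config)
    # Applying it with boto3 would look like this (bucket name is hypothetical):
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-processed", LifecycleConfiguration=config
    #   )
```

Keep in mind that Standard-IA bills a minimum object size and a minimum storage duration, so very small or short-lived images may not benefit from the transition.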