Complex, multi-step workflow with AWS Step Functions
Using AWS Step Functions to build a complex workflow to process images, including advanced strategies to optimize the Step Functions workflow
Welcome to Simple AWS! A free newsletter that helps you build on AWS without being an expert. This is issue #13, where we'll discuss a similar problem to the one from our last issue, but with more complex steps that need a different solution. Shall we?
Use case: Complex, multi-step image processing workflow with AWS Step Functions
AWS Services involved: Step Functions, Lambda, S3, Rekognition, SES
You are building a social network app. Users will be able to upload images to an S3 bucket, and you need to first analyze them to detect inappropriate content. If the image is safe (does not contain inappropriate content), it will be resized and stored in another S3 bucket. If the image is unsafe (contains inappropriate content), the user that uploaded it will be notified by email.
- S3: Storing the uploaded images and the processed images.
- Lambda: Serverless processing.
- Step Functions: It's a serverless orchestration service that lets you integrate with AWS Lambda functions and other AWS services. Step Functions is based on state machines and tasks. A state machine is a workflow. A task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state.
- Rekognition: A service that analyzes images and identifies objects, people, text, etc. It can detect inappropriate content (which is our use case for this scenario), do highly accurate facial analysis, face comparison, and face search. This is not really the focus of this issue.
- SES: An email-sending service by AWS. You could also use SNS. This is not really the focus of this issue.
- Create an S3 bucket to store the user-uploaded images.
- Enable S3 Event Notifications with Amazon EventBridge on the bucket, so uploads are published as events.
- Create a Lambda function to analyze images using Amazon Rekognition's image moderation API.
- Create a Step Functions State Machine to coordinate the image analysis and processing.
- In the State Machine, configure a Task to trigger the Lambda function to analyze the image using Rekognition.
- Configure a Choice State to analyze the output of the Lambda function, which can be safe (the image did not contain inappropriate content) or unsafe (the image did contain inappropriate content).
- Configure another Task that will be executed if the image is unsafe, which will trigger a Lambda function that will call SES to notify the user that their image was unsafe.
- Configure another Task that will be executed if the image is safe, which will trigger a Lambda function that will resize the image and store it in another S3 bucket.
- Create an Amazon EventBridge Rule on the S3 bucket to trigger the State Machine every time an image is uploaded to the S3 bucket.
- Test the workflow with sample safe and unsafe images.
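Put together, the state machine from the steps above could be sketched in Amazon States Language (here built as a Python dict). This is a minimal sketch; all ARNs, state names, and field names are placeholder assumptions:

```python
import json

# Hypothetical ASL (Amazon States Language) definition for the workflow
# described above. Account ID, function names, and state names are placeholders.
state_machine_definition = {
    "Comment": "Moderate an uploaded image, then resize it or notify the user",
    "StartAt": "AnalyzeImage",
    "States": {
        "AnalyzeImage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:analyze-image",
            "ResultPath": "$.moderation",
            "Next": "IsImageSafe",
        },
        "IsImageSafe": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.moderation.safe", "BooleanEquals": True, "Next": "ResizeImage"}
            ],
            "Default": "NotifyUser",
        },
        "ResizeImage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:resize-image",
            "End": True,
        },
        "NotifyUser": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-user",
            "End": True,
        },
    },
}

print(json.dumps(state_machine_definition, indent=2))
```

The Choice state is where the safe/unsafe branching from steps 17-19 lives: it inspects the analysis Lambda's output and routes to either the resize Task or the notification Task.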
What we're doing here is called orchestrating (micro)services. Every task (analyze the image, send an email, resize the image) is a service, and they need to interact in a certain order, with certain logic. There are actually three ways to achieve this:
- Each service calls the next one: This means you're adding to every service the responsibility of knowing which service goes next in the workflow. You're coupling each service to the next one (and actually to the previous one as well, for handling rollbacks), and you're coupling every service to this specific workflow. On top of that, you're adding an extra responsibility to every service. Our example is not that complex, but in real, complex workflows this will slow you down a lot.
- Orchestrated services: Every service has a single responsibility (e.g. resize the image), and some external (centralized) controller stores and executes all the coordination logic, calling every service in the right order and passing around the responses. You are here. Step Functions is our Orchestrator in this case. Our example is really simple, but the main advantage of orchestrated services is that you're centralizing the definition of the workflow and making it easier to implement really complex stuff.
- Choreographed services: Every service has a single responsibility (e.g. resize the image). It's subscribed to an event, and publishes an event when done. Every service only knows "If X happens, I do Y and post Z", without any knowledge of who causes X or who's listening for Z. The workflow logic is split across all services, but not tied to each service's implementation. Instead, the workflow logic emerges from watching all the services and understanding the end result of posting a message. The main advantages are that you don't depend on a centralized orchestrator and you're vendor agnostic (but that's not as important as we tend to think). They're not really harder to design than orchestrated services; the real disadvantage is that they're harder to keep track of. Our last issue was actually a ridiculously simple example of choreographed services.
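To make the orchestrated style concrete, here's a toy sketch with no AWS calls: each service is a single-purpose function, and one orchestrator owns all the ordering and branching, which is exactly the role Step Functions plays in our solution. Function names and the label-based moderation check are illustrative assumptions:

```python
# Toy illustration of orchestration: single-purpose services plus one
# central coordinator that owns the workflow logic (the Step Functions role).

def analyze(image):   # stands in for the Rekognition moderation Lambda
    return {"safe": "nsfw" not in image["labels"]}

def resize(image):    # stands in for the resize-and-store Lambda
    return {"status": "resized", "key": image["key"]}

def notify(image):    # stands in for the SES notification Lambda
    return {"status": "notified", "user": image["user"]}

def orchestrate(image):
    """All ordering and branching lives here, not in the services."""
    result = analyze(image)
    if result["safe"]:
        return resize(image)
    return notify(image)

# With no flagged labels the image is safe, so it gets resized:
print(orchestrate({"key": "cat.jpg", "user": "ana", "labels": []}))
```

Note that `analyze`, `resize`, and `notify` know nothing about each other; swap `orchestrate` for a different function and the same services support a completely different workflow.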
Best Practices - Operational Excellence
- Use IaC: Write your workflow as code. You can use CloudFormation, CDK, SAM, Terraform or any other tool. Creating the workflow for the first time is easier to do manually, but keeping track of later changes by hand is ridiculously hard.
- Use CI/CD: Don't manually update the code. That's messy enough in monoliths, but when working with multiple services that are called in a complex order, it gets outright impossible to manage. Use a CI/CD pipeline.
- Automate testing: You're already defining your workflow as code, and deploying it automatically. Write a few end-to-end tests with sample images, so you at least cover the happy paths.
- Logging and Monitoring: Set up logging and monitoring for Step Functions. Also, set up X-Ray.
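A happy-path test from the "Automate testing" point above could be sketched like this. `run_workflow` is a hypothetical helper; in a real pipeline it would call Step Functions' `StartExecution` and `DescribeExecution` APIs against a deployed state machine, but here the execution is simulated so the shape of the test is clear:

```python
# Sketch of happy-path tests for the workflow. The execution is simulated;
# run_workflow is a hypothetical helper that, in a real test, would start
# an execution of the deployed state machine and poll for its result.

def run_workflow(payload):
    # Placeholder simulation: a non-flagged image is analyzed and resized.
    safe = not payload.get("flagged", False)
    return {"status": "SUCCEEDED",
            "output": {"action": "resized" if safe else "notified"}}

def test_safe_image_is_resized():
    result = run_workflow({"bucket": "uploads", "key": "beach.jpg"})
    assert result["status"] == "SUCCEEDED"
    assert result["output"]["action"] == "resized"

def test_unsafe_image_triggers_notification():
    result = run_workflow({"bucket": "uploads", "key": "bad.jpg", "flagged": True})
    assert result["output"]["action"] == "notified"

test_safe_image_is_resized()
test_unsafe_image_triggers_notification()
print("happy-path tests passed")
```

Two tests like these, run on every deploy with one known-safe and one known-unsafe sample image, cover both branches of the Choice state.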
Best Practices - Security
- Use IAM roles for Lambda: As always, minimum permissions. Use IAM roles for your Lambda functions, and don't share one role across all functions: the function that just calls Rekognition doesn't need to write to S3!
- Use IAM roles for Step Functions: Step Functions should also have minimum permissions. You can achieve that with IAM roles for Step Functions.
- Restrict who can upload images: Restrict access to the S3 bucket for uploaded images using presigned URLs.
- Restrict who can read and write images: Limit what IAM roles can read from the uploaded images bucket and write to the processed images bucket. Hint: These should be your Lambdas' roles.
- If using CloudFront, use OAC: Origin Access Control lets you have CloudFront read from a private S3 bucket. That way, users can only access the images through CloudFront.
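To make the least-privilege points concrete, here's a sketch of a policy for the resize Lambda's role. Bucket names are placeholders for your actual buckets:

```python
import json

# Hypothetical least-privilege policy for the resize Lambda's role:
# it can only read from the uploads bucket and write to the processed
# bucket. Bucket names are placeholders.
resize_lambda_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadUploads",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-uploads-bucket/*",
        },
        {
            "Sid": "WriteProcessed",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-processed-bucket/*",
        },
    ],
}

# Note what's missing: no s3:* wildcard, no Rekognition, no SES. The
# moderation and notification Lambdas get their own, equally narrow roles.
print(json.dumps(resize_lambda_policy, indent=2))
```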
Best Practices - Reliability
- Implement error handling and retry logic: Step Functions can handle errors and retries, for example for Lambda functions. Don't just design for the happy path; make sure the workflow can recover from failures.
- Pick an Async Express workflow: There are synchronous and asynchronous Express workflows. Sync executions run at most once; async executions run at least once. You're dealing with async events here, and you don't control the caller (EventBridge delivering S3 events), so you can't implement wait-and-retry logic on the calling side. Async is the correct choice. For reference, Standard workflows (see the Cost Optimization section) run exactly once.
- Make your steps idempotent: Idempotency means the final result is the same whether you call the function once or N times. Since async workflows are called at least once, you want to make sure calling the same Lambda twice for the same image doesn't result in data corruption.
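One simple way to make the resize step idempotent: derive the output key deterministically from the input key, so a duplicate delivery overwrites the same object instead of creating a new one. A sketch with an in-memory dict standing in for the processed-images bucket:

```python
# Idempotent resize step, sketched with a dict standing in for S3.
# The output key is derived deterministically from the input key, so a
# retry overwrites the same object instead of creating a duplicate.

processed_bucket = {}  # stand-in for the processed-images S3 bucket

def resize_image(data):
    return data.lower()  # stand-in for actual image resizing

def handle(input_key, data):
    output_key = f"resized/{input_key}"   # deterministic, never random
    if output_key in processed_bucket:    # optional: skip redundant work
        return output_key
    processed_bucket[output_key] = resize_image(data)
    return output_key

# Delivering the same event twice changes nothing the second time:
handle("cat.jpg", "BYTES")
handle("cat.jpg", "BYTES")
assert processed_bucket == {"resized/cat.jpg": "bytes"}
print(len(processed_bucket))  # → 1
```

The anti-pattern to avoid is generating a random or timestamped output key per invocation; with at-least-once delivery, that turns every retry into a duplicate object.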
Best Practices - Performance Efficiency
- Optimize the Lambda functions: Go back to our past issue about Lambda for 20 tips to optimize Lambda functions.
- Use CloudFront to serve the images: CloudFront is a CDN. Basically, it stores the images in a cache near the user (there's lots of locations around the world), and serves requests from there. Faster and cheaper!
- Consider compressing images before uploading: This one's a clear tradeoff. On one hand, uploading will be faster. On the other hand, you'll need to uncompress the images to process them (and pay for that extra processing time). Uncompressing can be added easily as an additional Task in your Step Functions State Machine. Faster uploads for a better user experience, at a higher cost. Use this if you expect users to upload from slow networks such as 4G and the rest of the app works really well. If the rest of the app is slow, start optimizing there. If users typically upload from a 300 Mbps wifi connection, they won't even notice the improvement.
- Send the S3 object ARN, not the image: We're talking about a big image (hence why we want to compress it!). That's a huge payload for Step Functions. Instead of sending the image itself, send the S3 object ARN and let each step read the image from S3.
- Use an Express Workflow: There are Standard and Express workflows in Step Functions. Standard is for long-running processes; Express is for high throughput (and it's cheaper). The maximum runtime for an Express workflow is 5 minutes (enough for our example); if you need more, you can nest an Express workflow inside a Standard one.
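Here's what passing a reference instead of the image looks like: the payload stays tiny no matter how big the image is (Step Functions caps state payloads at 256 KB), and each Lambda fetches the bytes itself. Bucket and key names are placeholders:

```python
import json

# Instead of embedding image bytes in the Step Functions payload (capped
# at 256 KB), pass a reference and let each step read the object from S3.
# Bucket and key values are placeholders.
event_payload = {
    "bucket": "my-uploads-bucket",
    "key": "user123/beach.jpg",
}

serialized = json.dumps(event_payload)
print(f"payload is {len(serialized)} bytes")  # tiny, regardless of image size

# Inside each Lambda you would then fetch the object, e.g. with boto3:
#   obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
assert len(serialized) < 256 * 1024
```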
Best Practices - Cost Optimization
- Transition infrequently accessed objects to S3 Infrequent Access: In the scenario section I mentioned a social network. How often are old images accessed in a social network? You can set a lifecycle rule to transition objects to S3 Standard-IA, where storage is cheaper but reads cost extra. If you do the math right, you get a lower average cost. The math, roughly: if an object is read less than about once a month, Standard-IA comes out cheaper. And if you can't find any obvious access pattern, use S3 Intelligent-Tiering.
- Set up provisioned concurrency for Lambdas: If you know you'll always have a baseline of executions, provisioned concurrency can save you some money, since its duration rate is lower than on-demand (as long as you keep the provisioned capacity busy).
- Get a Savings Plan: Savings Plans are upfront commitments (with optional upfront pay) for compute resources. You'd typically link them to EC2 or Fargate, but they apply to Lambdas as well!
- Consider going serverful: If your workflow has a baseline of constant traffic and processing can wait a few minutes, consider replacing some Lambda functions with servers (EC2 with auto scaling, or ECS, possibly with Fargate).
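The lifecycle rule from the Infrequent Access tip above could be sketched like this. The `resized/` prefix and the 90-day threshold are assumptions; pick yours from your actual access patterns (Standard-IA transitions require at least 30 days):

```python
import json

# Hypothetical lifecycle configuration: move processed images to
# Standard-IA after 90 days. In practice you'd apply it with boto3's
# put_bucket_lifecycle_configuration; prefix and threshold are assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-images",
            "Status": "Enabled",
            "Filter": {"Prefix": "resized/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"}
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```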
There's a lot to unpack for this issue, both on orchestrating services as a general topic, and on Step Functions as an AWS service. For orchestrating services, check this blog post on the Saga pattern. For Step Functions, check The AWS Step Functions Workshop.
Workflow Studio is a low-code visual workflow designer for Step Functions. As you create a workflow, Workflow Studio validates your work and auto-generates code, which you can export to CloudFormation. Basically, a visual generator of CloudFormation code for Step Functions!
And while we're discussing visual generators, let's not forget about Application Composer (it's still in Preview). Have you tried it so far?
If getting AWS Certified is among your new year's resolutions, let me recommend Adrian Cantrill's courses. With their mix of theory and practice, they're the best I've seen. I've literally bought them all (haven't watched them all yet). <-- This recommendation contains affiliate links.
As you're probably aware, I offer consulting services in 2 ways: a 1-hour session to tackle anything that's bugging you about your AWS architecture, and a complete architecture design (or redesign). In addition to that, for Simple AWS subscribers, I'm offering 30-minute sessions, for free.
Some of the above resources are paid promotions or contain affiliate links. I only recommend resources I've tried for myself and found actually useful, regardless of whether I get paid for it or not.
I didn't really talk about Rekognition and SES, though I featured them as part of the solution. I only included them for completeness, but the real focus was Step Functions. Let me know if you want a future issue to deal with the actual image processing part (Rekognition can do a lot of cool stuff), with notifying users in different ways (SES, SNS and the different use cases), or with something else entirely. A deeper dive into microservices would be another interesting topic. Or we could get into Kubernetes.
What should be the topic for our next issue?
Thank you for reading! See ya in the next issue.