Architecting with AWS Lambda: Architecture Design
Part 1 of this series, Architecting with AWS Lambda: Architecture Concerns, dealt with architectural concerns regarding serverless in general and AWS Lambda in particular, but it was all theory (even if it's theory based on experience). Part 2 (this one) will deal with mostly the same concerns, but from a practical perspective. We're going to architect an application! I'll take you through the process of architecting a serverless solution, including making a few architecture decisions. Of course, we'll use Amazon Web Services.
Serverless Example: Blurring Faces in Photos
The application we're going to architect is a service that blurs faces of people in images uploaded to Amazon S3. Let's detail the requirements, so we're all talking about the same thing.
User accesses a website.
User uploads a photo.
Application displays the photo with all people's faces blurred.
The application is a website.
The application is public; Users don't need to register.
The face-blurring process needs to take a maximum of 1 minute.
Traffic is expected to fluctuate from 10 to 10,000 photos per minute.
Non-blurred photos can't be public.
Non-blurred photos don't need to be stored.
Blurred photos need to be publicly accessible via a URL.
Blurred photos need to be kept for 1 month.
A User doesn't need to view/list their photos.
I'm sure most of you already thought of a solution. That's awesome! But let me share with you my thought process, and how I arrive at my solution (which isn't necessarily better than yours).
Btw, I'm not going to dive deep into how to blur faces. Sorry if you were hoping for that.
Technical Constraints In Our Architecture
Let's begin by analyzing the functional requirements and translating them to technical constraints, which will limit the decisions that we can make:
The application is a website. -> Static website on S3 + CloudFront.
The application is public; Users don't need to register. -> No user management.
The face-blurring process needs to take a maximum of 1 minute. -> Can't do batch processing.
Traffic is expected to fluctuate from 10 to 10,000 photos per minute. -> Need to scale up and down really fast.
Non-blurred photos can't be public. -> Private S3 bucket.
Non-blurred photos don't need to be stored. -> Face-blurring service deletes the photo.
Blurred photos need to be publicly accessible via a URL. -> Public S3 bucket or CloudFront.
Blurred photos need to be kept for 1 month. -> S3 Lifecycle Rule to delete photos after 1 month.
A User doesn't need to view/list their photos. -> No user management, User IDs, or additional metadata on the photos.
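The one-month retention constraint maps almost directly to configuration. Here's a minimal sketch of that lifecycle rule, written as a plain object in the shape of the LifecycleConfiguration parameter of S3's PutBucketLifecycleConfiguration API. The rule ID is made up, and I'm assuming "1 month" can be approximated as 30 days.

```javascript
// Builds an S3 lifecycle configuration that expires every object after `days` days.
// Assumptions: the rule ID is hypothetical, and "1 month" is approximated as 30 days.
function buildExpirationLifecycle(days = 30) {
  return {
    Rules: [
      {
        ID: "expire-blurred-photos", // hypothetical rule name
        Status: "Enabled",
        Filter: { Prefix: "" }, // empty prefix: the rule applies to the whole bucket
        Expiration: { Days: days },
      },
    ],
  };
}

console.log(JSON.stringify(buildExpirationLifecycle(), null, 2));
```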
Making Architecture Decisions
The Obvious Parts
Let's start with the obvious: A web frontend on S3 + CloudFront, no user management, users upload photos to an S3 bucket (we use pre-signed URLs for that), and photos end up in the blurred photos S3 bucket. I'm saying "blurred photos S3 bucket" to refer to the bucket where the photos with the blurred faces will end up, which will in turn sit behind a CloudFront distribution. Some things, like using Origin Access Control to let CloudFront access that bucket, are implementation details, not architecture decisions.
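To make the pre-signed URL part concrete, here's a rough sketch of what the frontend side of the upload could look like. It assumes something else has already generated a pre-signed URL for a PUT on the uploads bucket (how that happens is out of scope here). The function name uploadPhoto and the injectable fetchImpl parameter are mine, added only so the sketch can run outside a browser.

```javascript
// Uploads a photo straight to S3 using a pre-signed URL.
// Assumptions: `presignedUrl` was generated elsewhere for a PUT on the uploads
// bucket, and `file` is a File/Blob-like object with a `type` property.
// If the URL was signed with a Content-Type, the header sent here must match it.
async function uploadPhoto(presignedUrl, file, fetchImpl = fetch) {
  const response = await fetchImpl(presignedUrl, {
    method: "PUT",
    headers: { "Content-Type": file.type },
    body: file,
  });
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
}
```

In a browser you'd call it as uploadPhoto(url, fileInput.files[0]) and let the default fetch do the work.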
Triggering The Process
Moving on to a slightly more advanced aspect, we have two options to trigger the face-blurring process. The first one is more traditional: The Frontend calls a backend service in a synchronous call, the backend (I'm intentionally not tying this to one service) blurs the faces, uploads the image to the blurred photos S3 bucket and returns the URL, which the Frontend displays.
The other option is to use Amazon S3 Event Notifications, which can invoke a Lambda function directly (or publish an event to Amazon EventBridge, SNS or SQS) when an object is uploaded to an S3 bucket. It makes the request asynchronous, so we'll need either a way for our backend to talk to our frontend (e.g. WebSockets) or a different user experience such as emailing the user the link.
To choose the best option, we'll need to clarify requirements a bit. I'll continue making them up for this example:
The face-blurring process takes 30 seconds. -> This constraint sets the expectations of the user. We can leave them waiting on our website, or we can send them an email when the process finishes. Since our process is inherently long and the user doesn't expect an immediate response, both options are OK. Note: This is a completely made-up assumption. I have no idea how long blurring faces actually takes, but let's assume it takes 30 seconds and the user expects that delay.
Since there seems to be no difference between having the user wait on our website and having them receive the result via email, I'll pick the second option: Trigger the process with S3 Event Notifications, which makes the call asynchronous.
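For a feel of what the asynchronous trigger hands to our code, here's a small sketch that pulls the bucket and key out of an S3 event. I'm assuming the notification is delivered straight to the Lambda function (events routed through EventBridge have a different shape). The helper name extractUploadedObjects and the sample bucket name are made up.

```javascript
// Extracts bucket/key pairs from an S3 event notification delivered directly to Lambda.
// Assumption: direct S3 -> Lambda delivery; EventBridge events are shaped differently.
function extractUploadedObjects(event) {
  return (event.Records ?? [])
    .filter((record) => record.eventName.startsWith("ObjectCreated:"))
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // S3 URL-encodes object keys in notifications and encodes spaces as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
    }));
}

// Trimmed-down sample event with a hypothetical bucket name.
const sampleEvent = {
  Records: [
    {
      eventName: "ObjectCreated:Put",
      s3: { bucket: { name: "uploads-bucket" }, object: { key: "my+photo%281%29.jpg" } },
    },
  ],
};
console.log(extractUploadedObjects(sampleEvent));
```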
Deciding How Many Services
Conceptually, the face-blurring process involves two actions: detecting faces and blurring them. Depending on the specifics of how to do that, it could be handled by two separate modules, or the same module. We could call an external service for any one of those actions, for each of them, or resolve both actions with just an API call (if we can find one that does it).
Let's assume we run them in separate Lambda functions. Since we're doing asynchronous calls, making the call from detectFaces to blurFaces synchronous doesn't make sense. We have two options: orchestrate the calls with an orchestrator such as AWS Step Functions, or choreograph the calls via events published to Amazon Simple Notification Service (SNS).
Architecting Module Calls
Step Functions just adds complexity here, and with the calls being so simple and there being no retry or rollback logic whatsoever, there's no value in using it. SNS is simpler, so it sounds like the better option. However, since it's just one module calling the other one, with no retries, no rollbacks, and no expectations of this solution growing beyond this, the simplest solution is to just run both modules in the same AWS Lambda function.
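Running both modules in the same function means the "call" from one to the other is just a function call. Here's a minimal sketch of that idea. The two steps are passed in as parameters so the example doesn't depend on any SDK, and their signatures are my assumption, not a prescribed interface.

```javascript
// Both steps live in one Lambda function; composing them is a plain function call.
// Assumptions: detectFaces(image) resolves to an array of face regions, and
// blurFaces(image, faces) resolves to the blurred image. Both signatures are hypothetical.
function createBlurPipeline({ detectFaces, blurFaces }) {
  return async function blurPhoto(image) {
    const faces = await detectFaces(image);
    if (faces.length === 0) return image; // no faces found, nothing to blur
    return blurFaces(image, faces);
  };
}

// Usage with stand-in implementations instead of real image processing.
const blurPhoto = createBlurPipeline({
  detectFaces: async () => [{ Left: 0.1, Top: 0.1, Width: 0.2, Height: 0.2 }],
  blurFaces: async (image, faces) => `${image} with ${faces.length} face(s) blurred`,
});
blurPhoto("photo.jpg").then(console.log);
```

If the two modules ever need independent retries or scaling, this is the seam where an orchestrator or events would go.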
By the way, did you notice I sneaked in three requirements?
The process can fail, and the user doesn't need to be alerted. -> No retry logic, no transaction status, no notification upon failure.
There's no need to undo anything upon a failure. -> No rollbacks.
The solution isn't expected to change. -> No future requirements that are known at the moment.
When I say I sneaked them in, I don't mean me as the author of this article, I mean me as the architect of this solution. And when I say requirements I actually mean constraints. It should be obvious that in a real system we should go back to the requirements gathering process and validate those assumptions. What's probably not obvious is the fact that our design is introducing those constraints: If we don't catch the constraints that we're introducing, we won't even think about going back to ask those questions, and we'll find out late, when things are more expensive to change.
Uploading The Result
Once we have our blurred photo, we need to upload it to the blurred photos bucket. This is an implementation detail, and should be easy enough, so I won't dwell on it from an architecture perspective. Obviously the Lambda function needs an IAM Role with permissions to write to that S3 bucket (don't use credentials in your code!), but even that is an implementation detail, though it's good to have a reminder.
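As that reminder, here's roughly what the S3 part of the function's IAM policy could look like, built as a plain object. The bucket names are placeholders, and I'm leaving out the Rekognition, SES and logging permissions the function would also need.

```javascript
// Builds a least-privilege IAM policy document for the function's S3 access.
// Assumptions: bucket names are hypothetical; Rekognition, SES and CloudWatch Logs
// permissions are intentionally omitted from this sketch.
function buildS3Policy(uploadsBucket, blurredBucket) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:DeleteObject"], // read the original, then delete it
        Resource: `arn:aws:s3:::${uploadsBucket}/*`,
      },
      {
        Effect: "Allow",
        Action: ["s3:PutObject"], // write the blurred result
        Resource: `arn:aws:s3:::${blurredBucket}/*`,
      },
    ],
  };
}

console.log(JSON.stringify(buildS3Policy("uploads-bucket", "blurred-photos-bucket"), null, 2));
```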
Deleting The Original Photo
After the blurred photo is uploaded, we need to delete the original photo from the uploads bucket. The code to delete it is also an implementation detail, but this presents another architectural constraint (or rather another constraint that our architecture introduced): If the process fails, the original photo won't be deleted.
Naively, we could think of a try... catch... finally block, which would ensure the photo is deleted if there's an exception mid-process. However, we also need to consider the Lambda execution environment failing. The chances are slim, but non-zero, so we need to think about what would happen if some photos pile up in the uploads bucket.
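Here's a sketch of that naive approach, just to show what it does and doesn't cover. The function and parameter names (processUpload, blurAndPublish, deleteOriginal) are made up for illustration.

```javascript
// Deletes the original photo whether blurring succeeds or throws.
// Assumptions: blurAndPublish and deleteOriginal are hypothetical async helpers.
async function processUpload(photo, { blurAndPublish, deleteOriginal }) {
  try {
    return await blurAndPublish(photo);
  } finally {
    // Runs on success and on a thrown error, but NOT if the execution
    // environment itself dies (timeout, out of memory, crash).
    await deleteOriginal(photo);
  }
}
```

The comment inside finally is the whole point: the block only protects against errors our code can observe.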
Notice that I said we need to think about it, not solve it. If we sneak in yet another requirement, that the system won't be used by a ton of users (we could say 30,000 users a month if you need a number), then we can conclude that we'll end up with very, very few undeleted photos, so few that we won't even notice them. If we were talking about 30 million users a month it would be worth doing something simple like a lifecycle rule to delete old objects in the uploads bucket, but for 30,000 users a month it's not even worth it to do that. Constraint solved by intentionally deciding that it's not a problem, which is not the same as not even noticing that we introduced it.
Notifying The User
Once the process completes, we need to let the user know. This is critical, since (because of our UX design) the user has no way of knowing the URL of their blurred photo unless we give it to them. We'll solve it the easy way: Use Amazon SES to send an email to the user. We could also use WebSockets to update the frontend, but I'll leave that for another post.
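The email itself can be tiny. Here's a sketch of the parameters for the SES SendEmail operation, built as a plain object. The sender address, the wording, and how we get the user's address in the first place are all assumptions on my part.

```javascript
// Builds the parameters for the SES SendEmail operation.
// Assumptions: `from` is an address verified in SES, `to` is collected from the
// user somehow, and the subject/body wording is made up.
function buildNotificationEmail({ from, to, photoUrl }) {
  return {
    Source: from,
    Destination: { ToAddresses: [to] },
    Message: {
      Subject: { Data: "Your blurred photo is ready" },
      Body: { Text: { Data: `Your photo is available at ${photoUrl}` } },
    },
  };
}

console.log(
  buildNotificationEmail({
    from: "noreply@example.com",
    to: "user@example.com",
    photoUrl: "https://cdn.example.com/blurred/photo.jpg",
  })
);
```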
Analyzing Our Serverless Architecture
This is the architecture that we ended up with:
Frontend: A web frontend in S3+CloudFront
Uploads Bucket: An S3 bucket where the frontend uploads files using pre-signed URLs
Face-blurring Function: A Lambda function triggered when an object is uploaded to that bucket, which has two modules (.js files): detectFaces calls Rekognition to get the coordinates of the faces, blurFaces blurs them. The function then uploads the blurred photo to another S3 bucket, deletes the original photo in the uploads bucket, and notifies the user via email using SES.
Rekognition: to get the coordinates of the faces
SES: to notify the user
Blurred Photos Bucket: An S3 bucket with the blurred photos, served through CloudFront
Here's a picture:
Architecture diagram of a serverless solution to blur faces
And since the process of going from uploaded photo to blurred photo needs some explanation, this other diagram is valuable:
Sequence diagram of a serverless solution to blur faces
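One detail that connects detectFaces to blurFaces: Rekognition's DetectFaces returns each face's BoundingBox as ratios of the image size (Left, Top, Width, Height), not pixels. Here's a small sketch of the conversion blurFaces would probably need. The helper name toPixelBox and the clamping to the image edges are my own choices.

```javascript
// Converts a Rekognition BoundingBox (ratios of the image size) into a pixel region.
// Assumption: regions are clamped to the image edges, since a blur can't extend past them.
function toPixelBox(boundingBox, imageWidth, imageHeight) {
  const clamp = (value, max) => Math.min(Math.max(Math.round(value), 0), max);
  const left = clamp(boundingBox.Left * imageWidth, imageWidth);
  const top = clamp(boundingBox.Top * imageHeight, imageHeight);
  const right = clamp((boundingBox.Left + boundingBox.Width) * imageWidth, imageWidth);
  const bottom = clamp((boundingBox.Top + boundingBox.Height) * imageHeight, imageHeight);
  return { left, top, width: right - left, height: bottom - top };
}

console.log(toPixelBox({ Left: 0.25, Top: 0.1, Width: 0.5, Height: 0.2 }, 1000, 500));
```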
The Role of the Architect
If you pull up a definition of software architecture, you could argue that some of the decisions we've made aren't architecture decisions but design decisions. And you would win that argument! For example, deciding to put the detectFaces and blurFaces code in separate modules is 100% software design. Using SES is not even that, it's an implementation detail, and a pure architecture would just say "some email service" (actually, who's to say SES isn't an acronym for Some Email Service?).
You can use a dictionary to draw the line between architecture and design, or you can use common sense. In this specific (and entirely imaginary) case where I'm in the role of architecting, designing and probably even implementing the system, common sense dictates that I should be making and documenting these decisions, because I'm the smartest person in this project (by virtue of being the only person in this project).
If you're a senior engineer or tech lead in a team/org with no architects (really common in startups and small companies), you'll likely find yourself making decisions like these. If you're an architect, you might want to tread a bit more carefully, since the lines might not be as blurry, and you might be stepping into someone else's responsibilities. Ideally you'd just communicate and collaborate, since building software is a team effort. However, I think it's worth acknowledging that I've gone beyond just architecture in this post.
I wanted to dive a bit into implementation and best practices, but I think this post is already long enough. I had originally planned this as a two-part series, but on a whim I opted to dive deeper into architecture decisions here, instead of just picking a solution and implementing it. In part 3 we'll dive into how to get started with this solution's implementation, focusing on best practices.