Simple AWS: Handling Data at Scale with DynamoDB
19 steps to optimize DynamoDB for lower costs and better performance at any scale.
Welcome to Simple AWS! A free newsletter that helps you build on AWS without being an expert. This is issue #9, and the first one with the new format: Weekly, longer issues around a use case. I hope you like the changes! Shall we?
Use case: Storing and Querying User Profile Data at Scale
AWS Services involved: DynamoDB
Scenario: As a software company with millions of users, you need to store and query user profile data in a scalable and reliable way. You could use a relational database like MySQL or PostgreSQL, but handling that volume is expensive, and at some point you'll have scalability problems.
Services: Amazon DynamoDB is a fully managed NoSQL database. NoSQL databases are much more performant than relational databases for simple queries (and much slower for complex queries).
Some key features of DynamoDB:
Low latency: DynamoDB can handle millions of requests per second with single-digit millisecond latency.
Solution: Let's go over how to set up a DynamoDB table for user profiles, and how to create, query, update and delete user profiles.
First you need to design the table schema. Wait, schema? Didn't you say NoSQL? Yeah, but here's the catch: NoSQL doesn't mean No Schema, it means schema is not enforced by the database engine. You can put anything in a NoSQL database, but you shouldn't. You need to plan the schema for your table's use cases. A good starting point is to map out the data you need to store for each user, such as their name, email, and any other relevant information.
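As a sketch, a user profile item might look like the dict below. The attribute names here are hypothetical, not prescribed by DynamoDB; pick whatever fits your access patterns. And since the engine won't enforce the schema for you, a small client-side check is a cheap safety net:

```python
# A hypothetical user profile item, expressed as a plain Python dict.
# "userId" is the partition key; everything else is a regular attribute.
user_profile = {
    "userId": "user-123",          # partition key: unique per user
    "email": "jane@example.com",
    "name": "Jane Doe",
    "createdAt": "2023-01-01T00:00:00Z",
    "plan": "free",
}

def validate_profile(item: dict) -> bool:
    """Minimal client-side check: NoSQL won't enforce a schema for you."""
    required = {"userId", "email", "name"}
    return required.issubset(item)
```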
After that you need to set the primary key for the table. The PK is a unique identifier for each item in the table, and it is used to retrieve data from the table. You can choose either a single attribute (such as the user's email address), or you can use a composite PK consisting of two attributes (such as the user's email and a timestamp), where the first one is called partition key and the second one sort key. It's important to choose a primary key that will be unique for each user and that will be used to query the data.
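For illustration, here's what the key definition could look like in the shape of a low-level CreateTable request. The table and attribute names are assumptions for this example, using a composite primary key of userId (partition) plus createdAt (sort):

```python
# Sketch of the key-related portion of a DynamoDB CreateTable request.
create_table_params = {
    "TableName": "UserProfiles",
    "KeySchema": [
        {"AttributeName": "userId", "KeyType": "HASH"},      # partition key
        {"AttributeName": "createdAt", "KeyType": "RANGE"},  # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "userId", "AttributeType": "S"},
        {"AttributeName": "createdAt", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",  # On-Demand; see the optimization tips below
}
# With boto3 you'd pass this to: boto3.client("dynamodb").create_table(**create_table_params)
```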
You may add secondary indexes to your table, which allow you to query the data in the table using attributes other than the primary key. Global secondary indexes can be added later too; local secondary indexes can't (more on both below).
To query the profile of a specific user, use the Query API with the userId as the partition key. You can also use the sort key to further narrow down the results.
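A minimal sketch of that Query call's parameters, in the low-level API shape and assuming userId is the table's partition key:

```python
# Sketch of a Query request for one user's profile items.
def build_query_params(table_name: str, user_id: str) -> dict:
    return {
        "TableName": table_name,
        "KeyConditionExpression": "userId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
    }

params = build_query_params("UserProfiles", "user-123")
# With boto3: boto3.client("dynamodb").query(**params)
```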
To delete a user profile, use the DeleteItem API.
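Sketched the same way (table and key names assumed), a DeleteItem request looks like this. One gotcha: with a composite primary key you must supply both the partition key and the sort key to identify the item.

```python
# Sketch of a DeleteItem request against a table with a composite key.
def build_delete_params(table_name: str, user_id: str, created_at: str) -> dict:
    return {
        "TableName": table_name,
        "Key": {
            "userId": {"S": user_id},
            "createdAt": {"S": created_at},
        },
    }

params = build_delete_params("UserProfiles", "user-123", "2023-01-01T00:00:00Z")
# With boto3: boto3.client("dynamodb").delete_item(**params)
```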
How to Optimize
Always query on indexes: When you query on an index, DynamoDB reads only the items that match the query, and charges you only for those. To search on an attribute that isn't indexed, you have to use the Scan API, which reads the entire table and charges you for every single item (the filtering happens after the read).
Always filter and sort based on the sort key: You can filter and sort based on any attribute, but where it happens matters. Conditions on the sort key are evaluated by the index itself, so you only pay for the items actually read. Filter expressions on non-key attributes are applied after DynamoDB reads the items, so they reduce the data returned but not what you pay for: you're still charged for everything the query read (or, with a Scan, for every item in the table).
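To make the cost difference concrete, here's a sketch of the two request shapes (table and attribute names are assumptions): a Query that only touches one partition, versus a Scan with a filter that reads and bills the whole table before filtering.

```python
# Cheap: Query touches only the items under one partition key.
query_params = {
    "TableName": "UserProfiles",
    "KeyConditionExpression": "userId = :uid",
    "ExpressionAttributeValues": {":uid": {"S": "user-123"}},
}

# Expensive: Scan reads (and bills) EVERY item, then filters the results.
scan_params = {
    "TableName": "UserProfiles",
    "FilterExpression": "#st = :s",
    "ExpressionAttributeNames": {"#st": "status"},  # alias: "status" is a reserved word
    "ExpressionAttributeValues": {":s": {"S": "active"}},
}
```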
Set up Local Secondary Indexes: A Local Secondary Index is an index with the same Partition Key and a different Sort Key. Create them if you need to filter or sort based on other attributes. For example, you could create an LSI to filter on the "active" attribute of your users. LSIs share capacity with the table. There's no extra pricing, but DynamoDB will use additional write capacity units to update the relevant indexes. Note that LSIs can only be created when you create the table; they can't be added later.
Set up Global Secondary Indexes: A Global Secondary Index is an index with a different Partition Key and Sort Key, but with the same data (or just a subset of the data, which saves a ton of costs). Create them if you need to query based on other attributes. For example, you could create a GSI on a user's email address, so you can query based on that attribute. GSIs have separate capacity from the table. As with LSIs, GSIs don't cost extra per index, but DynamoDB will use extra write capacity units if you have indexes.
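As a sketch, adding a GSI keyed on email via an UpdateTable request could look like this (table and index names are illustrative). Projecting only the keys is the "subset of the data" trick that keeps storage and write costs down:

```python
# Sketch of an UpdateTable request that adds a GSI keyed on email,
# projecting only the keys to keep storage and write costs down.
gsi_update_params = {
    "TableName": "UserProfiles",
    "AttributeDefinitions": [{"AttributeName": "email", "AttributeType": "S"}],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "email-index",
                "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
                "Projection": {"ProjectionType": "KEYS_ONLY"},  # subset of the data
            }
        }
    ],
}
# With boto3: boto3.client("dynamodb").update_table(**gsi_update_params)
```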
Paginate results: When retrieving large amounts of data, use pagination to retrieve the data in chunks.
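The pagination loop itself is simple: keep passing the response's LastEvaluatedKey back as ExclusiveStartKey until it's absent. Sketched here against a stand-in fetch function (mimicking the response shape of a boto3 query call) so the loop logic is visible and runnable on its own:

```python
def paginate(fetch_page, params: dict) -> list:
    """Generic DynamoDB-style pagination: follow LastEvaluatedKey until exhausted.
    `fetch_page` stands in for a call like boto3's client.query(**params)."""
    items = []
    while True:
        page = fetch_page(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        params = {**params, "ExclusiveStartKey": last_key}

# Stand-in that serves 5 items in pages of 2, mimicking the API's response shape.
_data = [{"userId": {"S": f"user-{i}"}} for i in range(5)]

def fake_query(**params):
    start = params.get("ExclusiveStartKey", 0)
    resp = {"Items": _data[start:start + 2]}
    if start + 2 < len(_data):
        resp["LastEvaluatedKey"] = start + 2
    return resp

all_items = paginate(fake_query, {"TableName": "UserProfiles"})
```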
Mind read consistency: DynamoDB reads are eventually consistent by default. You can also perform strongly consistent reads, which cost 2x more (2x the per-request cost in On-Demand mode, 2x the RCUs consumed in Provisioned mode).
Use transactions when needed: Operations are atomic, but if you need to perform more than one operation atomically, you can use a transaction. Cost is 2x the regular operation.
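For example, deactivating a user and decrementing a counter in another table as one atomic unit could be sketched as a TransactWriteItems request (table, key, and attribute names are assumptions; both writes succeed or neither does):

```python
# Sketch of a TransactWriteItems request spanning two tables.
transaction_params = {
    "TransactItems": [
        {
            "Update": {
                "TableName": "UserProfiles",
                "Key": {
                    "userId": {"S": "user-123"},
                    "createdAt": {"S": "2023-01-01T00:00:00Z"},
                },
                "UpdateExpression": "SET accountStatus = :s",
                "ExpressionAttributeValues": {":s": {"S": "inactive"}},
            }
        },
        {
            "Update": {
                "TableName": "AccountStats",
                "Key": {"statId": {"S": "global"}},
                "UpdateExpression": "SET activeUsers = activeUsers - :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
}
# With boto3: boto3.client("dynamodb").transact_write_items(**transaction_params)
```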
Use Reserved Capacity: You can reserve capacity units, by paying upfront or committing to pay monthly, for 1 or 3 years.
Prefer Provisioned mode over On-Demand mode: On-Demand is easier, but over 5x more expensive (without reserved capacity). Provisioned mode scales reasonably fast; use it if your traffic doesn't spike too quickly. Also, consider adding an SQS queue in front of the table to throttle writes.
Mind the costs: You're charged for data stored and for capacity units. In Provisioned mode, one read capacity unit represents one strongly consistent read per second (or two eventually consistent reads per second) for an item up to 4 KB in size, and one write capacity unit represents one write per second for an item up to 1 KB in size. Optimize frequently: the key is to understand how the database will be used and tune it accordingly (secondary indexes, attribute projections, etc.). This requires good upfront design and ongoing effort.
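Those definitions translate directly into arithmetic. A rough sizing sketch, applying the 4 KB / 1 KB rounding rules:

```python
import math

def read_capacity_units(item_size_kb: float, reads_per_second: int,
                        strongly_consistent: bool) -> float:
    """RCUs needed: each read consumes ceil(size / 4 KB) RCUs if strongly
    consistent, or half that if eventually consistent."""
    units_per_read = math.ceil(item_size_kb / 4)
    if not strongly_consistent:
        units_per_read /= 2
    return units_per_read * reads_per_second

def write_capacity_units(item_size_kb: float, writes_per_second: int) -> float:
    """WCUs needed: each write consumes ceil(size / 1 KB) WCUs."""
    return math.ceil(item_size_kb / 1) * writes_per_second

# 100 strongly consistent reads/s of a 6 KB item -> 2 RCUs each = 200 RCUs.
rcus = read_capacity_units(6, 100, strongly_consistent=True)
# 50 writes/s of a 2.5 KB item -> 3 WCUs each = 150 WCUs.
wcus = write_capacity_units(2.5, 50)
```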
Set a TTL: Some data needs to be stored forever, but some data can be deleted after some time. You can automate this by setting a TTL on each item.
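TTL works off a numeric attribute holding a Unix epoch timestamp in seconds: once you enable TTL on that attribute, DynamoDB deletes expired items at no extra cost. A sketch of stamping an item with a 30-day expiry (the attribute name is an assumption; you'd configure it on the table via UpdateTimeToLive):

```python
import time

TTL_ATTRIBUTE = "expiresAt"  # hypothetical name, configured per table
THIRTY_DAYS = 30 * 24 * 60 * 60

def with_ttl(item: dict, now=None) -> dict:
    """Return a copy of the item stamped with an epoch-seconds expiry,
    the format DynamoDB's TTL feature expects (a Number attribute)."""
    now = time.time() if now is None else now
    return {**item, TTL_ATTRIBUTE: {"N": str(int(now + THIRTY_DAYS))}}

session_item = with_ttl({"userId": {"S": "user-123"}}, now=1_672_531_200)  # 2023-01-01 UTC
```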
Don't be afraid to use multiple databases: DynamoDB is amazing for repetitive queries with different parameters (e.g. finding a user by email address), and terrible for complex analytics (e.g. finding every user that logged in more than once in the past 24 hours). Don't be afraid to use a different database for data or use cases that don't fit DynamoDB's strengths.
For relational databases, you probably use DBeaver (if not, check it out). For DynamoDB, you use NoSQL Workbench. And if you work with DynamoDB every day and can spare $9/month, Dynobase is worth the price.
If getting AWS Certified is among your new year's resolutions, let me recommend Adrian Cantrill's courses. With their mix of theory and practice, they're the best I've seen. I've literally bought them all (haven't watched them all yet). <-- This recommendation contains affiliate links.
Some of the above recommendations are paid promotions or contain affiliate links. They are clearly marked as such. You should know I only recommend things I've tried for myself and found actually useful, regardless of whether I get paid for it or not.
It's a new year, and it's a big change. I think this format is better at helping you build on AWS without being an expert. What do you think? I created a poll, would you please fill it out?
I hope you had a great New Year's celebration, and a fantastic start to 2023!
Thank you for reading! See ya on the next issue.