Handling Data at Scale with DynamoDB
20 steps to optimize DynamoDB for lower costs and better performance at any scale.
Amazon DynamoDB is a fully managed NoSQL database. NoSQL databases typically perform much better than relational databases for simple, key-based queries, and much worse for complex ones (joins, aggregations, ad hoc filters).
Storing and Querying User Profile Data at Scale with Amazon DynamoDB
Let's take this scenario: As a software company with millions of users, you need to store and query user profile data in a scalable and reliable way. You could use a relational database like MySQL or PostgreSQL, but handling that volume is expensive, and at some point you'll have scalability problems.
Some key features of DynamoDB:
Flexible data model: Data is stored as groups of values, called items. You can retrieve the whole item, or just a few attributes.
Low latency: DynamoDB can handle millions of requests per second with single-digit millisecond latency.
Scalability: You can set Read Capacity Units and Write Capacity Units separately. You can also set them to auto-scale. Or you can just pay per request. Here's a deep dive on how DynamoDB scales.
Availability: A DynamoDB table is highly available within a region. You can also set it to replicate to other regions, with a global table.
Security: Data is encrypted at rest by default, and in transit over HTTPS. You can set fine-grained access controls down to the item and attribute level with IAM policy conditions, and audit using CloudTrail.
Streams: DynamoDB streams can trigger behavior in response to events. This makes building event-driven applications significantly easier.
Solution: Let's go over how to set up a DynamoDB table for user profiles, and how to create, query, update and delete user profiles.
First you need to design the table schema. Wait, schema? Didn't you say NoSQL? Yeah, but here's the catch: NoSQL doesn't mean No Schema, it means schema is not enforced by the database engine. You can put anything in a NoSQL database, but you shouldn't. You need to plan the schema for your table's use cases. A good starting point is to map out the data you need to store for each user, such as their name, email, and any other relevant information.
After that you need to set the primary key for the table. The primary key (PK) uniquely identifies each item in the table, and it is used to retrieve data from the table. You can choose either a single attribute, called the partition key (such as the user's email address), or a composite PK consisting of two attributes (such as the user's email and a timestamp), where the first one is the partition key and the second one is the sort key. With a composite PK, only the combination of both attributes needs to be unique. It's important to choose a primary key that uniquely identifies each user profile and that matches how you will query the data.
You may add secondary indexes to your table, which allow you to query the data in the table using attributes other than the primary key. Some of them can be added later (global secondary indexes), but local secondary indexes must be defined when you create the table.
Once you have designed the table schema, just create the table in DynamoDB.
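Here's a minimal sketch of what that table definition could look like, as the parameters for the CreateTable API. The table name UserProfiles and the userId partition key are assumptions for illustration, and the sketch uses on-demand billing just to keep it short.

```python
def build_create_table_params(table_name="UserProfiles"):
    """Build keyword arguments for the low-level CreateTable call.

    The table and attribute names are hypothetical; adjust them to your schema.
    """
    return {
        "TableName": table_name,
        # Only key attributes are declared here; every other attribute is schemaless.
        "AttributeDefinitions": [
            {"AttributeName": "userId", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "userId", "KeyType": "HASH"},  # partition key
        ],
        # On-demand billing; PROVISIONED would also need ProvisionedThroughput.
        "BillingMode": "PAY_PER_REQUEST",
    }


params = build_create_table_params()
# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   boto3.client("dynamodb").create_table(**params)
```

Building the parameters separately from the call makes it easy to review or unit test the schema before anything is created in your account.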
To store user profiles, use the PutItem API. You could also use the BatchWriteItem API to insert multiple profiles at once.
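To read the profile of a specific user, use the GetItem API with the full primary key (for example, the userId). If the table has a composite primary key, use the Query API with the userId as the partition key to get every item for that user, and add a condition on the sort key to further narrow down the results.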
To update a user profile, use the UpdateItem API to update specific attributes of a profile without having to rewrite the entire item. If you do want to rewrite the entire item, use the PutItem API.
To delete a user profile, use the DeleteItem API.
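The operations above can be sketched as request parameters for the low-level DynamoDB API. This assumes a hypothetical UserProfiles table with userId as its partition key; the displayName and email attributes are also made up for the example.

```python
TABLE = "UserProfiles"  # hypothetical table name


def put_profile_params(user_id, display_name, email):
    """PutItem: writes (or fully replaces) the whole item."""
    return {
        "TableName": TABLE,
        "Item": {
            "userId": {"S": user_id},
            "displayName": {"S": display_name},
            "email": {"S": email},
        },
    }


def get_profile_params(user_id):
    """GetItem: reads one item by its full primary key."""
    return {"TableName": TABLE, "Key": {"userId": {"S": user_id}}}


def query_profile_params(user_id):
    """Query: reads every item that shares this partition key."""
    return {
        "TableName": TABLE,
        "KeyConditionExpression": "userId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
    }


def update_email_params(user_id, new_email):
    """UpdateItem: changes one attribute without rewriting the item."""
    return {
        "TableName": TABLE,
        "Key": {"userId": {"S": user_id}},
        "UpdateExpression": "SET email = :e",
        "ExpressionAttributeValues": {":e": {"S": new_email}},
    }


def delete_profile_params(user_id):
    """DeleteItem: removes the item with this primary key."""
    return {"TableName": TABLE, "Key": {"userId": {"S": user_id}}}


# With boto3, each dict maps to one client call, for example:
#   client = boto3.client("dynamodb")
#   client.put_item(**put_profile_params("u1", "Ada", "ada@example.com"))
#   client.update_item(**update_email_params("u1", "ada@new.example.com"))
```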
How to Optimize
Always query on indexes: When you query on the table's primary key or on a secondary index, DynamoDB only reads the items that match the key condition, and only charges you for those. When you search by an attribute that isn't part of a key, DynamoDB has to scan the entire table and charges you for reading every single item (it filters them afterwards).
Use Query, not Scan: Scan reads the entire table, Query uses an index. Scan should only be used for non-indexed attributes, or to read all items. Don't mix them up.
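Don't read the whole item: Use projection expressions to define which attributes will be retrieved, and only get the data you need. This cuts the amount of data sent over the network. Keep in mind that Read Capacity Units are calculated from the size of the items DynamoDB reads, not from what it returns, so to also reduce read costs, query a secondary index that only projects the attributes you need.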
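As a rough sketch, here's how a projection expression could be built for a GetItem call. The table and attribute names are hypothetical. The #a0-style placeholders are one way to avoid clashing with DynamoDB's reserved words.

```python
def build_get_item_params(user_id, attributes):
    """Build GetItem parameters that return only the given attributes."""
    # Map placeholders (#a0, #a1, ...) to the real attribute names.
    names = {f"#a{i}": attr for i, attr in enumerate(attributes)}
    return {
        "TableName": "UserProfiles",  # hypothetical table name
        "Key": {"userId": {"S": user_id}},
        "ProjectionExpression": ", ".join(names),
        "ExpressionAttributeNames": names,
    }


params = build_get_item_params("u1", ["email", "displayName"])
# params["ProjectionExpression"] is "#a0, #a1"
# With boto3: boto3.client("dynamodb").get_item(**params)
```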
Always filter and sort based on the sort key: You can filter based on any attribute, but results are always sorted by the sort key. If you filter on the sort key in the key condition, DynamoDB uses the index and you only pay for the items read. If you filter on an attribute that's not part of the key, DynamoDB first reads every item that matches the key condition (every item under that partition key in a Query, or the whole table in a Scan), charges you for all of them, and only then applies the filter.
Set up Local Secondary Indexes: A Local Secondary Index is an index with the same Partition Key and a different Sort Key. Create them if you need to filter or sort based on other attributes. For example, you could create an LSI to filter on the "active" attribute of your users. LSIs share capacity with the table. There's no separate charge per index, but you pay for the storage the index uses, and DynamoDB will use additional write capacity units to update the relevant indexes.
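To make the difference concrete, here's a sketch of two Query requests. It assumes a hypothetical table with email as the partition key and updatedAt (an ISO 8601 string) as the sort key. The first request narrows results with the sort key; the second filters on a non-key attribute.

```python
TABLE = "UserProfiles"  # hypothetical: partition key "email", sort key "updatedAt"


def query_recent_updates(email, since_iso):
    """Sort key condition: only the matching items are read."""
    return {
        "TableName": TABLE,
        "KeyConditionExpression": "email = :e AND updatedAt >= :since",
        "ExpressionAttributeValues": {
            ":e": {"S": email},
            ":since": {"S": since_iso},
        },
        "ScanIndexForward": False,  # descending order by sort key (newest first)
    }


def query_active_only(email):
    """Filter on a non-key attribute: applied after the items are read."""
    return {
        "TableName": TABLE,
        "KeyConditionExpression": "email = :e",
        "FilterExpression": "#active = :yes",
        "ExpressionAttributeNames": {"#active": "active"},
        "ExpressionAttributeValues": {
            ":e": {"S": email},
            ":yes": {"BOOL": True},
        },
    }


# With boto3: boto3.client("dynamodb").query(**query_recent_updates(...))
```

Both requests return the same shape of response, so the cost difference is easy to miss; passing ReturnConsumedCapacity="TOTAL" in the request is one way to see what each one actually consumed.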
Set up Global Secondary Indexes: A Global Secondary Index is an index with a different Partition Key and Sort Key, but with the same data, or just a subset of the data. Create them if you need to query based on other attributes. For example, you could create a GSI on a user's email address, so you can query based on that attribute. GSIs have separate capacity from the table. As with LSIs, GSIs don't cost extra per index, but DynamoDB will use extra write capacity units if you have indexes.
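A sketch of the email example: one function builds the index definition, the other builds a Query against it. The index name email-index and the projected displayName attribute are assumptions for illustration.

```python
INDEX_NAME = "email-index"  # hypothetical index name


def email_index_definition():
    """One entry for the GlobalSecondaryIndexes list of CreateTable."""
    return {
        "IndexName": INDEX_NAME,
        "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
        # Project only a subset of attributes to keep the index small.
        "Projection": {
            "ProjectionType": "INCLUDE",
            "NonKeyAttributes": ["displayName"],
        },
        # On a provisioned table, the index also needs ProvisionedThroughput.
    }


def query_by_email_params(email):
    """Query the index instead of the table's primary key."""
    return {
        "TableName": "UserProfiles",  # hypothetical table name
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "email = :e",
        "ExpressionAttributeValues": {":e": {"S": email}},
    }


# "email" must also appear in the table's AttributeDefinitions.
# With boto3: boto3.client("dynamodb").query(**query_by_email_params("ada@example.com"))
```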
Paginate results: When retrieving large amounts of data, use pagination to retrieve the data in chunks.
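Pagination in DynamoDB works by passing the LastEvaluatedKey of one response as the ExclusiveStartKey of the next request. Here's a minimal sketch of that loop. The query_page argument stands in for a real call such as boto3's client.query, and fake_query is a made-up stub so the example runs without AWS access.

```python
def iter_items(query_page, params):
    """Yield items from every page, following LastEvaluatedKey."""
    params = dict(params)  # copy so the caller's dict isn't modified
    while True:
        page = query_page(**params)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        params["ExclusiveStartKey"] = last_key


def fake_query(**kwargs):
    """Stand-in for client.query: serves 5 items, 2 per page."""
    start = kwargs.get("ExclusiveStartKey", {"n": 0})["n"]
    page = {"Items": [{"n": i} for i in range(start, min(start + 2, 5))]}
    if start + 2 < 5:
        page["LastEvaluatedKey"] = {"n": start + 2}
    return page


items = list(iter_items(fake_query, {"TableName": "UserProfiles"}))
# items holds all 5 items, collected across 3 pages
```

Adding a Limit to the request parameters caps how many items each page evaluates, which is how you control the chunk size.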
Use caching: DynamoDB is usually fast enough (if it's not, use DynamoDB Accelerator (DAX)). However, ElastiCache can be cheaper for data that's updated infrequently.
Mind read consistency: DynamoDB reads are eventually consistent by default. You can also perform strongly consistent reads, which cost 2x more (2x cost in On-Demand mode, 2x more RCUs used in Provisioned mode).
Use transactions when needed: Individual operations are atomic, but if you need several operations to succeed or fail together, you can use a transaction. Cost is 2x the regular operation.
Use Reserved Capacity: You can reserve capacity units, by paying upfront or committing to pay monthly, for 1 or 3 years.
Prefer Provisioned mode over On-Demand mode: On-Demand is easier, but for steady, predictable traffic it can be several times more expensive than well-utilized Provisioned capacity (even without reserved capacity). Understand how DynamoDB scales in On-Demand and Provisioned mode. Also, consider adding an SQS queue to throttle writes.
Monitor and optimize: You're not gonna get it right the first time (because requirements change). Monitor usage with CloudWatch, and optimize schema and queries as needed. Remember secondary indexes.
Mind the costs: You're charged per data stored and per capacity units. For Provisioned mode, one read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB in size; and one write capacity unit represents one write per second for an item up to 1 KB in size.
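The capacity math above is easy to get wrong by a factor of two, so here's a small calculator sketch. It assumes 1 KB = 1024 bytes and that every item has the same size. The function names are made up for this example.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=False):
    """RCUs needed: item size rounds up to the next 4 KB per read."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half
    return math.ceil(total)


def write_capacity_units(item_size_bytes, writes_per_second):
    """WCUs needed: item size rounds up to the next 1 KB per write."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


# 6 KB items, 100 reads per second
print(read_capacity_units(6144, 100, strongly_consistent=True))  # 200
print(read_capacity_units(6144, 100))                            # 100
# 2.5 KB items, 50 writes per second
print(write_capacity_units(2560, 50))                            # 150
```

Note the rounding: a 6 KB item costs as much to read as an 8 KB one, so keeping items just under a 4 KB (read) or 1 KB (write) boundary can make a real difference.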
Optimize frequently: The key here is to understand how the database will be used and tune it accordingly (set secondary indexes, attribute projections, etc). This requires good upfront design and ongoing efforts.
Use Standard-IA tables: For most workloads, a standard table is the best choice. But for tables that are accessed infrequently and where storage is the main cost, use the Standard-IA table class, which has cheaper storage and more expensive reads and writes.
Back up your data: You can set up scheduled backups, or on-demand backups.
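Set a TTL: Some data needs to be stored forever, but some data can be deleted after some time. You can automate this by enabling TTL on the table and storing an expiration timestamp (in epoch seconds) on each item that should expire. DynamoDB deletes expired items in the background without consuming write capacity, though not at the exact moment they expire.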
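Here's a sketch of the two pieces involved: turning TTL on for the table, and computing the expiration value to store on an item. The attribute name expiresAt and the 30-day retention are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

TTL_ATTRIBUTE = "expiresAt"  # hypothetical attribute name


def enable_ttl_params(table_name):
    """Parameters for the UpdateTimeToLive API (set once per table)."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {
            "Enabled": True,
            "AttributeName": TTL_ATTRIBUTE,
        },
    }


def ttl_epoch_seconds(now, days):
    """Expiration time as epoch seconds, the format TTL expects."""
    return int((now + timedelta(days=days)).timestamp())


def ttl_attribute(now, days):
    """The attribute to add to an item, as a DynamoDB number."""
    return {TTL_ATTRIBUTE: {"N": str(ttl_epoch_seconds(now, days))}}


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = ttl_epoch_seconds(now, 30)  # 30 days after `now`
# With boto3: boto3.client("dynamodb").update_time_to_live(**enable_ttl_params("UserProfiles"))
```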
Design partition keys carefully: Good design is extremely important in every database, even in NoSQL databases. Pick your partition key carefully so that load is split across partitions.
Don't be afraid to use multiple databases: DynamoDB is amazing for repetitive queries with different parameters (e.g. finding a user by email address), and terrible for complex analytics (e.g. finding every user that logged in more than once in the past 24 hours). Don't be afraid to use a different database for data or use cases that don't fit DynamoDB's strengths.
Recommended Tools and Resources
Check out this introductory workshop and this advanced workshop.
For relational databases, you probably use DBeaver (if not, check it out). For DynamoDB, you use NoSQL Workbench. And if you work with DynamoDB every day and can spare $9/month, Dynobase is worth the price.
Want to test locally? Use DynamoDB Local.
If getting AWS Certified is among your new year's resolutions, let me recommend Adrian Cantrill's courses. With their mix of theory and practice, they're the best I've seen. I've literally bought them all.