Using ElastiCache (Redis) to cache queries to RDS
Setting up a cache-aside strategy, and the implications beyond just cost
You're running an app that connects to a Postgres database, and your database charges are through the clouds (pun intended). When analyzing the most common queries, you notice that 98% of the database traffic is the same query: SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;
You'd like to implement a way to cache the result of that query for 10 minutes, so it doesn't have to get re-calculated every single time. You thought about storing it in shared memory or on the instance's disk, but you're going to make this scalable and possibly serverless at some point, and you'd like a solution that can handle those cases as well.
We're going to use the following AWS services:
EC2: I'm just using this to host the app
RDS: Our managed Postgres, for which we want a cache
ElastiCache: A managed cluster of Redis or Memcached, which we're going to use to cache query results.
Configure ElastiCache (Redis) and change the code to query ElastiCache before querying RDS
In this case I went ahead and created the whole solution for you, mostly because I didn't want you to have to SSH into the instance to add the ElastiCache address as an env var (actually, I didn't want to write the step-by-step instructions for that, or put the app in something better than an EC2 instance).
Here's the complete code, and here's the initial setup, which you can deploy with the CloudFormation template in the repo.
Step by step
Like I said, I just created the whole solution, so you won't need to execute these steps; I'll just show and explain them. Assume our initial state is a Node.js app deployed on an EC2 instance that connects to an RDS database. Here's the code:
index.js
const express = require('express');
const cors = require('cors');
require('dotenv').config();
const dbNoCache = require('./db-no-cache');
const dbWithCache = require('./db-with-cache');
require('./setup.js');

const app = express();
app.use(cors());

const slowQuery = "SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;";

app.get('/withcache', async (req, res) => {
  const start = Date.now();
  // Fetch the record from the database, checking the cache first
  const dbResponse = await dbWithCache.query(slowQuery, []);
  const end = Date.now();
  const elapsedTime = end - start;
  // Set no-cache headers so the browser doesn't cache the response
  res.set('Cache-Control', 'no-store, no-cache, must-revalidate, private');
  // Send response
  res.json({ data: dbResponse, elapsedTime: elapsedTime });
});

app.get('/nocache', async (req, res) => {
  const start = Date.now();
  // Fetch the record directly from the database
  const dbResponse = await dbNoCache.query(slowQuery, []);
  const end = Date.now();
  const elapsedTime = end - start;
  // Set no-cache headers so the browser doesn't cache the response
  res.set('Cache-Control', 'no-store, no-cache, must-revalidate, private');
  // Send response
  res.json({ data: dbResponse, elapsedTime: elapsedTime });
});
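// The /reset endpoint mentioned in Step 3. A minimal sketch using the flushCache
// helper exported by db-with-cache; the repo's version may differ.
app.get('/reset', async (req, res) => {
  await dbWithCache.flushCache();
  res.json({ message: 'Cache flushed' });
});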
// Set the port for the application
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}.`);
});
Step 1: Set up an ElastiCache (Redis) cluster
Go to the ElastiCache console
On the left, click Redis clusters
Click Create Redis cluster
For Choose a cluster creation method leave it as Easy create
For Configuration, choose Demo
Give the cluster a Name and Description
Under Connectivity select Create a new subnet group
Enter a Name and Description for the Subnet group
Select a VPC for the Subnet group, and select at least 2 subnets (or leave it as it is for all subnets)
Click Create
Wait a few minutes
Click on the cluster you just created
Under Cluster details, copy the value for Primary endpoint
Step 2: Configure the app to query the cache first
As you saw in the initial state, we have an index module that uses a db module to access the data. I set up two different versions of the db module. The initial one is this (in the repo you'll find it as db-no-cache.js; the code below is reconstructed from the snippets in this post, so check the repo for the exact version):
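const { Pool } = require('pg');

// db-no-cache.js: same Pool config as db-with-cache.js, just no Redis
const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  port: process.env.DB_PORT
});

module.exports = {
  query: (text, params) => {
    return pool.query(text, params);
  }
};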
I designed this so the index module would be (somewhat) isolated from the implementation details of how the database works, allowing us to insert the cache behavior in this module without needing to change the index module. This is the version with our changes (in the repo you'll find it as db-with-cache.js):
const { Pool } = require('pg');
const redis = require('./cache');

const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  port: process.env.DB_PORT
});

module.exports = {
  query: async (text, params) => {
    if (text.toLowerCase().startsWith('select')) {
      const key = JSON.stringify({ text, params });
      const cacheResult = await redis.get(key);
      if (cacheResult) {
        return JSON.parse(cacheResult);
      } else {
        const dbResult = await pool.query(text, params);
        // Cache the result with a 10-minute TTL (explained below)
        await redis.set(key, JSON.stringify(dbResult.rows), 'EX', 600);
        return dbResult.rows;
      }
    } else {
      return pool.query(text, params);
    }
  },
  flushCache: async () => {
    await redis.flushall();
  }
};
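The ./cache module (also in the repo) wraps the Redis client. Here's a minimal sketch of what it might look like, assuming node-redis v3 with promisified callbacks and an env var (here called CACHE_HOST, an assumption) holding the Primary endpoint you copied in Step 1:

const redis = require('redis');
const { promisify } = require('util');

// Connect to the ElastiCache cluster's Primary endpoint
const client = redis.createClient({
  host: process.env.CACHE_HOST,
  port: process.env.CACHE_PORT || 6379
});

// Expose promise-based versions of the commands the db module uses
module.exports = {
  get: promisify(client.get).bind(client),
  set: promisify(client.set).bind(client),
  flushall: promisify(client.flushall).bind(client)
};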
Step 3: Test it
As you probably saw in the index module, there are 3 paths:
/withcache: Runs the database query using the cache. Should be slow the first time, and a lot faster for subsequent invocations.
/nocache: Runs the database query without using the cache. Should be slow in every invocation.
/reset: Resets the cache, so you can test from scratch without redeploying everything.
The CloudFormation template outputs the EC2 instance's public IP address (or you can find it in the EC2 console). You'll need to paste this in the browser:
http://<public_ip>:3000/<path>
Replace <public_ip> with the instance's public IP address, and <path> with the path you want, such as /withcache. The response will show you the result of the query (the same in all cases) and the time it took to execute on the server, so you can compare the performance of using the cache vs. not using it.
Explanation
Step 1: Set up an ElastiCache (Redis) cluster
Redis is a key-value store, where we store any value under a key and retrieve it really fast by querying by that key. It's much faster than Postgres for this because there's nothing to calculate: we query it by a key, and either the value is there and we get it instantly, or it's not there and we don't get anything.
What we're going to do is use the query as a key, and store the result as the value for that key. That way, every time our backend needs the result for a query such as SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;, it can check whether the value is in Redis or not, and save Postgres the trouble of calculating that value. Keep in mind Redis doesn't process the query at all; it's just a string value that it will hash and use as the key in its own storage.
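In code, the whole trick is a set and a get (a sketch, assuming the promisified client from the ./cache module sketched above):

const redis = require('./cache');

const key = "SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;";

(async () => {
  // Store the query result under the query text itself, with a 10-minute TTL
  await redis.set(key, JSON.stringify([{ message: 'Thank you for being a subscriber!' }]), 'EX', 600);
  // Later lookups skip Postgres entirely: one key lookup, no parsing, no table scan
  console.log(JSON.parse(await redis.get(key)));
})();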
Step 2: Configure the app to query the cache first
The important part is replacing this code
query: (text, params) => {
  return pool.query(text, params);
},
With this code
query: async (text, params) => {
  if (text.toLowerCase().startsWith('select')) {
    const key = JSON.stringify({ text, params });
    const cacheResult = await redis.get(key);
    if (cacheResult) {
      return JSON.parse(cacheResult);
    } else {
      const dbResult = await pool.query(text, params);
      await redis.set(key, JSON.stringify(dbResult.rows), 'EX', 600);
      return dbResult.rows;
    }
  } else {
    return pool.query(text, params);
  }
},
First we're making sure this cache logic only applies to SELECTs (this is a rather crude way to do it, I know). Then we try to get the result from Redis, using the query as the key. If that succeeds, we just return that, and that's it. If it fails, we query Postgres, and then store the result in Redis for subsequent queries, before returning it.
The 'EX', 600 part sets a Time To Live (TTL) for that value: after 600 seconds (10 minutes), Redis will delete it. This ensures old values aren't cached forever, and guarantees that the cached value is at most 10 minutes old (which doesn't guarantee that it's the latest value). An alternative is to invalidate the value on every write (if this is the only app that writes to this database), so that when the value changes in Postgres, the now-stale value is evicted from Redis. I didn't want to complicate it too much though; there's a sketch of that alternative below.
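Here's what that alternative might look like (a hypothetical sketch, not in the repo): since our keys are derived from query strings, the simplest correct option is also the bluntest one, flushing all cached reads on any write.

const { Pool } = require('pg');
const redis = require('./cache');

const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  port: process.env.DB_PORT
});

module.exports = {
  query: async (text, params) => {
    if (text.toLowerCase().startsWith('select')) {
      // Same cache-aside logic as db-with-cache.js
      const key = JSON.stringify({ text, params });
      const cacheResult = await redis.get(key);
      if (cacheResult) return JSON.parse(cacheResult);
      const dbResult = await pool.query(text, params);
      await redis.set(key, JSON.stringify(dbResult.rows), 'EX', 600);
      return dbResult.rows;
    }
    // Writes: run the statement, then flush cached reads so they can't go stale
    const result = await pool.query(text, params);
    await redis.flushall();
    return result;
  }
};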
Step 3: Test it
Both endpoints /withcache and /nocache run the same query:
SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;
It's a simple query, but that table has a million records, so it takes Postgres a couple of milliseconds to find the right one (I didn't add any indices on purpose). That's only relevant when going to Postgres though; when Redis resolves the query it doesn't process it, it just fetches the value. Try it with the /withcache endpoint and you'll see the difference.
Discussion
This is called a cache-aside caching strategy, where the consumer (our app in this case) is responsible for querying the cache, querying the source (when necessary), and updating the cache. We've made it as minimally disruptive to our code as possible by using good code practices like not depending on specific data sources. That way, all we had to change was the implementation of query in our db.js module (which I wrote in 2 versions: db-with-cache.js and db-no-cache.js).
The alternative is a cache-through (also called read-through) caching strategy, where the consumer communicates directly with the cache, and it's the cache's responsibility to query the source when necessary and update itself. That's how CloudFront works: our DNS points to the CloudFront endpoint, and we set our origin (servers, S3, etc.) as an Origin of CloudFront. We have less control over what's cached and when (though CloudFront and most CDNs are very configurable in this regard), but our consumer is completely independent of the origin.
While this caching strategy can be very effective at reducing load on the database, we need to account for the scenario where we get a 0% cache hit rate (the cache hit rate is the percentage of queries resolved by the cache, without needing to access the database). That can happen after a long period of not updating the cache (e.g. after a system failure) or after a failure in the cache itself. In those cases, traffic to the database goes back to "normal", as if we didn't have the cache: every query hits the database, at least until the cache is populated again. If we implemented this strategy, saw a reduction in database usage, and downsized our database instance to save costs, then our smaller database instance won't be able to support all that traffic, leading to a database failure and a complete system failure. And if this happens after a restart, the system won't be able to start back up, since it will keep failing for the same reason.
That presents a scary scenario for system failures: not being able to start it again. It can be fixed by throttling the traffic to the database, either with an SQS queue like we saw in a past issue, or by throttling traffic to the server itself using API Gateway. Or you can create a temporary read replica, which will mean you need to wait a bit longer, but it will also mean you don't have to change the architecture or drop traffic.
One upside of having a cache is that, if the database fails, you can still serve a degraded response: All queries that are in the cache can be responded to from the cache, with no code changes. Queries that are not in the cache, plus all writes, will fail, just like everything would fail if the database went down and we didn't have a cache. Not ideal at all, but slightly less catastrophic than everything failing.
By the way, you might correctly guess that the average response time for requests that are a cache miss increases very slightly. This is because we're adding an extra step: querying Redis. The average time across all requests should still go down significantly, because our particular SQL query is really slow. If we didn't care about cost and just wanted to lower response times, we could send the requests to Redis and Postgres simultaneously, and respond with whichever answers first (Redis if it's a cache hit, Postgres if it's a cache miss). RDS would see the same load, since every query goes to it (even on cache hits, where the result isn't used at all), and we'd be paying for Redis on top of that. However, average response time for cache misses stays the same, and decreases for cache hits.
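If you wanted to try that, Promise.any makes it a few lines (a hypothetical sketch, assuming Node 15+ and the same pool and promisified redis client as above):

const { Pool } = require('pg');
const redis = require('./cache');

const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  port: process.env.DB_PORT
});

// Query Redis and Postgres in parallel and respond with whichever resolves first.
// A cache miss rejects, so on misses Postgres always wins.
const queryFastest = (text, params) => {
  const key = JSON.stringify({ text, params });
  const fromCache = redis.get(key).then((hit) => {
    if (hit === null) throw new Error('cache miss');
    return JSON.parse(hit);
  });
  const fromDb = pool.query(text, params).then((result) => {
    // Refresh the cache in the background, like in db-with-cache.js
    redis.set(key, JSON.stringify(result.rows), 'EX', 600);
    return result.rows;
  });
  return Promise.any([fromCache, fromDb]);
};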
Best Practices
Operational Excellence
Keep code modularized: I managed to implement this change because my db module keeps all database implementation details separate from the rest of the app (the index module). Mine is just a code sample, and there are much better ways to do this, especially if you're working with types (e.g. TypeScript). If you need to change more than one or two modules to implement this, do a refactor.
Security
Keep your databases private: I didn't bother setting up RDS and ElastiCache in a private subnet, but you definitely should.
Don't hardcode your passwords in your CloudFormation template!: I only did it to keep it as simple as possible, under the assumption that, with security groups blocking traffic from the internet, the hour or so that you'll keep this running just to play with it isn't enough for an attacker to break through. If you want to keep this running longer, don't hardcode passwords anywhere.
Reliability
Multi-AZ Deployment for ElastiCache: Enable multi-AZ deployments for the ElastiCache cluster (it will require more than one node) to ensure that the caching layer remains available and durable in the event of a node failure or other issues in a single availability zone.
Use Replicas: This one's probably not that useful for just a cache, where losing the cached data doesn't mean losing the actual data (it's still in the database). Still, if you want to avoid the performance hit on RDS of losing part of your cache, you might want to enable replicas. They work just like database failover replicas: an extra instance in a different AZ, you pay twice, and failover happens automatically. Not really necessary for a cache, but if you're using Redis as your main database, definitely a must.
Performance Efficiency
Enable Auto Scaling for ElastiCache: Use ElastiCache Auto Scaling to dynamically adjust the number of nodes. This will ensure your caching layer can handle traffic spikes while keeping costs in check during periods of lower demand.
Cache Warming: Warming up the cache means preloading it with data before it's used. If you know what queries to expect, you can schedule a Lambda function to query RDS and put the result in ElastiCache every 10 minutes or so, to keep the cache warm. Note that you could also set a higher TTL, such as 1 hour, but in that case you might be serving stale data for up to an hour. This alternative keeps the cache warm without serving data staler than 10 minutes. There's a sketch of such a function right after this list.
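Here's a sketch of such a warming function (hypothetical, not part of the repo; assumes it runs as a scheduled Lambda, e.g. via an EventBridge rule, with network access to both RDS and ElastiCache and the same env vars as the app):

const { Pool } = require('pg');
const redis = require('./cache');

const pool = new Pool({
  host: process.env.DB_HOST,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  database: process.env.DB_NAME,
  port: process.env.DB_PORT
});

const slowQuery = "SELECT * FROM simple_aws WHERE message='Thank you for being a subscriber!' LIMIT 1;";

exports.handler = async () => {
  // Re-run the expensive query and overwrite the cached value, resetting its TTL
  const result = await pool.query(slowQuery, []);
  const key = JSON.stringify({ text: slowQuery, params: [] });
  await redis.set(key, JSON.stringify(result.rows), 'EX', 600);
};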
Cost Optimization
Right-sizing ElastiCache instances: Same as with every instance, find the right size for your usage.
Monitor and Optimize Cache Hit Ratio: Whether this whole solution is worth the money and trouble depends on your cache hit ratio being high enough. Keep an eye on it.
Use Reserved Instances for ElastiCache: Yup, there are Reserved Instances for ElastiCache. As usual, cost savings for stable workloads, with upfront or no upfront.
Resources
I purposefully glossed over exactly how Redis works. If you want to take a deep dive, I highly recommend this coding challenge to build your own Redis. Just don't use it for prod, you know we prefer managed services!
There's 2 ways to use AI to reduce costs in AWS: Ask ChatGPT what to do and do it all yourself, or use Pump to do it for you in just a few minutes. I usually doubt whether these things can help me save enough to justify their cost, but here's the crazy thing: Pump is free! (it actually monetizes through bulk discounts, not through users). The only thing I don't like is the minimum $1000/month AWS bill. But if you're above that, it's really worth a shot.