Aurora Global Database for Disaster Recovery on AWS
Need a Disaster Recovery solution for relational databases in AWS? Learn how to set up an Amazon Aurora Global Database for automatic failover.
I'm sure you're familiar with Amazon Aurora already, and how automatic failover to replicas lets you create highly available applications easily. I'd like to call your attention to Aurora Global Database, a feature that lets you create replicas of Aurora databases in other regions. With this feature, you can get the same experience as with a single-region Amazon Aurora cluster, but globally, making Disaster Recovery much easier.
Amazon Aurora Basic Concepts
Amazon Aurora is a managed relational database service from AWS that lets you create and manage database clusters easily. It's similar to Amazon RDS, but it works with AWS's own version of the MySQL and Postgres database engines. These are the basic concepts of Aurora:
Cluster: An Aurora Cluster is a group of instances, where you have a primary instance and zero or more replicas.
Primary instance: The primary instance is the database instance responsible for handling all write operations, and potentially also read operations.
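Replica: Replicas are read-only instances that can handle read operations. They also act as failover targets: if the primary instance fails, Aurora promotes a replica to become the new primary instance. It picks the replica with the highest failover priority (the lowest promotion tier) and, among replicas in the same tier, the largest one.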
Endpoints: The cluster has two DNS records: a read-write endpoint that can accept read and write operations, and a read-only endpoint that only accepts read operations. The read-write endpoint resolves to the current primary instance, and the read-only endpoint resolves to replicas if present, or to the primary instance if there are no replicas. These endpoints are updated automatically if replicas are created, deleted, or promoted to primary.
How Does an Amazon Aurora Global Database Work?
An Aurora Global Database is a global cluster comprised of several regional Aurora clusters, each in a different region. Every cluster has its own instances, though there is only one primary instance across the entire global cluster.
Data is replicated across all the regional clusters inside the global cluster. If an entire AWS region fails, the Aurora cluster in that region will fail entirely. In that case, another cluster is promoted to primary, and one of the replicas in that cluster is made into a primary instance.
Replication is asynchronous, meaning the primary instance isn't blocked while data is being replicated. Replication across regions typically lags by less than 1 second, which means an Aurora global database will typically have an RPO of about 1 second.
Using an Aurora Global Database for Disaster Recovery
Disaster Recovery means that an application is capable of either continuing to operate or restoring operations quickly in the case of an entire AWS Region failing. The most crucial point in Disaster Recovery is backing up data to a Disaster Recovery region.
For a Backup and Restore Disaster Recovery strategy, database snapshots are more than enough. However, disaster recovery strategies such as Pilot Light and Warm Standby require you to keep a live copy of the data, to greatly improve the time it takes to restore the service (which is called Recovery Time Objective, or RTO).
An Aurora Global Database lets you implement real-time cross-region replication of the data very easily. Additionally, the automated failover mechanism ensures that your application can recover quickly and automatically from the failure of an AWS region. With a replication lag of under 1 second, data loss in the event of a disaster is minimal.
When the primary AWS region is back online, you can use Aurora's switchover feature to promote the Aurora cluster in your original AWS region back to primary. This takes a single click and about 5 minutes, and it doesn't cause any data loss.
Replication and RPO in an Aurora Global Database
An Aurora Global Database provides a Recovery Point Objective (RPO) of 1 second. This means data loss in the event of a disaster is minimal. Furthermore, 1 second is an enormous improvement when compared to database snapshots, which typically offer an RPO of 6 to 24 hours.
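To make the difference concrete, here is a rough back-of-the-envelope sketch. The write rate of 50 writes per second is a made-up figure for illustration, and the function name is hypothetical. The RPO values are the ones quoted above.

```python
def max_writes_lost(rpo_seconds: int, writes_per_second: int) -> int:
    """Upper bound on writes lost if the region fails at the worst moment."""
    return rpo_seconds * writes_per_second


WRITES_PER_SECOND = 50  # hypothetical workload, for illustration only

global_db_loss = max_writes_lost(1, WRITES_PER_SECOND)           # 1-second RPO
snapshot_loss = max_writes_lost(6 * 60 * 60, WRITES_PER_SECOND)  # 6-hour RPO

print(global_db_loss)  # 50
print(snapshot_loss)   # 1080000
```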
An Aurora Global Database is more expensive than just using database snapshots, because you need to pay for a continuously running database instance in your disaster recovery region. However, the configuration, maintenance, and failover process is significantly easier. This reduces the engineering hours that you need to invest in setting up your disaster recovery solution, which in turn reduces your costs.
Setting Up an Aurora Global Database on AWS
Creating an Aurora Global Database requires two steps: Creating the primary cluster in your primary AWS region, and adding a secondary cluster in a new region. Here are step-by-step instructions to do both.
Creating the Primary Aurora Cluster in the AWS Console
Sign in to the AWS Console
Go to the Amazon RDS console: https://console.aws.amazon.com/rds/
Click Create database
For database creation method, select Standard create
In the Engine options section, for Engine type, select Aurora (PostgreSQL Compatible)
Under Engine version, expand Show filters and enable Show versions that support the global database feature
For Engine version, select the latest version of Aurora PostgreSQL that you can
Under Templates, select Dev/Test for this test, or Production if you're creating a production database
Under Settings, enter a name for your primary cluster
Enter a secure password, or let AWS generate one
For DB instance class, select db.r5.large or another memory optimized DB instance class
Under Availability & durability, select the option to let Aurora create an Aurora Replica in a different AZ. You need one replica in the primary region for each secondary region that you want to add.
Under Connectivity, select your desired VPC
Under Additional configuration, enter a name for Initial database name
Leave the defaults selected for the DB cluster parameter group and DB parameter group
Accept all other default settings for Additional configuration
Click Create database
Wait until the cluster finishes creating. It can take several minutes
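If you'd rather script these steps, the following is a minimal sketch of roughly the same setup using the AWS SDK for Python. It assumes `rds` is a boto3 RDS client for your primary region. The function name and the instance naming scheme are hypothetical, and options such as the VPC, engine version, and initial database name are left at their defaults here, so treat it as a starting point rather than an equivalent of the console flow.

```python
def create_primary_cluster(rds, global_id, cluster_id, username, password,
                           instance_class="db.r5.large"):
    """Create a global cluster, its primary regional cluster, and two instances.

    `rds` is expected to be a boto3 RDS client for the primary region.
    """
    # The global cluster is the container that regional clusters join.
    rds.create_global_cluster(
        GlobalClusterIdentifier=global_id,
        Engine="aurora-postgresql",
    )
    # The primary regional cluster joins the global cluster on creation.
    rds.create_db_cluster(
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        GlobalClusterIdentifier=global_id,
        MasterUsername=username,
        MasterUserPassword=password,
    )
    # Two instances: one becomes the writer, the other an Aurora Replica.
    # The "-instance-N" naming is a hypothetical convention.
    instance_ids = [f"{cluster_id}-instance-{n}" for n in (1, 2)]
    for instance_id in instance_ids:
        rds.create_db_instance(
            DBInstanceIdentifier=instance_id,
            DBClusterIdentifier=cluster_id,
            DBInstanceClass=instance_class,
            Engine="aurora-postgresql",
        )
    return instance_ids


# Example (requires boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# rds = boto3.client("rds", region_name="us-east-1")
# create_primary_cluster(rds, "my-global-db", "my-primary", "postgres", "<password>")
```

Passing the client in as a parameter keeps the function easy to exercise against a stub before pointing it at a real account.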
Adding a Secondary Aurora Cluster
Sign in to the AWS Console
Open the Amazon RDS console: https://console.aws.amazon.com/rds/
In the panel on the left, click on Databases
Select the Aurora global database that you just created. Make sure the primary Aurora DB cluster is in the Available state
Click on Actions and click Add region
On the Add a region page, select your disaster recovery region as the secondary AWS Region
Complete the rest of the options with the same values you used for your primary cluster
Click Add region
You'll see the new secondary cluster in the list of Databases in the AWS Management Console.
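The same step can be sketched with the AWS SDK for Python. This assumes `rds_secondary` is a boto3 RDS client created for the disaster recovery region and that the global cluster already exists. The function name and instance naming are hypothetical. An encrypted global database would also need a KMS key for the secondary region, which this sketch leaves out.

```python
def add_secondary_cluster(rds_secondary, global_id, cluster_id,
                          instance_class="db.r5.large"):
    """Attach a new regional cluster to an existing global cluster.

    `rds_secondary` is expected to be a boto3 RDS client for the DR region.
    """
    # No master username or password here: a secondary cluster takes its
    # credentials from the primary cluster of the global database.
    rds_secondary.create_db_cluster(
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        GlobalClusterIdentifier=global_id,
    )
    # At least one instance is needed for the cluster to serve reads
    # and to be promoted later. The name is a hypothetical convention.
    instance_id = f"{cluster_id}-instance-1"
    rds_secondary.create_db_instance(
        DBInstanceIdentifier=instance_id,
        DBClusterIdentifier=cluster_id,
        DBInstanceClass=instance_class,
        Engine="aurora-postgresql",
    )
    return instance_id


# Example (requires boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# rds_dr = boto3.client("rds", region_name="us-west-2")
# add_secondary_cluster(rds_dr, "my-global-db", "my-secondary")
```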
Failover with Aurora Global Databases
Automatic failover is the critical feature of Aurora Global Databases that we're interested in, besides the 1-second replication. It's going to provide nearly-uninterrupted service for our workload in the event of a primary region outage.
When an outage is detected in the primary region, Aurora Global Database automatically promotes the cluster in the secondary region to primary. The read-write and read-only endpoints are also updated automatically to point to the instances in the new primary region. This way, you don't need to perform any manual actions, either on the instances or to update the connection details.
While the failover procedure for the database is entirely managed by Aurora, you should still test your disaster recovery procedures regularly. You can use AWS Fault Injection Simulator to simulate failures in Aurora instances, or even in an entire AWS region.
Switchover to the Original Region
After a disaster event and the failover to your disaster recovery region, you'll typically want to return to the original primary region. The best way to do this with no data loss is to manually trigger a switchover process.
During a switchover, Aurora promotes to primary the cluster that is in your target switchover region. When switching back to your original primary region after a failover, you'll want to select your original primary region as the target switchover region.
When you initiate a switchover process, Aurora first waits for all secondary region clusters to be completely synchronized with the cluster in the primary region. Then, the Aurora cluster in the currently primary region becomes read-only, and the chosen secondary cluster promotes one of its replicas to primary status. This promotion allows that cluster to assume the role of primary cluster within the global Aurora cluster.
Since all the secondary clusters were synchronized with the primary when the process started, the new primary cluster can continue to operate without any loss of data. Keep in mind, your database will be unavailable for a short period of time while the clusters assume their new roles.
Best Practices for a Global Aurora Cluster Switchover
To maximize availability and ensure a smooth switchover, you should keep the following best practices in mind:
Perform this operation during off (non-peak) hours, or at any time when writes to the Aurora cluster are minimal
Put your application in read-only mode to prevent it from sending write operations to the Aurora cluster while the switchover is taking place
To have an idea of how long the switchover will take, check the AuroraGlobalDBRPOLag metric for all secondary Aurora DB clusters
Make sure the cluster configurations are the same for the primary and secondary clusters. These configurations aren't copied automatically, and a mismatch can cause the switchover to fail, or the database to experience issues.
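As a rough illustration of the lag check, the sketch below reads the most recent AuroraGlobalDBRPOLag value (in milliseconds) from CloudWatch. It assumes `cloudwatch` is a boto3 CloudWatch client for the secondary cluster's region, and that the metric is published under the AWS/RDS namespace with a DBClusterIdentifier dimension. Verify the dimensions in your own account before relying on it. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def latest_rpo_lag_ms(cloudwatch, cluster_id, minutes=10):
    """Return the newest AuroraGlobalDBRPOLag datapoint, or None if there is none."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBRPOLag",
        # Assumed dimension: the secondary cluster's identifier.
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort first.
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Maximum"] if datapoints else None


# Example (requires boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# cw = boto3.client("cloudwatch", region_name="us-west-2")
# print(latest_rpo_lag_ms(cw, "my-secondary"))
```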
Step-by-Step Instructions to Perform a Switchover
Sign in to the AWS Console
Open the Amazon RDS console: https://console.aws.amazon.com/rds/
On the menu on the left, click Databases
Select the database you want to switch over
Click Actions, and click Switch over or fail over global database
Select the Switchover option
Under New primary cluster, select the cluster that you want to become the new primary
When the switchover process finishes, you'll be able to see the Aurora DB clusters and their current roles in the Databases list.
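The console steps above can also be sketched with the AWS SDK for Python. This assumes `rds` is a boto3 RDS client and that your boto3 version is recent enough to include the switchover operation. The target cluster is identified by its ARN. The function name is hypothetical.

```python
def switch_over(rds, global_id, target_cluster_arn):
    """Start a switchover and return the global cluster status reported by the API."""
    response = rds.switchover_global_cluster(
        GlobalClusterIdentifier=global_id,
        # ARN of the secondary cluster that should become the new primary.
        TargetDbClusterIdentifier=target_cluster_arn,
    )
    return response["GlobalCluster"]["Status"]


# Example (requires boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# rds = boto3.client("rds", region_name="us-west-2")
# switch_over(rds, "my-global-db",
#             "arn:aws:rds:us-east-1:123456789012:cluster:my-primary")
```

The call only starts the process. You would still watch the Databases list, or poll the global cluster, to confirm that the roles have changed.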
How to Monitor an Aurora Global Database?
There are several options for monitoring the performance of the Aurora clusters that make up your Aurora Global Database:
Amazon RDS Performance Insights: Enables the database performance schema in the database engine.
Enhanced monitoring: Generates CloudWatch Metrics for CPU utilization by process and thread.
Amazon CloudWatch Logs: Publishes database logs to CloudWatch Logs. Errors are published to CloudWatch Logs by default, but you can enable additional logs depending on your database engine.
Using Amazon RDS Performance Insights to Monitor an Aurora Global Database
You can monitor Aurora Global Databases using Amazon RDS Performance Insights, just like you would with regional Aurora clusters. Performance Insights needs to be enabled individually on each cluster that's part of your global database. Keep in mind that you'll need to also configure it for newly-added clusters, since it's not inherited from the global cluster.
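Because the setting isn't inherited, a small helper can apply it to every instance of a regional cluster. The sketch below assumes `rds` is a boto3 RDS client for that cluster's region. The function name is hypothetical, and you'd run it once per regional cluster, each time with a client for the matching region.

```python
def enable_performance_insights(rds, cluster_id):
    """Turn on Performance Insights for every instance in one regional cluster."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    instance_ids = [m["DBInstanceIdentifier"] for m in cluster["DBClusterMembers"]]
    for instance_id in instance_ids:
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            EnablePerformanceInsights=True,
            ApplyImmediately=True,
        )
    return instance_ids


# Example (requires boto3 and AWS credentials; identifiers are placeholders):
# import boto3
# for region, cluster in [("us-east-1", "my-primary"), ("us-west-2", "my-secondary")]:
#     enable_performance_insights(boto3.client("rds", region_name=region), cluster)
```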
When you're viewing the Performance Insights page, you can switch regions to view the different instances. If you're not seeing up-to-date information, you'll need to select the name of the instance again.
Using Database Activity Streams to Monitor Aurora Global Databases
Database Activity Streams is a feature that logs in an Amazon Kinesis Data Stream all queries, transactions and database API calls. It's used for auditing activity in a database, and can be integrated with external services for compliance validation.
In an Aurora Global Database, you need to create an Activity Stream on each cluster separately. Each cluster delivers its audit data to its own Kinesis stream, within its own AWS Region. Failover and switchover activities do not impact activity streams: information is still streamed to the corresponding Kinesis data stream. A regional failure will most likely cause the Amazon Kinesis service to fail as well, but since the Aurora cluster in that region would also be failing, that shouldn't be a problem.
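Here is a minimal sketch of starting one stream per regional cluster with the AWS SDK for Python. It assumes you pass in, for each cluster, a boto3 RDS client for that cluster's region, the cluster ARN, and a KMS key from the same region. The function name and the tuple layout are hypothetical, and asynchronous mode is just one possible choice.

```python
def start_activity_streams(clusters):
    """Start an activity stream on each cluster.

    `clusters` is an iterable of (rds_client, cluster_arn, kms_key_id) tuples,
    one per regional cluster. Returns a mapping of cluster ARN to the name of
    the Kinesis stream that receives its audit data.
    """
    streams = {}
    for rds, cluster_arn, kms_key_id in clusters:
        response = rds.start_activity_stream(
            ResourceArn=cluster_arn,
            Mode="async",
            KmsKeyId=kms_key_id,
            ApplyImmediately=True,
        )
        streams[cluster_arn] = response["KinesisStreamName"]
    return streams
```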
Monitoring Aurora Global Databases running Aurora MySQL
Here's how you can monitor an Aurora Global Database that's using Aurora MySQL:
Connect to the global database primary cluster endpoint using a MySQL client
Query the information_schema.aurora_global_db_status table. This SQL query returns the replication lag times for the secondary Aurora DB clusters in the global database:
mysql> select * from information_schema.aurora_global_db_status;
AWS_REGION | HIGHEST_LSN_WRITTEN | DURABILITY_LAG_IN_MILLISECONDS | RPO_LAG_IN_MILLISECONDS | LAST_LAG_CALCULATION_TIMESTAMP | OLDEST_READ_VIEW_TRX_ID
-----------+---------------------+--------------------------------+-------------------------+--------------------------------+------------------------
us-east-1  | 183537946           | 0                              | 0                       | 1970-01-01 00:00:00.000000     | 0
us-west-2  | 183537944           | 428                            | 0                       | 2023-02-18 01:26:41.925000     | 20806982
(2 rows)
Query the information_schema.aurora_global_db_instance_status table to list all secondary DB instances for both the primary DB cluster and the secondary DB clusters:
mysql> select * from information_schema.aurora_global_db_instance_status;
SERVER_ID            | SESSION_ID                           | AWS_REGION | DURABLE_LSN | HIGHEST_LSN_RECEIVED | OLDEST_READ_VIEW_TRX_ID | OLDEST_READ_VIEW_LSN | VISIBILITY_LAG_IN_MSEC
---------------------+--------------------------------------+------------+-------------+----------------------+-------------------------+----------------------+------------------------
ams-gdb-primary-i2   | MASTER_SESSION_ID                    | us-east-1  | 183537698   | 0                    | 0                       | 0                    | 0
ams-gdb-secondary-i1 | cc43165b-bdc6-4651-abbf-4f74f08bf931 | us-west-2  | 183537689   | 183537692            | 20806928                | 183537682            | 0
ams-gdb-secondary-i2 | 53303ff0-70b5-411f-bc86-28d7a53f8c19 | us-west-2  | 183537689   | 183537692            | 20806928                | 183537682            | 677
ams-gdb-primary-i1   | 5af1e20f-43db-421f-9f0d-2b92774c7d02 | us-east-1  | 183537697   | 183537698            | 20806930                | 183537691            | 21
(4 rows)
Monitoring Aurora Global Databases running PostgreSQL
You can monitor an Aurora Global Database that's using Aurora PostgreSQL with the following commands:
Connect to the global database primary cluster endpoint using psql or your favorite database client
Use the aurora_global_db_status function in a psql command to list the primary and secondary volumes:
postgres=> select * from aurora_global_db_status();
aws_region | highest_lsn_written | durability_lag_in_msec | rpo_lag_in_msec | last_lag_calculation_time  | feedback_epoch | feedback_xmin
-----------+---------------------+------------------------+-----------------+----------------------------+----------------+--------------
us-east-1  | 93763984222         | -1                     | -1              | 1970-01-01 00:00:00+00     | 0              | 0
us-west-2  | 93763984222         | 900                    | 1090            | 2020-05-12 22:49:14.328+00 | 2              | 3315479243
(2 rows)
Use the aurora_global_db_instance_status function to list all secondary DB instances for both the primary DB cluster and secondary DB clusters:
postgres=> select * from aurora_global_db_instance_status();
server_id                                   | session_id                           | aws_region | durable_lsn | highest_lsn_rcvd | feedback_epoch | feedback_xmin | oldest_read_view_lsn | visibility_lag_in_msec
--------------------------------------------+--------------------------------------+------------+-------------+------------------+----------------+---------------+----------------------+------------------------
apg-global-db-rpo-mammothrw-elephantro-1-n1 | MASTER_SESSION_ID                    | us-east-1  | 93763985102 |                  |                |               |                      |
apg-global-db-rpo-mammothrw-elephantro-1-n2 | f38430cf-6576-479a-b296-dc06b1b1964a | us-east-1  | 93763985099 | 93763985102      | 2              | 3315479243    | 93763985095          | 10
apg-global-db-rpo-elephantro-mammothrw-n1   | 0d9f1d98-04ad-4aa4-8fdd-e08674cbbbfe | us-west-2  | 93763985095 | 93763985099      | 2              | 3315479243    | 93763985089          | 1017
(3 rows)
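Once you have these rows in your application, a simple check can flag regions that are falling behind. The sketch below assumes the rows have already been fetched from aurora_global_db_status() into dictionaries keyed by the PostgreSQL column names. The function name and the 1000 ms threshold are hypothetical. The primary region reports -1, so it never trips the threshold.

```python
def lagging_regions(rows, max_rpo_lag_ms=1000):
    """Return the regions whose RPO lag exceeds the threshold, in row order."""
    return [
        row["aws_region"]
        for row in rows
        if row["rpo_lag_in_msec"] > max_rpo_lag_ms
    ]


# Values taken from the sample aurora_global_db_status() output.
sample = [
    {"aws_region": "us-east-1", "rpo_lag_in_msec": -1},
    {"aws_region": "us-west-2", "rpo_lag_in_msec": 1090},
]
print(lagging_regions(sample))  # ['us-west-2']
```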
An Aurora Global Database is a simple and very effective way to implement a disaster recovery strategy for your Postgres or MySQL relational databases. It offers a 1-second RPO and an RTO of around 5 minutes, which is more than enough for a Pilot Light or Warm Standby disaster recovery strategy. The management overhead is minimal when compared to running your Aurora database in a single region. Furthermore, the additional cost of an extra cluster in your disaster recovery region is what's expected of a disaster recovery plan that keeps the data live in the disaster recovery region.