
Aurora DSQL: The PostgreSQL-Compatible Database That Will Break Your App (And How to Fix It)

Aurora DSQL is a new service just launched by AWS. Marketed as the “Fastest serverless distributed SQL database for always available applications”, it promises PostgreSQL compatibility with virtually unlimited scale. But here's what AWS doesn't tell you upfront: it's going to break your existing PostgreSQL application. Not maybe. Not sometimes. It will break it (and that's actually by design).

I've been really excited about this since I heard about it at re:Invent 2024, but my reservations started when I talked to the team behind it and they told me there are no Foreign Keys. So now that it's generally available, I decided to spend a bit of time diving deep into Aurora DSQL and discovering all the ways this "compatible" database service differs from what you'd expect. The good news? Once you understand what you're dealing with, DSQL becomes incredibly powerful. The bad news? You're going to need to rethink how you build database applications. It's true, there are no Foreign Keys.

What Aurora DSQL Actually Is

Aurora DSQL isn't just Aurora with auto-scaling bolted on (that would be too easy). It's a fundamentally different beast that happens to speak PostgreSQL's language. Think of it as a distributed, serverless SQL database that borrowed PostgreSQL's syntax but threw out most of the operational model you're used to.

The architecture consists of four main layers that scale independently:

  • A relay and connectivity layer handling client connections.

  • Compute and database engines processing your queries.

  • A distributed transaction log managing concurrency and isolation.

  • A distributed storage layer.

A control plane orchestrates everything across three Availability Zones, automatically scaling each component based on demand (which sounds great until you see the DPU bills).

Here's where it gets interesting: DSQL uses Optimistic Concurrency Control instead of traditional locking. No more deadlocks, but also no more assuming your transaction will just work. Every write operation can fail at commit time if another transaction touched the same data, and your application needs to handle that gracefully. This isn't an edge case; it's normal operation.

The service comes in two flavors: single-Region clusters that operate active-active across three AZs within one AWS Region (meaning it's highly available out of the box), and multi-Region clusters that synchronously replicate across multiple regions with a "witness" region handling log consensus (much better than Aurora Global Databases). The single-Region setup targets 99.99% availability, while multi-Region promises 99.999% (those extra nines cost extra, obviously).

What makes this serverless is that you don't provision instances or manage scaling policies. DSQL automatically partitions your data based on primary key ranges and access patterns, spins up compute resources as needed, and bills you in Distributed Processing Units (DPUs) for actual usage. It's infrastructure abstraction taken to its logical extreme (for better and worse). At least it's truly serverless, unlike Aurora Serverless.

The Three Things That Will Surprise You

1. Optimistic Concurrency Control Will Change How You Code

Forget everything you know about database transactions. DSQL doesn't use locks during transaction execution; it validates conflicts only at commit time. This means any write transaction can fail with a serialization error, even if it ran perfectly during execution.

When conflicts occur, you'll see PostgreSQL serialization errors with SQLSTATE '40001' or DSQL-specific 'OC000' and 'OC001' error codes. Your application must catch these and retry the entire transaction. Not some transactions. All transactions. Here's what robust retry logic looks like:

import java.sql.SQLException;

@FunctionalInterface
public interface TransactionLogic {
    void run() throws SQLException;
}

public void executeWithRetry(TransactionLogic transactionLogic)
        throws SQLException, InterruptedException {
    int maxRetries = 3;
    int attempt = 0;

    while (attempt < maxRetries) {
        try {
            transactionLogic.run();
            return; // Success
        } catch (SQLException e) {
            String state = e.getSQLState();
            boolean retryable = state != null
                    && (state.startsWith("OC") || state.equals("40001"));
            if (!retryable) {
                throw e; // Non-retryable error
            }
            attempt++;
            if (attempt >= maxRetries) throw e;

            // Exponential backoff with jitter
            Thread.sleep((long) (Math.pow(2, attempt) * 100 + Math.random() * 100));
        }
    }
}

The key insight: every operation must be idempotent. You can't just increment a counter and retry, you need to structure operations as "set counter to X if current value is Y" patterns. This fundamentally changes how you design business logic (trust me, it's a bigger shift than it sounds).
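Here's what that looks like in practice: a minimal sketch of the "set value to X if current value is Y" shape, using psycopg against a hypothetical counters(id, value) table (nothing DSQL-specific, just the write pattern that makes retries safe):

import psycopg

def set_counter(conn: psycopg.Connection, counter_id: str,
                expected_value: int, new_value: int) -> bool:
    """Move the counter from expected_value to new_value, or do nothing."""
    with conn.cursor() as cur:
        # Safe to wrap in the retry logic above: if a retry runs after the row
        # already changed, the WHERE clause matches nothing and rowcount is 0,
        # instead of applying an increment twice.
        cur.execute(
            "UPDATE counters SET value = %s WHERE id = %s AND value = %s",
            (new_value, counter_id, expected_value),
        )
        conn.commit()
        return cur.rowcount == 1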

2. PostgreSQL "Compatibility" Has Some Creative Interpretations

DSQL speaks PostgreSQL's wire protocol and supports a substantial subset of SQL, but "compatible" doesn't mean "identical". The list of unsupported features reads like a PostgreSQL greatest hits album: no foreign keys, no triggers, no sequences, no PL/pgSQL, no user-defined functions except pure SQL ones. To be honest, I can personally do without all of those except foreign keys; that's the one thing I miss.

The isolation level is fixed at Repeatable Read (you can't change it), transactions have size limits of 10 MiB and 3,000 row modifications, and you can't mix DDL and DML in the same transaction. Each transaction can contain exactly one DDL statement, period.

Here's where it gets tricky for migrations. Take this seemingly innocent PostgreSQL table:

-- This works in PostgreSQL
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total NUMERIC(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TRIGGER update_customer_stats 
    AFTER INSERT ON orders 
    FOR EACH ROW 
    EXECUTE FUNCTION update_customer_totals();

In DSQL, you'd need to rewrite this as:

-- DSQL version requires major changes
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID,  -- No foreign key constraint
    total FLOAT,       -- NUMERIC not supported in keys
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Trigger logic moves to application layer (as it should!!)
-- Foreign key validation becomes application responsibility (this is annoying, but it's manageable)

The absence of foreign keys means referential integrity becomes your application's job. No more relying on the database to prevent orphaned records (we've all been there, and it's not fun to debug in production).

The trick here is that foreign keys still exist in your data model; they're just no longer enforced by the database engine. You can still express relationships in your code, but if you want them enforced and want to catch the related bugs, you must do that at the application layer. Give it a long, hard thought, and you'll see that it's not the end of the world, just a quality of life tradeoff. But seriously, give it some thought. It took me 20 minutes of conversation to come to this conclusion.
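To make that concrete, here's what application-level referential integrity might look like for the orders/customers example above. This is a hedged sketch (the helper and error handling are mine, not an official pattern), assuming the DSQL-flavored orders table from the previous snippet:

import uuid
import psycopg

def create_order(conn: psycopg.Connection, customer_id: uuid.UUID,
                 total: float) -> uuid.UUID:
    with conn.cursor() as cur:
        # Enforce the "foreign key" ourselves: refuse to write an orphaned order
        cur.execute("SELECT 1 FROM customers WHERE id = %s", (customer_id,))
        if cur.fetchone() is None:
            raise ValueError(f"customer {customer_id} does not exist")

        order_id = uuid.uuid4()
        cur.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (order_id, customer_id, total),
        )
    conn.commit()
    # A concurrent delete of the customer can still slip through; pair this with
    # soft deletes or a cleanup job, because the engine won't stop it anymore.
    return order_id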

3. DDL Operations Are Distributed Transactions (With Consequences)

Schema changes in DSQL aren't simple metadata updates; they're strongly consistent distributed transactions across all cluster nodes. This has profound implications for how you manage schema migrations.

Every DDL operation runs as a distributed transaction, which means it can fail due to concurrency conflicts just like regular transactions. You might see errors like "ERROR: schema has been updated by another transaction, retry as needed (OC001)" during migrations.

The restrictions are strict: one DDL statement per transaction, no mixing DDL with DML, and schema changes can take time to propagate. But here's the superpower: CREATE INDEX ASYNC runs in the background without locking your table.

-- This runs immediately and returns a job_id
CREATE INDEX ASYNC idx_orders_customer 
ON orders(customer_id);

-- Monitor progress
SELECT * FROM sys.jobs WHERE job_id = 'your_job_id';

-- Wait for completion in scripts
SELECT sys.wait_for_job('your_job_id');

Unlike PostgreSQL where CREATE INDEX can lock your table for hours on large datasets, DSQL's async approach lets you add indexes to active tables without downtime. This is genuinely game-changing for large-scale applications (just don't forget to check the job status).
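If you're running migrations from code instead of a psql session, you can wrap the async index build in a small helper. This is a sketch that assumes (as shown above) the statement hands back a job_id and that sys.wait_for_job blocks until the build finishes:

import psycopg

def create_index_and_wait(conn: psycopg.Connection) -> None:
    with conn.cursor() as cur:
        # One DDL statement per transaction, nothing else mixed in
        cur.execute("CREATE INDEX ASYNC idx_orders_customer ON orders(customer_id)")
        job_id = cur.fetchone()[0]  # assumes the job_id comes back as a one-row result
    conn.commit()

    with conn.cursor() as cur:
        # Block the migration until the background build completes
        cur.execute("SELECT sys.wait_for_job(%s)", (job_id,))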

Getting Aurora DSQL Actually Working

Let's get our hands dirty with a real setup: I spent time researching this, so I've decided you need to spend time on it as well. Also, it's important. DSQL requires IAM authentication (no traditional passwords), SSL connections (non-negotiable), and specific client configurations that differ from standard PostgreSQL.

Prerequisites and IAM Setup

First, you need IAM permissions for both cluster management and database connections. Here's the minimum policy for getting started:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dsql:CreateCluster",
                "dsql:DescribeCluster",
                "dsql:ListClusters",
                "dsql:GenerateDbConnectAdminAuthToken"
            ],
            "Resource": "*"
        }
    ]
}

For production environments, you'll also need customer-managed KMS keys and VPC endpoint permissions, but this gets you connected (security hardening comes later).

Creating Your First Cluster

Single-region cluster creation is straightforward via the AWS CLI:

# Create a basic single-region cluster
aws dsql create-cluster \
    --region us-east-1 \
    --deletion-protection-enabled \
    --tags Key=Environment,Value=development

# For multi-region (requires coordination)
aws dsql create-cluster \
    --region us-east-1 \
    --multi-region-properties '{
        "witnessRegion": "us-west-2",
        "clusters": ["arn:aws:dsql:us-east-2:account:cluster/peer-cluster"]
    }'

The cluster creation returns an ARN and endpoint URL. Save that endpoint, you'll need it for every connection (and it's not changing like RDS endpoints do during failovers).

Connecting to DSQL Using IAM Authentication

DSQL connections require IAM tokens instead of passwords, and these tokens have dual expiry mechanics that will trip you up if you haven't used them before. The token expires (default 15 minutes, maximum 1 week), but the database connection also expires after one hour regardless of token validity.

Generate tokens using the AWS CLI:

# Generate admin token (15-minute default)
aws dsql generate-db-connect-admin-auth-token \
    --hostname your-cluster-endpoint.dsql.amazonaws.com \
    --region us-east-1 \
    --expires-in 3600

# Generate token for specific database role
aws dsql generate-db-connect-auth-token \
    --hostname your-cluster-endpoint.dsql.amazonaws.com \
    --region us-east-1 \
    --username myuser \
    --expires-in 3600

Now connect with psql (note the required SSL mode):

PGSSLMODE=require psql \
    --dbname postgres \
    --username admin \
    --host your-cluster-endpoint.dsql.amazonaws.com \
    --password  # Paste the IAM token when prompted

For application connections, here's a Python example using psycopg:

import boto3
import psycopg
from urllib.parse import quote_plus

def get_dsql_connection():
    client = boto3.client('dsql', region_name='us-east-1')
    
    token = client.generate_db_connect_admin_auth_token(
        hostname='your-cluster-endpoint.dsql.amazonaws.com',
        expiresIn=3600
    )
    
    conn_string = (
        f"postgresql://admin:{quote_plus(token)}@"
        f"your-cluster-endpoint.dsql.amazonaws.com:5432/postgres"
        f"?sslmode=require"
    )
    
    return psycopg.connect(conn_string)

Still with me? The connection complexity is real, but it's the price of IAM integration (and honestly it's more secure than password-based auth once you get the patterns down).

Schema Design Patterns

Once connected, your first decision is primary key strategy. For write-heavy tables, avoid sequential integers like the plague, they create hotspots in DSQL's distributed storage:

-- Good: Random distribution for high-write tables
CREATE TABLE events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID,
    event_type VARCHAR(50),
    data JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Bad: Sequential keys create write hotspots
CREATE TABLE events_bad (
    id SERIAL PRIMARY KEY,  -- This will bottleneck
    user_id INTEGER,
    event_type VARCHAR(50),
    data JSONB,
    created_at TIMESTAMP
);

Create indexes asynchronously to avoid blocking operations:

-- Start async index creation
CREATE INDEX ASYNC idx_events_user_type 
ON events(user_id, event_type);

-- Check progress
SELECT job_id, status, progress_percentage 
FROM sys.jobs 
WHERE operation_type = 'CREATE_INDEX';

Key sizing limits apply: primary keys can't exceed 1 KiB total size, individual columns in keys max out at 255 bytes, and primary keys can include up to 8 columns. Plan accordingly (VARCHAR(320) email fields won't work in composite keys).

Where Aurora DSQL Gets Expensive (And How to Avoid It)

DSQL's pricing model using Distributed Processing Units (DPUs) can surprise you. Unlike provisioned databases where you pay for capacity whether you use it or not, DSQL bills for actual computation, I/O, and storage operations. This sounds great until you realize inefficient queries now directly impact your AWS bill instead of just wasting capacity you were already paying for.

Understanding DPU Consumption Patterns

DPUs measure normalized compute units across different operations. Here's what drives costs:

  • WriteDPU: Write operations including inserts, updates, deletes

  • ReadDPU: Query processing for SELECT operations

  • ComputeDPU: CPU-intensive operations like complex joins and aggregations

  • MultiRegionWriteDPU: Additional overhead for multi-region write replication

Monitor these with CloudWatch:

# Set up CloudWatch alarms for DPU spikes
aws cloudwatch put-metric-alarm \
    --alarm-name "DSQL-HighDPU" \
    --alarm-description "Aurora DSQL high DPU usage" \
    --metric-name TotalDPU \
    --namespace AWS/AuroraDSQL \
    --statistic Average \
    --period 300 \
    --threshold 1000 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=ResourceId,Value=your-cluster-id

A single inefficient query scanning millions of rows can generate hundreds of DPUs in minutes. Query optimization isn't just about performance anymore, it's about keeping your AWS bill reasonable (ask me how I learned this).

Primary Key Design Impact on Performance and Cost

Remember how I mentioned UUIDs for write-heavy tables? Here's why it matters for your wallet. Sequential primary keys concentrate writes on single partitions (remember that the D in DSQL stands for Distributed), forcing DSQL to scale those specific partitions while others sit idle. This inefficient resource utilization shows up in your DPU costs.

-- This table design will cost you more
CREATE TABLE audit_log (
    id BIGINT PRIMARY KEY,  -- Sequential writes = hotspot
    action VARCHAR(100),
    timestamp TIMESTAMP,
    user_id UUID
);

-- This distributes load and costs
CREATE TABLE audit_log_optimized (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- Distributes writes
    action VARCHAR(100),
    timestamp TIMESTAMP,
    user_id UUID
);

-- Create covering index for common queries
CREATE INDEX ASYNC idx_audit_user_time 
ON audit_log_optimized(user_id, timestamp);

The covering index is an additional trick that prevents expensive primary key lookups for common query patterns, reducing ReadDPUs significantly.

Connection Pools in Aurora DSQL

Traditional connection pools assume connections stay valid until explicitly closed or the database terminates them. DSQL breaks this assumption with dual expiry: IAM tokens expire, and connections automatically terminate after one hour.

Your connection pool needs to handle both scenarios:

import time
from dataclasses import dataclass
from urllib.parse import quote_plus

import boto3
import psycopg

@dataclass
class DSQLConnection:
    connection: psycopg.Connection
    created_at: float
    token_expires_at: float
    
    def is_valid(self) -> bool:
        now = time.time()
        # Check both token expiry and DSQL's 1-hour connection limit
        return (now < self.token_expires_at and 
                now < self.created_at + 3540)  # 59 minutes

class DSQLConnectionPool:
    def __init__(self, min_connections=2, max_connections=10):
        # min/max are illustrative; check-in/check-out bookkeeping is omitted here
        self.min_connections = min_connections
        self.max_connections = max_connections
        self.connections: list[DSQLConnection] = []
        self.dsql_client = boto3.client('dsql')
    
    def get_connection(self) -> psycopg.Connection:
        # Drop connections whose token or one-hour lifetime has expired
        self.connections = [c for c in self.connections if c.is_valid()]
        
        if not self.connections:
            return self._create_new_connection().connection
        
        return self.connections.pop().connection
    
    def _create_new_connection(self) -> DSQLConnection:
        token = self.dsql_client.generate_db_connect_admin_auth_token(
            hostname='your-cluster-endpoint.dsql.amazonaws.com',
            expiresIn=3600  # 1 hour
        )
        
        conn = psycopg.connect(
            f"postgresql://admin:{quote_plus(token)}@"
            f"your-cluster-endpoint.dsql.amazonaws.com:5432/postgres"
            f"?sslmode=require"
        )
        
        now = time.time()
        return DSQLConnection(connection=conn, created_at=now,
                              token_expires_at=now + 3600)

This complexity is manageable with the right abstractions, but it's definitely not plug-and-play with existing connection pooling libraries (most will need configuration adjustments at minimum).

Query Optimization for OCC and Transaction Limits

DSQL's transaction limits (10 MiB data modified, 3,000 rows affected) require batching strategies for large operations:

def batch_insert_with_retry(records, batch_size=1000):
    # Keep each batch well under DSQL's 3,000-row / 10 MiB transaction limits
    for i in range(0, len(records), batch_size):
        batch = records[i:i+batch_size]
        
        # Same OC/40001 retry pattern as before (Python sketch below)
        execute_with_retry(lambda: insert_batch(batch))

def insert_batch(records):
    with get_connection() as conn:  # e.g. a connection from the pool above
        with conn.cursor() as cur:
            # Use COPY for efficient bulk inserts
            with cur.copy("COPY events (user_id, event_type, data) FROM STDIN") as copy:
                for record in records:
                    copy.write_row([record.user_id, record.event_type, record.data])

The COPY command is significantly more DPU-efficient than individual INSERT statements for bulk operations, but you still need to respect the transaction size limits.
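That batching code leans on an execute_with_retry helper I only showed in Java earlier. Here's a minimal Python sketch of the same OC/40001 retry logic; treat it as a starting point, not a hardened library:

import random
import time

import psycopg

def execute_with_retry(transaction_logic, max_retries: int = 3):
    attempt = 0
    while True:
        try:
            return transaction_logic()
        except psycopg.Error as e:
            state = e.sqlstate or ""
            retryable = state.startswith("OC") or state == "40001"
            attempt += 1
            if not retryable or attempt >= max_retries:
                raise  # non-retryable, or out of attempts
            # Exponential backoff with jitter, same as the Java example
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)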

The Missing Pieces You Need to Know

Let's talk about the operational aspects AWS doesn't emphasize in the marketing materials. DSQL is powerful, but it requires new operational patterns and monitoring approaches.

Monitoring and Alerting That Actually Matters

Standard PostgreSQL monitoring tools won't help you with DSQL 🫠. You need CloudWatch dashboards focused on DSQL-specific metrics:

# Create custom CloudWatch dashboard
aws cloudwatch put-dashboard \
    --dashboard-name "Aurora-DSQL-Operations" \
    --dashboard-body '{
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/AuroraDSQL", "TotalDPU", "ResourceId", "your-cluster-id"],
                        ["AWS/AuroraDSQL", "OccConflicts", "ResourceId", "your-cluster-id"],
                        ["AWS/AuroraDSQL", "QueryTimeouts", "ResourceId", "your-cluster-id"]
                    ],
                    "period": 300,
                    "stat": "Average",
                    "region": "us-east-1",
                    "title": "DSQL Performance Metrics"
                }
            }
        ]
    }'

Critical metrics to monitor:

  • OccConflicts: High values indicate data contention requiring application logic review

  • QueryTimeouts: Suggests queries hitting DSQL's transaction time limits

  • TotalDPU: Your primary cost indicator and performance baseline

  • CommitLatency: Increased latency can indicate replication or storage issues

Set aggressive alarms on these metrics. OCC conflicts above 5% of total transactions usually indicate design problems that will only get worse under load.

Backup and Recovery Strategy (It's Different)

DSQL integrates with AWS Backup, not native database snapshots. This means backup policies, retention, and restoration work through the AWS Backup service:

# Create backup plan for DSQL
aws backup create-backup-plan \
    --backup-plan '{
        "BackupPlanName": "dsql-daily-backups",
        "Rules": [{
            "RuleName": "daily-backup",
            "TargetBackupVault": "default",
            "ScheduleExpression": "cron(0 2 * * ? *)",
            "Lifecycle": {
                "DeleteAfterDays": 30
            }
        }]
    }'

Multi-region cluster restoration requires coordination between AWS Backup and DSQL APIs, and it's currently only supported via CLI (not the console). Plan your disaster recovery procedures accordingly, and test them. The restore process creates new clusters rather than in-place restoration, so test a lot.

Security Configuration for Production

VPC endpoints are essential for private production access to any service that doesn't live in your VPC, and Aurora DSQL (unlike regular Aurora) doesn't go into your VPC. DSQL requires separate endpoints for management operations and database connections:

# Create VPC endpoint for database connections
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-12345678 \
    --service-name $(aws dsql get-vpc-endpoint-service-name \
                    --cluster-identifier your-cluster-arn \
                    --query ServiceName --output text) \
    --vpc-endpoint-type Interface \
    --subnet-ids subnet-12345678 subnet-87654321 \
    --security-group-ids sg-12345678 \
    --private-dns-enabled

Migration Planning Reality Check

Here's the hard truth: DSQL isn't a drop-in replacement for PostgreSQL, despite the compatibility marketing. Successful migrations require systematic assessment and application refactoring.

Assessment framework:

  1. Dependency Analysis: Catalog usage of unsupported features (foreign keys, triggers, sequences, PL/pgSQL); see the inventory sketch after this list

  2. Transaction Pattern Review: Identify long-running transactions, mixed DDL/DML operations, large batch operations

  3. Query Complexity Evaluation: Find queries that might hit transaction time limits or benefit from different indexing strategies

  4. Connection Management Audit: Assess current pooling strategies and connection lifecycle management
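For step 1, a quick inventory of the source PostgreSQL database (not DSQL) gets you most of the way. This is a rough sketch using standard PostgreSQL catalogs; the queries and output format are mine, so adapt them to your schema conventions:

import psycopg

CHECKS = {
    "foreign keys": """
        SELECT count(*) FROM information_schema.table_constraints
        WHERE constraint_type = 'FOREIGN KEY'""",
    "triggers": "SELECT count(*) FROM information_schema.triggers",
    "sequences": "SELECT count(*) FROM information_schema.sequences",
    "plpgsql functions": """
        SELECT count(*) FROM pg_proc p
        JOIN pg_language l ON l.oid = p.prolang
        JOIN pg_namespace n ON n.oid = p.pronamespace
        WHERE l.lanname = 'plpgsql'
          AND n.nspname NOT IN ('pg_catalog', 'information_schema')""",
}

def assess_source_database(dsn: str) -> dict:
    # Anything non-zero needs an application-layer replacement plan before migrating
    results = {}
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        for feature, query in CHECKS.items():
            cur.execute(query)
            results[feature] = cur.fetchone()[0]
    return results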

For applications heavily dependent on PostgreSQL-specific features, consider DSQL for new microservices rather than migrating existing monoliths. The serverless scaling and operational simplicity make it excellent for greenfield projects designed with its constraints in mind.

The learning curve is real, but once you adjust to the OCC model and embrace the serverless operational paradigm, DSQL delivers on its scalability promises. Just don't expect it to be PostgreSQL with better marketing (because it's not, and that's actually the point).
