
Partitions, Sharding, and Split-for-Heat in DynamoDB

It's 3 am, your phone buzzes. CloudWatch alarm: DynamoDB throttling. You check the metrics: your table has 10,000 Write Capacity Units (WCUs) provisioned, but you're barely using 3,000. The graphs show plenty of headroom. So why are requests getting rejected?

Hopefully this doesn't happen to you (especially not at 3 am). But if it does, you should know the problem isn't your table's total capacity, but that every DynamoDB table is split into physical partitions, and each partition has a hard ceiling of 1,000 WCUs per second. When one partition key gets hammered, that specific partition throttles, even if your other 99 partitions are sitting idle.

Understanding this partition-level reality is the difference between a system that scales smoothly under load and one that falls apart during your biggest traffic spike of the year. Let's dig into it.

The 1,000 WCU Ceiling You Can't See

DynamoDB's architecture is built on physical partitions. These are the fundamental storage and throughput units that actually hold your data. Each partition is subject to three hard limits:

  • 1,000 Write Capacity Units per second

  • 3,000 Read Capacity Units per second

  • 10 GB of storage

These limits are per partition, not per table. When you provision 10,000 WCUs for your table, DynamoDB distributes that capacity across multiple partitions. If your table has 10 physical partitions, each partition gets roughly 1,000 WCUs to work with (the distribution isn't always perfectly even, but that's the general idea).
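There's no API that tells you how many partitions a table has, but a rough back-of-the-envelope estimate based on the per-partition limits above (an approximation, not an official guarantee) looks something like this:

import math

def estimate_partitions(rcu: int, wcu: int, storage_gb: float) -> int:
    """
    Rough estimate of physical partitions, assuming the per-partition
    limits of 3,000 RCUs, 1,000 WCUs, and 10 GB. DynamoDB doesn't expose
    the real number, so treat this as an approximation only.
    """
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_storage = math.ceil(storage_gb / 10)
    return max(by_throughput, by_storage, 1)

# A table with 10,000 WCUs and little data lands at roughly 10 partitions,
# so each partition gets about 1,000 WCUs of the provisioned total
print(estimate_partitions(rcu=0, wcu=10_000, storage_gb=5))  # -> 10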

Your partition key (PK) determines which physical partition stores each item. If your application writes heavily to a small number of partition keys (maybe you're tracking views on a viral post, or incrementing a global counter, or processing a massive batch of orders for a single customer) those writes all target the same partition. Once that partition hits its 1,000 WCU ceiling, requests start getting throttled with ProvisionedThroughputExceededException errors.

If your CloudWatch metrics show the table is only using 30% of its provisioned capacity, that's because the other partitions are fine: they're simply not receiving traffic. But DynamoDB can't magically move capacity from an idle partition to a hot one in real time (it can over time, we'll get to that). The 1,000 WCU limit is a physical constraint of the partition itself.

This is called a hot partition, and it's the most common scaling bottleneck in DynamoDB. The table-level capacity you see in the console is essentially a fiction; what matters is how your workload is distributed across partition keys.

Let me give you a concrete example. Imagine you're building a real-time analytics system that tracks page views. You decide to use the PageID as your partition key. Sounds reasonable, right? Most pages get modest traffic. But then one article goes viral on social media, and that single PageID is now receiving 5,000 writes per second. The partition holding that key can only handle 1,000 writes per second, so the other 4,000 writes get throttled, even though your table has 10,000 WCUs provisioned and the vast majority of that capacity is unused. The table isn't the bottleneck; the partition is.

To make this worse, there's no CloudWatch metric that directly shows you partition-level consumption (AWS added Contributor Insights to help with this, but it's not real-time and requires explicit enablement). You have to infer hot partitions from patterns in your throttling errors and your application's access patterns.
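If you suspect a hot key, Contributor Insights is the closest thing you get to key-level visibility. Here's a minimal sketch of enabling it with boto3 (the table name is just a placeholder):

import boto3

dynamodb = boto3.client('dynamodb')

# Surface the most frequently accessed and most throttled partition keys
dynamodb.update_contributor_insights(
    TableName='Analytics',  # placeholder table name
    ContributorInsightsAction='ENABLE'
)
# Pass IndexName as well if you want the same visibility on a GSI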

The partition distribution also explains why certain write patterns create more problems than others. If you're loading data where the partition keys are sequential or heavily time-based, like using auto-incrementing order IDs or timestamps as partition keys, you create what's called a "rolling hot partition." All your writes concentrate on the newest partition key values, hammering a single partition continuously while the others sit idle. DynamoDB can't split the load because there's only one active write boundary.

Global Secondary Index Propagation

Here's a related complexity that may catch you off guard: Global Secondary Index (GSI) throttling propagates to the base table. When you write an item, DynamoDB must update the base table and every GSI that includes that item. The total WCU cost is the sum of the base table write plus all GSI updates (each GSI write is calculated based on the projected attribute size, rounded up to 1 KB).

If a GSI is under-provisioned and can't handle the write volume, it throttles. When a GSI throttles, it blocks the base table write even if the base table has plenty of unused capacity. This is why you might see throttling on a table that appears to have headroom: you'd need to look at the GSI metrics. Tip: Always monitor ConsumedWriteCapacityUnits for every GSI, not just the base table.
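As a sketch of what that monitoring might look like (table and index names here are just examples), you can pull the per-GSI metric from CloudWatch using the GlobalSecondaryIndexName dimension:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def gsi_consumed_wcu(table_name: str, index_name: str) -> list:
    """Fetch per-minute consumed WCUs for a GSI over the last hour."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/DynamoDB',
        MetricName='ConsumedWriteCapacityUnits',
        Dimensions=[
            {'Name': 'TableName', 'Value': table_name},
            # This dimension is what the table-level view doesn't show you
            {'Name': 'GlobalSecondaryIndexName', 'Value': index_name},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=['Sum'],
    )
    return sorted(response['Datapoints'], key=lambda d: d['Timestamp'])

# Example call with hypothetical names
datapoints = gsi_consumed_wcu('Analytics', 'GSI1')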

DynamoDB Adaptive Capacity

DynamoDB has an internal mechanism called Adaptive Capacity that tries to handle uneven access patterns automatically. When Adaptive Capacity detects that a partition is receiving disproportionate traffic, it kicks off a process called Split-for-Heat.

DynamoDB continuously monitors partition usage. When it identifies a partition that's consistently hitting or exceeding its throughput limits, it automatically splits that partition into two new partitions. The items are redistributed based on their sort key (SK) values. This split effectively doubles the available write capacity for that segment of data: two partitions means 2,000 WCUs instead of 1,000.

DynamoDB also handles Split-for-Size. When a partition hits the 10 GB hard limit for storage, it's automatically split regardless of throughput concerns. Between these two mechanisms, DynamoDB can theoretically scale to handle any workload without manual intervention.

The problem is that this takes several minutes to complete.

In case you forgot, CloudWatch metrics are aggregated at one-minute intervals. Auto-scaling (if you're using provisioned mode) requires two consecutive minutes of exceeding the target utilization threshold before it even triggers a scaling event. Then the actual UpdateTable operation takes several minutes to complete as DynamoDB provisions additional capacity and potentially splits partitions. During all of this, your application is throttling.
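For reference, this is roughly what that auto-scaling setup looks like under the hood: a target-tracking policy registered through Application Auto Scaling (table name, capacity limits, and the 70% target are example values):

import boto3

autoscaling = boto3.client('application-autoscaling')

# Register the table's write capacity as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace='dynamodb',
    ResourceId='table/Analytics',  # example table name
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    MinCapacity=1000,
    MaxCapacity=20000,
)

# Scale out when consumed capacity exceeds 70% of provisioned capacity
autoscaling.put_scaling_policy(
    PolicyName='analytics-wcu-target-tracking',
    ServiceNamespace='dynamodb',
    ResourceId='table/Analytics',
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'DynamoDBWriteCapacityUtilization'
        },
    },
)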

Burst capacity provides some buffer here. DynamoDB banks unused throughput for up to 300 seconds. For every second your table doesn't use its full provisioned capacity, that unused capacity gets saved in a credits bucket. If you provision 1,000 WCUs but only use 500, you're banking 500 WCUs per second. When a sudden spike hits, DynamoDB can pull from this banked capacity to handle traffic that temporarily exceeds your provisioned limit. This is why you might see brief periods where consumption exceeds provisioned capacity without throttling: you're spending your burst credits. But burst capacity is designed to handle short spikes, and it runs out quickly under sustained load.
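To put rough numbers on it, here's a back-of-the-envelope sketch for a single partition with 1,000 WCUs provisioned and a hypothetical 2,000 WCU/s spike:

# Back-of-the-envelope burst capacity math for one partition
provisioned_wcu = 1_000
steady_usage = 500                        # WCUs actually consumed per second
banked_per_second = provisioned_wcu - steady_usage

# DynamoDB banks up to 300 seconds of unused throughput
burst_credits = banked_per_second * 300   # 150,000 WCUs banked

# A spike needing 2,000 WCU/s overdraws the partition by 1,000 WCU/s
spike_overdraft = 2_000 - provisioned_wcu
print(burst_credits / spike_overdraft)    # ~150 seconds before throttling starts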

For a sustained traffic spike (a product launch, a viral moment, a coordinated attack) this lag time translates directly into dropped requests, failed transactions, and angry customers. Adaptive Capacity will eventually save you, but "eventually" can mean 5-10 minutes of degraded performance (and please tell me you're doing load shedding, otherwise this can turn into a cascading failure).

There are also failure modes where Adaptive Capacity can't help at all. If your table has a Local Secondary Index (LSI), DynamoDB can't split a hot partition by sort key, because LSIs require all items sharing a partition key value (plus their LSI projections) to live together in a single item collection, capped at 10 GB. Splitting would break that contiguous storage requirement, so it simply doesn't happen. In my experience it's very rare that you'd hit this limit, but it's important to know it's there.

Design Keys That Don't Create Heat

Before we talk about advanced patterns like write sharding, let's cover designing partition keys that naturally avoid hot spots. The fundamental principle is high cardinality with wide dispersion.

Your partition key should have many possible unique values (high cardinality), and your application's access pattern should distribute writes roughly evenly across those values (wide dispersion). If both conditions are met, DynamoDB can spread your workload across many partitions, and no single partition becomes a bottleneck.

Here are the anti-patterns that create hot partitions:

  • Sequential or auto-incrementing IDs: Using ORDER#1, ORDER#2, ORDER#3 as partition keys means all new writes target the highest number, creating a rolling hot spot.

  • Timestamp-based partition keys: TIMESTAMP#2025-09-29T15:30:00 suffers the same problem: writes cluster around the current timestamp.

  • Low-cardinality attributes: Using Region (when you only have 3 regions) as a partition key concentrates writes on whichever region is most active.

Good partition key choices include:

  • UUIDs or randomly generated IDs: USER#a7f3c891-4d2e-4b29-8f34-d9e21ba7c04f provides excellent distribution. Every user gets a unique partition key with no clustering.

  • Hashed values: If you must use sequential IDs, hash them first. ORDER#{md5(order_id)} spreads sequential orders across the hash space.

  • Composite keys with random components: For time-series data, append a random shard suffix: SENSOR#{sensor_id}#{random_int} (we'll explore this pattern in depth shortly).

Keep in mind that even UUIDs can create temporary hot spots if your write pattern is heavily time-based. UUIDv1 embeds timestamp information, which can cluster writes. UUIDv4 (purely random) is better for high-write scenarios.

Important: Don't use auto-incrementing integers as your partition key for high-volume writes. The newest integer value will always be the hot spot.

Here's what good partition key design looks like in code:

import uuid
import hashlib
import random

# Bad: Sequential, creates hot partition at boundary
def create_order_bad(order_counter):
    pk = f"ORDER#{order_counter}"
    return pk

# Good: Random UUID provides wide dispersion  
def create_order_good():
    pk = f"ORDER#{uuid.uuid4()}"
    return pk

# Better for time-series: Add shard suffix
def create_sensor_reading(sensor_id):
    shard = random.randint(0, 9)  # 10 shards
    pk = f"SENSOR#{sensor_id}#{shard}"
    return pk

# Good: Hash sequential IDs to spread them
def create_order_hashed(order_id):
    hash_val = hashlib.md5(str(order_id).encode()).hexdigest()[:8]
    pk = f"ORDER#{hash_val}"
    return pk

High-cardinality partition keys prevent hot spots for most workloads. But what if your use case inherently concentrates writes? What if you're tracking global counters, processing a viral post's engagement metrics, or handling a flash sale where thousands of customers are buying the same product SKU?

That's when partition key design alone isn't enough, and we need to throw some "sharding magic" into the mix.

Write Sharding in DynamoDB

If you know high-volume traffic is coming, you can't afford to wait out Adaptive Capacity's multi-minute lag. Ideally your keys naturally spread writes, but if they don't, you need a way to manually control how write operations are spread. That technique is called write sharding.

The idea is that instead of using a single partition key for a high-contention entity, you create N synthetic partition keys by appending a shard identifier. When writing, you randomly select one of the N shards. This spreads the write load horizontally across N physical partitions, giving you N * 1,000 WCUs of capacity instead of being capped at 1,000 for that entity.

The tradeoff is that reads become more complex, since to retrieve the complete data for that entity you now have to query all N shards in parallel (scatter) and then aggregate the results on the client side (gather). You're gaining write throughput, but at the cost of read complexity.

Implementing Write Sharding

import random
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Analytics')

SHARD_COUNT = 10  # Gives us 10 * 1,000 = 10,000 WCU capacity

def increment_page_views(page_id: str, count: int = 1):
    """
    Increment page view counter using write sharding.
    Randomly selects one of N shards to distribute write load.
    """
    # Randomly select a shard (0 to SHARD_COUNT - 1)
    shard_id = random.randint(0, SHARD_COUNT - 1)
    
    # Create sharded partition key
    pk = f"PAGE#{page_id}#SHARD#{shard_id:02d}"
    
    # Atomic increment using UpdateExpression
    table.update_item(
        Key={'PK': pk, 'SK': 'VIEWS'},
        UpdateExpression='ADD view_count :inc',
        ExpressionAttributeValues={':inc': count}
    )

With this design, each write hits a random shard among the 10 we created. If you're processing 5,000 writes per second and they're evenly distributed across 10 shards, each shard receives roughly 500 writes per second, which should fit comfortably within the 1,000 WCUs each partition has (remember: each write consumes 1 WCU per 1 KB of data written, rounded up). If a shard still can't keep up, you can either increase the shard count, or rely on Burst Capacity (brief consumption above 1,000 WCUs) and Adaptive Capacity (borrowing some capacity from other partitions of the same table) while you wait for Split-for-Heat (DynamoDB splitting the hot shard into multiple physical partitions behind the scenes) to catch up.

Scatter-Gather Query

Reading the total requires querying all shards:

import concurrent.futures
from typing import Dict

def get_page_view_total(page_id: str) -> int:
    """
    Retrieve total page views across all shards.
    Uses parallel queries (scatter) then sums results (gather).
    """
    
    def query_single_shard(shard_id: int) -> int:
        """Query one shard and return its count"""
        pk = f"PAGE#{page_id}#SHARD#{shard_id:02d}"
        
        try:
            response = table.get_item(
                Key={'PK': pk, 'SK': 'VIEWS'}
            )
            # DynamoDB returns numbers as Decimal, so cast to int
            return int(response.get('Item', {}).get('view_count', 0))
        except Exception as e:
            # Log error but don't fail entire operation
            print(f"Error querying shard {shard_id}: {e}")
            return 0
    
    # Scatter: Execute parallel queries to all shards
    with concurrent.futures.ThreadPoolExecutor(max_workers=SHARD_COUNT) as executor:
        futures = [executor.submit(query_single_shard, i) 
                   for i in range(SHARD_COUNT)]
        results = [f.result() for f in concurrent.futures.as_completed(futures)]
    
    # Gather: Sum all shard counts
    total = sum(results)
    return total

The Economic Tradeoff of Sharding

Sharding also affects your DynamoDB bill, not just your code complexity (and your sanity). Reading a sharded counter requires N queries instead of 1, so if you have 10 shards and each query reads a 0.1 KB item, you're consuming 10 RCUs instead of 1 per strongly consistent read (or 5 instead of 0.5 with eventually consistent reads), even though the total data read is well below the 4 KB limit for a single RCU.

If your counter is read 1,000 times per second with eventually consistent reads, sharding increases your read capacity consumption from 500 RCUs/sec to 5,000 RCUs/sec. At $0.00013 per RCU-hour in provisioned mode, that's roughly $475/month versus $47/month in read costs alone. Ouch 💸.
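The math behind those numbers, as a quick sketch (eventually consistent reads, items well under 4 KB, and the per-RCU-hour rate quoted above):

SHARDS = 10
READS_PER_SECOND = 1_000
RCU_PER_ITEM = 0.5             # eventually consistent read, item <= 4 KB
HOURS_PER_MONTH = 730
PRICE_PER_RCU_HOUR = 0.00013   # provisioned mode rate from above

unsharded_rcu = READS_PER_SECOND * RCU_PER_ITEM          # 500 RCU/s
sharded_rcu = unsharded_rcu * SHARDS                     # 5,000 RCU/s

monthly_cost = sharded_rcu * PRICE_PER_RCU_HOUR * HOURS_PER_MONTH
print(round(monthly_cost))     # ~475 per month, vs ~47 unsharded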

A counter isn't the best example, since in that case you usually pre-compute it (i.e. keep the counter as a separate entity, and update it with strong consistency every time you write new data). But overall it serves to show how much sharding can cost you. Still, it's usually better than getting throttled!

Choosing the Right Shard Count

For most use cases, I suggest you start with N=10 shards. As traffic grows, monitor your CloudWatch metrics (specifically WriteThrottleEvents and ThrottledRequests, plus ConsumedWriteCapacityUnits per shard if you've instrumented that). If you're still seeing throttles, increase N. If your read performance is suffering and throttling is rare, you might be over-sharded and can reduce N. The caveat is that increasing N is pretty simple: you change the value in the code and start writing to new shards 11, 12, and so on. Decreasing N means rewriting every affected item, since you need to replace the value of {shard} in the PK ENTITY#{shard}. And you can't just overwrite that attribute: key attributes can't be updated in place, so you need to atomically create a new item and delete the old one (see the sketch below). The script is pretty easy to write, but it costs 1 read, 1 write, and 1 delete in a transaction per item. Ouch again 💸💸.
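Here's a hedged sketch of that re-sharding step, using a transaction so an item is never duplicated or lost mid-move (the table name and key shape follow the examples in this post; reading the items to move is left out):

import boto3

dynamodb = boto3.client('dynamodb')
TABLE_NAME = 'Analytics'  # example table name from earlier

def move_item_to_new_shard(old_key: dict, new_pk: str, item: dict):
    """
    Re-shard one item: write it under the new PK and delete the old copy
    in a single transaction. 'item' is the full item (low-level client
    format) you already read from the old location.
    """
    new_item = dict(item)
    new_item['PK'] = {'S': new_pk}

    dynamodb.transact_write_items(
        TransactItems=[
            {'Put': {'TableName': TABLE_NAME, 'Item': new_item}},
            {'Delete': {'TableName': TABLE_NAME, 'Key': old_key}},
        ]
    )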

One more consideration: you don't have to shard your entire table. Shard only the hot entities. If you have a leaderboard table tracking scores for millions of users, but only the top 100 users receive heavy write traffic, shard those 100 partition keys. The rest can use normal, single-partition-key design.
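One way to express that in code, as a sketch (the set of hot keys and the key format are assumptions you'd adapt to your own table):

import random

HOT_ENTITY_IDS = {'user-123', 'user-456'}  # hypothetical known-hot keys
SHARD_COUNT = 10

def build_pk(entity_id: str) -> str:
    """Shard only the entities known to be hot; leave the rest alone."""
    if entity_id in HOT_ENTITY_IDS:
        shard = random.randint(0, SHARD_COUNT - 1)
        return f"USER#{entity_id}#SHARD#{shard:02d}"
    return f"USER#{entity_id}"

Reads then follow the same branch: scatter-gather across the shards for hot keys, a single GetItem for everything else.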

Conclusion

Your table capacity doesn't matter if one partition key gets all the traffic. Physical partitions max out at 1,000 WCUs, regardless of table capacity. Adaptive Capacity helps by automatically splitting hot partitions, but it takes several minutes to kick in, so you need to plan for spikes before they happen.

The first step is to design partition keys well. This is the most critical thing in DynamoDB, actually, so I suggest you read up on DynamoDB Design and DynamoDB Transactions. Understanding How DynamoDB Scales is also important. Then, if you need to prepare for traffic spikes, shard your hot entities.

DynamoDB scales incredibly well, but only if you design for how it actually works, not how you wish it worked (or worse, for how relational databases work). Which is why I somehow end up writing so much about DynamoDB.
