Amazon Bedrock Prompt Caching

I spend a lot of time working with large language models in production, and one persistent problem is that they're expensive 💸. Prompt caching can help a lot with that, especially when you're reusing context across a conversation (something we take for granted in ChatGPT or Claude.ai, but that is actually expensive).

In this article, we're going deep into Amazon Bedrock prompt caching and how to use it. We'll talk about the architecture, checkpoint mechanics, cache lifetime management, API integration, and the economics of it all.

Technical Architecture of Bedrock Prompt Caching

Prompt caching isn't just a simple text storage mechanism. There's a whole infrastructure dedicated to preserving neural network states.

When a large language model processes text, it isn't just reading characters; it's building an internal representation of everything it reads. This includes attention patterns, token relationships, and other neural activations. Normally, when you send a prompt, the model rebuilds this entire representation from scratch every time, even if 90% of the prompt is identical to previous requests.

What actually happens with Bedrock prompt caching is that the service captures this internal neural state at specific points (checkpoints) and stores it in an ephemeral cache. This is fundamentally more complex than just storing the text string; it's preserving the actual computed representation inside the model.

The cache itself lives in AWS-managed infrastructure, completely isolated within service boundaries to maintain security between different AWS accounts. Your cached content isn't accessible to other users, and their cached content isn't accessible to you. The architecture maintains security by keeping the cached state tied to your specific Bedrock resources and API credentials.

The caching layer integrates directly with Bedrock's execution environment. If you're familiar with how AWS Lambda works, you can think of a similar execution model. Just as Lambda runs on AWS Lambda Workers (which are essentially EC2 instances), Bedrock also runs on compute infrastructure optimized for inference. The prompt caching system sits between your API calls and this infrastructure, intercepting and storing neural states when instructed.

From a performance standpoint, setting up a cache creates a small overhead on first write. But that investment pays off dramatically on subsequent reads, with massive reductions in processing time and cost. The whole system is optimized for bursts of activity with similar prompt content, which fits the pattern of many LLM applications like conversation agents, document analysis tools, coding assistants, or anything where you keep asking things about something you already shared.

Cache Checkpoint Mechanics

Cache checkpoints are the actual mechanism that makes prompt caching work. But what exactly is a checkpoint? It's a specific position in a prompt where Bedrock saves the model's entire internal state up to that point. Think of it as a bookmark in the model's thought process: the model can jump back to that exact state later without reprocessing everything that came before.

You can't just place checkpoints anywhere you want. They follow specific token thresholds that vary by model. For Anthropic's Claude 3.5 models, you need at least about 1,024 tokens of prompt before you can set the first checkpoint. Why this minimum threshold? Because the overhead of caching very small prompts would outweigh the benefits; there wouldn't be enough tokens saved to justify the cache management cost.
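If you want a quick sanity check before adding a checkpoint, something like the sketch below is usually enough. To be clear, this is an approximation I'd use, not part of the Bedrock API: the ~4 characters per token ratio is a rough English-text heuristic, and the threshold table only reflects the 1,024-token figure mentioned above.

# Rough heuristic to decide whether a prefix is worth a cache checkpoint.
# The real token count comes from the model's tokenizer (or from the usage
# metrics in the API response); ~4 characters per token is just a ballpark.
MIN_CACHEABLE_TOKENS = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": 1024,
}

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def worth_caching(prefix: str, model_id: str) -> bool:
    threshold = MIN_CACHEABLE_TOKENS.get(model_id, 1024)
    return estimate_tokens(prefix) >= threshold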

Different models have different checkpoint configurations:

  • Claude 3.5 allows up to 4 checkpoints per conversation

  • Amazon's smaller Nova models typically support just 1 checkpoint

  • The placement options vary by model: some allow checkpoints in system messages, user messages, and tool sections, while others restrict placement (see the system-prompt sketch below)
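For models that do allow it, placing a checkpoint in the system prompt looks like this with the Converse API. This is just a sketch: VERY_LONG_SYSTEM_INSTRUCTIONS is a placeholder for your own (sufficiently long) static instructions.

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    # Static instructions first, then a cache checkpoint
    system=[
        {"text": VERY_LONG_SYSTEM_INSTRUCTIONS},
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {"role": "user", "content": [{"text": "First user question goes here."}]},
    ],
)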

The checkpoint creation process is pretty fascinating if you understand what's happening during model inference, which is why I'm diving so deep into this stuff (I had a lot of fun researching it!). When Bedrock identifies a cache point in your prompt that meets the token threshold, it does something akin to taking a snapshot of the entire neural network's state at that moment. This state captures all the attention patterns, embeddings, and internal representations that the model has built up to that point. This complex state is extracted and stored with a unique identifier associated with your AWS account and the specific model.

I should emphasize one technical limitation: checkpoints are tied to the exact prefix they follow. If you change even a single character before the checkpoint, the cache can't be used. Why? The model's internal state depends on the precise sequence of tokens it has processed; any change, no matter how small, results in a different neural state. This strictness is necessary because the model's computations (though not its sampled output) are deterministic given the input.

Another important detail is how checkpoints interact with the model's context window. Cached content absolutely still counts toward the model's total context window. If you cache a 2,000-token document and the model has a 100,000-token context window, you still have only 98,000 tokens available for additional content before the model starts “forgetting” things and quality falls off a cliff. The cache doesn't magically expand the context window; it just prevents redundant processing of the same tokens, and prevents you from having to pay for that redundant processing.

Cache Lifetime and Management

In Bedrock's implementation, the prompt cache is ephemeral by design, with a default Time To Live (TTL) of 5 minutes. The TTL timer works on a sliding window basis: each time the cache is successfully hit (used), the 5-minute timer resets. This keeps frequently used cache entries alive while allowing rarely used ones to expire naturally. Why 5 minutes specifically? The AWS team likely found it to be the sweet spot that covers most interactive use cases without wasting storage on infrequently accessed data. I'll ask them if I get the chance.

When a cache expires, the stored model state is discarded completely. There's no way to retrieve it after expiration, and no background persistence occurs. If you need the same content again after expiration, you'll need to reprocess the full prompt and recreate the cache from scratch.

What's particularly interesting about Amazon's implementation is that it's clearly optimized for interactive workloads rather than long-running processes. The 5-minute window aligns with typical human interaction patterns in conversation systems and document analysis workflows, where bursts of activity occur within short timeframes. For systems with longer idle periods between related requests, you'll need to implement strategies to manage cache retention, such as periodic "keepalive" requests or prompt restructuring to minimize the impact of cache misses. Yes, both of these are real things we do in prod.
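For what it's worth, a keepalive can be as small as the sketch below. This is an assumption-heavy sketch rather than a recommendation: prefix_content stands for the list of content blocks (ending in a cachePoint) you want to keep warm, bedrock is a bedrock-runtime client, and every ping still costs a discounted cache read plus one output token, so only run it while you expect the user to come back soon.

import threading

def keep_cache_warm(prefix_content, model_id, interval=4 * 60):
    # Re-send the cached prefix every ~4 minutes so the 5-minute sliding TTL resets
    def ping():
        bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user",
                       "content": prefix_content + [{"text": "Reply with OK."}]}],
            inferenceConfig={"maxTokens": 1},
        )
        timer = threading.Timer(interval, ping)
        timer.daemon = True  # don't keep the process alive just for keepalives
        timer.start()

    ping()  # in production you'd also want a stop condition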

Bedrock API Integration

Understanding how to integrate prompt caching with Bedrock's APIs is where the rubber meets the road. There are three primary integration points: the Converse API for multi-turn conversations, the InvokeModel API for single-turn completions, and automatic integration with Bedrock Agents.

For the Converse API, which handles chat-style interactions, you mark cache points within the message structure. Here's a Python example:

import boto3

bedrock = boto3.client("bedrock-runtime")
document_text = "Very long reference document that you don't want to process repeatedly..." 

messages = [{"role": "user", "content": []}]
# Add the long context to the user message
messages[0]["content"].append({"text": document_text})
# Mark cache checkpoint after the large context
messages[0]["content"].append({"cachePoint": {"type": "default"}})
# Add the user's actual question after the checkpoint
messages[0]["content"].append({"text": "What are the key points in this document?"})

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-v2:0",
    messages=messages
)

This structure tells Bedrock to process document_text once, cache the resulting model state, and then process the question. On subsequent requests that start with the exact same prefix (document plus cache point), Bedrock reuses the cached state instead of reprocessing the document. The beauty of this approach is that it integrates directly with the conversation structure without requiring special caching APIs.
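To make the reuse explicit, a follow-up call in the same session could look like this. Note that the document and the cachePoint are sent again, byte for byte; only the new question (an example I made up here) changes, and the first assistant reply is included to keep the conversation valid.

# Second request: identical prefix (document + cachePoint), new question
followup_messages = [
    {"role": "user", "content": [
        {"text": document_text},                 # identical to the first call
        {"cachePoint": {"type": "default"}},     # same checkpoint position
        {"text": "What are the key points in this document?"},
    ]},
    {"role": "assistant", "content": response["output"]["message"]["content"]},
    {"role": "user", "content": [{"text": "Summarize the risks in two sentences."}]},
]

followup = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=followup_messages,
)

# Expect cacheReadInputTokens > 0 here if the cache was hit
print(followup["usage"])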

For the InvokeModel API, which handles single-turn prompts, caching is enabled by including the appropriate fields in the request body. The exact structure depends on the model being used. For Anthropic models, it involves adding a cache_control field to the content block you want cached:

import json

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        # cache_control marks this content block as cacheable
        "system": [{"type": "text", "text": document_text,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user",
                      "content": "What are the key points in this document?"}]
    })
)

Here's something that's not immediately obvious: to verify that caching is working properly, examine the response metadata. Bedrock includes cache-related metrics in the usage data (the field names below are the ones returned by the Converse API):

# Extract usage metrics from a Converse API response
usage = response["usage"]
cached_read_tokens = usage.get("cacheReadInputTokens", 0)
cached_write_tokens = usage.get("cacheWriteInputTokens", 0)

print(f"Tokens read from cache: {cached_read_tokens}")
print(f"Tokens written to cache: {cached_write_tokens}")

On the first request you'll see a high value for cacheWriteInputTokens (the tokens being cached) and zero for cacheReadInputTokens. On subsequent requests that use the same cached content, you'll see the opposite pattern: high cacheReadInputTokens and zero cacheWriteInputTokens. This provides a clear signal that the caching system is working as expected.

For Bedrock Agents, enabling prompt caching is even simpler: you toggle a setting in the agent configuration:

response = bedrock_agent.update_agent(
    agentId="your-agent-id",
    promptOverrideConfiguration={
        "promptCachingEnabled": True
    }
)

With this setting enabled, the agent automatically manages cache checkpoints without requiring any additional code in your application logic. This is particularly useful for complex agent workflows where manually placing checkpoints would be cumbersome.

How Much Does Bedrock Prompt Caching Cost?

Now let's talk about money. The economics of prompt caching are where this feature really shines for applications with repetitive content. Here's the detailed cost structure:

  1. Cache Write (First-time Processing): When content is processed and written to the cache for the first time, you pay a small premium over regular processing. For third-party models like Anthropic's Claude on Bedrock, the cache write cost is approximately 25% higher than the standard input token price. For Amazon's own models, there's currently (as of March 2025) no extra charge for cache writes.

  2. Cache Read (Subsequent Reuse): The big savings come from cache reads. When you reuse cached content, you pay only about 10% of the normal input token price, a 90% discount compared to processing those tokens from scratch.

  3. Storage Costs: There are no separate storage fees for keeping data in the cache. You only pay the read/write token fees described above.

This pricing structure reflects the real computational cost difference between processing tokens from scratch and reusing pre-computed representations, which I talked about in detail at the beginning of the article.

Let's see some example numbers.

Imagine you're building a financial document analysis system that allows users to upload quarterly reports (average 30,000 tokens) and ask multiple questions about them. Each user session involves around 8 questions about the same document. Without caching, each question would require reprocessing the entire document.

Let's calculate the costs using the newly released Claude 3.7 Sonnet pricing ($0.003 per 1,000 input tokens). Claude 3.5 Sonnet has the same price.

Without Caching:

  • Document processing per question: 30,000 tokens × $0.003/1000 = $0.09 per question

  • Total for 8 questions: 8 × $0.09 = $0.72

With Caching:

  • First question (cache write): 30,000 tokens × $0.003/1000 × 1.25 = $0.1125

  • Subsequent 7 questions: 7 × 30,000 tokens × $0.003/1000 × 0.1 = $0.063

  • Total: $0.1125 + $0.063 = $0.1755

That's a 75.6% cost reduction for a single user session. Now scale that to an enterprise with thousands of users and documents, and the savings add up to a lot. For a system processing 10,000 documents per month with this pattern, you'd save over $5,400 monthly.
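If you want to run this arithmetic for your own workload, it boils down to a few lines. The prices and ratios below are the ones quoted in this article (and the sketch ignores question and output tokens), so plug in current Bedrock pricing before trusting the output:

def session_input_cost(doc_tokens, questions, price_per_1k=0.003,
                       write_premium=1.25, read_discount=0.10):
    # Input-token cost for one session, with and without prompt caching
    without_cache = questions * doc_tokens / 1000 * price_per_1k
    with_cache = (doc_tokens / 1000 * price_per_1k * write_premium
                  + (questions - 1) * doc_tokens / 1000 * price_per_1k * read_discount)
    return without_cache, with_cache

without_cache, with_cache = session_input_cost(doc_tokens=30_000, questions=8)
print(f"${without_cache:.4f} vs ${with_cache:.4f}")  # $0.7200 vs $0.1755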

But it's not always that simple. Some scenarios where caching might not help much:

  • Single-use content: If each document is only analyzed once, the cache write premium actually increases your cost by 25% 🫠.

  • Tiny prompts: For very small prompts below the minimum token threshold, caching doesn't activate, so you don't save anything (but you don't pay extra either).

  • Long gaps between requests: If users typically wait more than 5 minutes between questions, the cache expires and you lose the benefit. Of course you still pay for the cache write, so again your costs increase by 25%.

  • High cache miss rate: If your application frequently generates slightly different versions of prompts that can't share cache entries, you'll pay for cache writes without getting read benefits.

To effectively measure your actual savings, implement logging for the cacheReadInputTokens and cacheWriteInputTokens metrics from your API responses. Over time, this data can help you optimize your caching strategy and quantify the ROI.
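A minimal way to do that logging is shown below. The savings estimate assumes the 90% read discount quoted above and doesn't account for the 25% write premium, so treat it as a rough indicator rather than an invoice.

import logging

logger = logging.getLogger("bedrock.cache")

def log_cache_usage(response, price_per_1k=0.003, read_discount=0.10):
    usage = response.get("usage", {})
    read = usage.get("cacheReadInputTokens", 0)
    write = usage.get("cacheWriteInputTokens", 0)
    # Cached reads cost ~10% of the normal price, so each cached token saves ~90%
    est_saved = read / 1000 * price_per_1k * (1 - read_discount)
    logger.info("cache_read=%d cache_write=%d est_saved_usd=%.6f",
                read, write, est_saved)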

There's another economic benefit worth mentioning: by reducing response latency, caching can also improve user experience. This secondary economic benefit doesn't show up directly in the token costs but can significantly impact your app, especially since end users’ most common complaint about AI applications (in my experience) is that they take too long to respond.

Performance Optimization

As I mentioned, prompt caching can have a significant impact on performance. According to AWS, cached content can be processed up to 85% faster than uncached content. This translates directly to lower latency for your users.

The performance gain scales with the size of the cached content, following an interesting pattern. Caching a few hundred tokens might save tens of milliseconds, while caching thousands of tokens can reduce response times by seconds, or even tens of seconds for very large prompts. This non-linear relationship occurs because token processing time in LLMs isn't perfectly linear; there are fixed overheads and optimizations that vary based on the total workload.

Here's a technical insight you might not find in the documentation: the performance benefit isn't just from skipping token processing; it also comes from avoiding the initial model loading and warmup. When a model starts processing a prompt, there's a "ramp-up" period where tensor operations aren't fully optimized. By jumping straight to a cached state, you skip this ramp-up, giving an additional performance boost beyond the raw token processing time.

Tips to optimize prompt structure for maximum performance

  1. Place cache checkpoints at logical boundaries in your prompt, such as after system instructions or reference documents but before user queries

  2. Ensure that static content comes before dynamic content in your prompts

  3. Meet the minimum token threshold for your model before inserting a checkpoint

  4. Structure multi-turn conversations to leverage previously cached content

A pitfall I've encountered is what I call "cache fragmentation": creating slightly different versions of similar prompts that can't benefit from the same cache entry. For example, if you include timestamps or request IDs in your prompt prefix, you'll create a unique cache entry for each request, effectively nullifying the benefits of caching. To avoid this, standardize your prompt templates and ensure that fixed content is consistent across requests.

For example, here's a simple prompt that has variable data right where you'd naturally tend to put it:

You will be presented with 3 articles about AWS. You must answer the user's questions about them. The user is a cloud engineer, so make sure you adopt an appropriate tone in your responses.

See the variables? 3 articles about AWS, and the user is a cloud engineer. So if you later need to send 4 articles instead of 3, or they're about AI instead of AWS, or the user is a sales rep instead of an engineer, you can't reuse this from the cache. Here's how you can rewrite it instead:

You will be presented with some articles about a specific topic, which I'll specify at the end of this prompt. You must answer the user's questions about them. The user has a certain role, which I'll specify at the end of this prompt, so make sure you adopt an appropriate tone in your responses.

[checkpoint]

topic: AWS
user role: cloud engineer

Now you can put a checkpoint at [checkpoint] and reuse everything that comes before it as a cached prompt. Of course this example is way too brief to cache, but I hope you get the idea.
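In Converse API terms, that [checkpoint] marker becomes a cachePoint content block, with the variable bits placed after it:

STATIC_INSTRUCTIONS = (
    "You will be presented with some articles about a specific topic, which I'll "
    "specify at the end of this prompt. You must answer the user's questions about "
    "them. The user has a certain role, which I'll specify at the end of this "
    "prompt, so make sure you adopt an appropriate tone in your responses."
)

messages = [{"role": "user", "content": [
    {"text": STATIC_INSTRUCTIONS},           # identical across requests -> cacheable
    {"cachePoint": {"type": "default"}},     # the [checkpoint] from the example above
    {"text": "topic: AWS\nuser role: cloud engineer"},  # the variables stay after it
]}]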

Implementation Patterns for Common Scenarios

Different use cases require different implementation approaches for prompt caching. Let's look at some common patterns I've found effective:

Multi-Turn Conversations

For conversational applications, cache the system prompt and conversation history to avoid reprocessing previous exchanges. In a multi-turn implementation:

def handle_conversation(conversation_id, user_message):
    # Retrieve conversation history (a list of Converse-style message dicts)
    history = get_conversation_history(conversation_id)

    # The system prompt goes in the Converse API's separate `system` field,
    # with a cache checkpoint right after it
    system = [
        {"text": SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},
    ]

    # Previous exchanges, with a cache checkpoint appended to the last one
    # (checkpoint placement support varies by model)
    messages = [dict(msg) for msg in history]
    if messages:
        messages[-1]["content"] = list(messages[-1]["content"]) + [
            {"cachePoint": {"type": "default"}}
        ]

    # Add the new user message as a list of content blocks
    messages.append({"role": "user", "content": [{"text": user_message}]})

    # Call Bedrock with caching enabled
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=system,
        messages=messages,
    )

    assistant_message = response["output"]["message"]

    # Save the exchange to history
    update_conversation_history(
        conversation_id,
        {"role": "user", "content": [{"text": user_message}]},
        assistant_message,
    )

    return assistant_message

This approach places checkpoints after the system prompt and conversation history, allowing the model to skip redundant processing of earlier messages. One thing to watch out for is the context window limit: if the history grows too large, you'll need to summarize or truncate it while maintaining the cache checkpoints. And if you summarize it, make sure you drop the cache by starting a new conversation! I've made the mistake of adding the summary while also keeping the old messages 🤦.
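To make that concrete, here's a minimal sketch of the "summarize and reset" pattern. summarize_history and replace_conversation_history are hypothetical helpers standing in for whatever your app uses:

def compact_conversation(conversation_id):
    # Summarize the old exchanges (e.g. with another, cheaper model call)
    history = get_conversation_history(conversation_id)
    summary = summarize_history(history)

    # Start a brand-new message list: the old cached prefix no longer applies,
    # so don't keep the original messages alongside the summary
    fresh_history = [{"role": "user", "content": [
        {"text": "Summary of our conversation so far:\n" + summary},
        # Only worth a checkpoint if the summary meets the minimum token threshold
        {"cachePoint": {"type": "default"}},
    ]}]
    replace_conversation_history(conversation_id, fresh_history)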

Document Q&A

For document-centric applications (i.e. the first thing that comes to mind when I say RAG), cache the document content to enable fast querying:

def document_qa(document_id, query):
    # Retrieve document content
    document = get_document_content(document_id)
    
    # Structure request with document caching
    messages = [
        {"role": "user", "content": [
            {"text": "I want to ask questions about this document:\n\n" + document},
            {"cachePoint": {"type": "default"}},
            {"text": query}
        ]}
    ]
    
    # Call Bedrock with caching enabled
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=messages
    )
    
    return response["output"]

A non-obvious optimization here: for very large documents, you might need to split them into chunks with multiple cache points. Claude 3.5 supports up to 4 checkpoints, so you could structure a long document with checkpoints after each quarter, allowing the earlier chunks to be reused even if a later section changes. If nothing changes, don't bother; one big checkpoint and four smaller checkpoints perform the same.
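Here's a rough sketch of that chunked layout; split_into_chunks is a hypothetical helper, and remember that caching is prefix-based, so a later checkpoint only helps if everything before it is unchanged:

def build_chunked_messages(document, query, max_checkpoints=4):
    # Split the document into roughly equal parts (hypothetical helper)
    chunks = split_into_chunks(document, parts=max_checkpoints)

    content = []
    for chunk in chunks:
        content.append({"text": chunk})
        # One checkpoint per chunk; earlier chunks stay reusable even if a
        # later chunk changes
        content.append({"cachePoint": {"type": "default"}})

    content.append({"text": query})
    return [{"role": "user", "content": content}]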

Coding Assistant

For coding assistants that analyze codebases:

def code_assistant(repository_id, file_paths, query):
    # Retrieve code files
    code_context = ""
    for path in file_paths:
        code = get_file_content(repository_id, path)
        code_context += f"File: {path}\n```\n{code}\n```\n\n"
    
    # The system prompt goes in the Converse API's `system` field, not in `messages`
    system = [{"text": "You are a coding assistant that helps with programming tasks."}]
    
    # Structure request with code caching
    messages = [
        {"role": "user", "content": [
            {"text": "Here is the code to analyze:\n\n" + code_context},
            {"cachePoint": {"type": "default"}},
            {"text": query}
        ]}
    ]
    
    # Call Bedrock with caching enabled
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=system,
        messages=messages
    )
    
    return response["output"]

An important note for code analysis: because code often changes incrementally, consider version-aware caching strategies. For example, you might track the current git commit hash alongside your prompt prefix, so you never keep reusing a conversation whose cached code context has since changed.

To be honest I haven't found a good way to use cache for this. My conclusion is that I'd need to be able to predict which files are less likely to change so I can place them in the cache. This sounds partially doable with some pre-processing of the files and the user's query, but I haven't tested it, nor explored other ideas. Too busy writing!

Debugging Cache Issues

When things go wrong with prompt caching, diagnosing the problem can be tricky without knowing what to look for. Here are the most common issues I've encountered and how to resolve them:

Identifying Cache Misses

The first step in debugging is determining whether your cache is being used at all. The most reliable method is to check the response metrics:

def is_cache_hit(response):
    # Converse API responses report cache usage in the top-level usage field
    usage = response.get("usage", {})
    read_tokens = usage.get("cacheReadInputTokens", 0)
    return read_tokens > 0

If this function returns False, your cache isn't being hit. The most common reasons are:

  1. Cache expiration: The 5-minute TTL elapsed between requests

  2. Prompt mismatch: The prefix text doesn't exactly match the cached version

  3. Token threshold not met: You're trying to cache a segment smaller than the minimum requirement

  4. Cache point not properly placed: The cachePoint marker is missing or incorrectly formatted

Conclusion: Please Use Prompt Caching Often (not always)

Prompt caching should be a standard consideration in your design process. It's particularly valuable for applications with these characteristics:

  • Multi-turn conversations with consistent system prompts

  • Document-centric analysis where users ask multiple questions about the same content

  • Coding assistants that need to reference the same codebase repeatedly (though this needs a bit more work)

  • Any workflow where large static context is combined with smaller dynamic queries

When not to use it: anything that never reuses content across requests, i.e. one-off prompts where nothing references older messages or a shared context.

Tips when using caching:

  • Design your prompts with caching in mind from the beginning. For example, move variables to the end.

  • Monitor cache performance and cost metrics to validate your approach. Remember that writing the cache costs 25% more, so if you're not benefitting you're overpaying.

  • Plan for cache misses and expirations. They will happen more than you think.

That's it. Sorry I took a bit long to say that, but I hope you found the inner workings interesting!
