Amazon Bedrock Deep Dive: Building and Optimizing Generative AI Workloads on AWS
AWS is mainly about two things: scaling effortlessly and offloading undifferentiated heavy lifting (gotta love those marketing terms 🤣). Amazon Bedrock attempts to do that for generative AI: delivering a fully managed service that offers pre-trained foundation models, plus the ability to customize, integrate, and scale them in production. I would even call it serverless, since you pay per use ($/token for natively supported models) instead of per hour of servers with GPUs.
But as with many “serverless” concepts in AWS (think AWS Lambda), there’s an underlying architecture and operational nuance you absolutely need to understand to build robust, high-performance and simply good (or at least not terrible) applications. So let's go beyond the marketing pitch and talk about how Bedrock works. We’ll be discussing topics like:
Foundation Models and how to fine-tune them.
Bedrock Knowledge Bases for Retrieval-Augmented Generation (RAG).
Agents that orchestrate multi-step tasks and external API calls.
Performance and cost optimization strategies for large-scale usage, including some pricing analysis.
Security and compliance patterns, encryption, PrivateLink, IAM, VPC endpoints, etc.
I've also included a “continuous” example in an attempt to tie everything together, imagining throughout the article how a company might build a flexible AI platform using Bedrock.
Foundation Models and Fine-Tuning
The term “foundation model” can mean different things in different contexts, but specifically for AWS it means pre-trained Large Language Models (LLMs) and other types of models from providers like Amazon itself, Anthropic, AI21 Labs, or Stability AI. Bedrock gives you an environment and API where you can use these models and pay per use (tokens), without managing the compute clusters where they are actually running.
Picking a Model for Your Use Case
Depending on your scenario, you might opt for a smaller or larger model. Maybe you’re building a chatbot that handles routine queries about HR policies. A smaller text model can handle that just fine, and while you don't exactly care about model size since you're not running the servers, smaller models are typically much cheaper in $/token. On the other hand, if you need something more creative or robust in complex dialogues, like writing nuanced policy briefs, drafting marketing copy, or writing an article like this one (I still can't find a model that comes even close!), you might choose a bigger model with hundreds of billions of parameters, which can better understand context and produce more varied, higher-quality output.
As I mentioned, size doesn't matter directly. What you really care about is price per tokens, what modalities it supports (e.g. can it understand images, produce images, understand video, etc), and how “smart” or “good” it is at those tasks. There are benchmarks out there, but they're generic and usually not that representative of real workloads (model providers typically train their models to do exceptionally well on those benchmarks, so their models appear better than they are). So in the end you just need to test a few options and see which one works best for you.
Bedrock Under the Hood: GPU Abstractions and Data Privacy
In practice, these models are deployed on GPU clusters that AWS manages. When you call Bedrock’s InvokeModel endpoint, your request is routed to the appropriate set of GPU instances to process the token stream. AWS does an excellent job of abstracting you away from that compute environment, and it handles versioning, patching, and auto-scaling those GPU resources. Moreover, your data never mixes with other customers’ data, so you get strict isolation.
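To make that concrete, here's a minimal sketch of an invocation from Python with boto3. I'm using the Converse API, which gives you a uniform request shape on top of model invocation (raw InvokeModel works too, but each model family expects its own JSON body), and the model ID is a placeholder for whatever model you've enabled:

```python
import boto3

# bedrock-runtime is the inference client; the "bedrock" client is for management operations
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",  # placeholder; use a model you've enabled in your account
    messages=[
        {"role": "user", "content": [{"text": "Summarize our PTO policy in two sentences."}]}
    ],
    inferenceConfig={"maxTokens": 300, "temperature": 0.3},
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # input/output token counts, handy for cost tracking later
```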
One key point is the data privacy guarantee. Whenever you fine-tune or customize a foundation model, AWS spins up a private copy for your account. That specialized version is not shared globally. This approach solves a common enterprise fear: “If I upload my proprietary data, will it show up in some random person’s conversation?” Short answer: no. Long answer: noooooo.
Fine-Tuning Workflows
Suppose you have a repository of domain-specific text, like all the internal procedures or specialized knowledge from your financial services company. Base LLMs are not trained on it, and while you can use RAG techniques (we'll talk about that in the next section), the knowledge might not fit the model's context window, or RAG might not provide the desired quality. That’s where fine-tuning comes in:
Data Gathering and Preparation: Typically you drop your labeled data (could be text pairs for Q&A, or examples of desired text completions) into an S3 bucket.
IAM Role Configuration: You create an IAM role that grants Bedrock the ability to read from your S3 bucket. At the same time, you’d ensure strict scoping so it only accesses the relevant data.
Initiating the Fine-Tune Job: Through either the AWS console or an API call, you kick off a fine-tuning job referencing the base model and your S3 data. AWS spins up the required GPU resources and does the fine-tuning.
Private Model Copy: Once training completes, you end up with a custom model ARN that you alone can invoke. All the artifacts remain locked down.
If you need finer control of the process, like handling partial epochs or using advanced hyperparameters (tip: if you don't know what these are, you don't need them), you'll have to go to SageMaker, since Bedrock abstracts that away from you. In Bedrock your control is mostly limited to what training data you supply. The bright side is that you don't need to manage the compute environment.
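For reference, here's a minimal sketch of step 3 (kicking off the job) with boto3. The names, ARNs, and hyperparameter keys are illustrative; the accepted hyperparameters vary by base model:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_customization_job(
    jobName="financial-advisor-ft-001",
    customModelName="financial-advisor-v1",
    customizationType="FINE_TUNING",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",  # role that can read your S3 data
    baseModelIdentifier="amazon.nova-pro-v1:0",                    # placeholder base model
    trainingDataConfig={"s3Uri": "s3://my-bank-training-data/qa-pairs.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bank-training-data/output/"},
    hyperParameters={"epochCount": "2", "learningRate": "0.00001"},  # keys differ per model family
)

print(job["jobArn"])  # poll get_model_customization_job until the status is Completed
```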
"Continuous" Example Part I: Building a Knowledge-Heavy Financial Advisor Model
Imagine you work at a mid-sized bank. You’ve got thousands of pages of regulatory guidelines, internal rules, and Q&A docs that your employees constantly reference. The first step in building an in-house “Financial Advisor” AI is to pick a suitable text-based foundation model. You choose one of the Amazon Nova models for cost and performance reasons, then begin fine-tuning it with your annotated Q&A data stored in S3. After two or three test runs, you find a sweet spot of about 5,000 examples that significantly reduce hallucinations. You take the resulting custom model ARN and store it in your imagination for future steps.
Knowledge Bases and Retrieval-Augmented Generation (RAG)
One of the biggest pain points with LLMs is the data they were trained on. They can likely recite Shakespeare or the AWS documentation (or the AWS docs in Shakespeare style I guess, I haven't tried), but if you ask them about your company's private data, they (hopefully!) won't know anything about it. Fine-tuning a model with that data is an option, but it's the more expensive one. If you just need the model to access a few documents at a time, you should use Bedrock Knowledge Bases.
Before you ask, yes, you can pack up your entire corporate wiki in a prompt (I've literally done that in a couple of projects), but you're limited by the model's context window, which is how many tokens it can hold in its “memory”. State of the art models have a total context window of 200k to 300k tokens, and a “good” context window of around 100k tokens (beyond the “good” context window, the model still remembers the information but performance degrades significantly). Moreover, if you pass 50k tokens as the input of every prompt and let the model find the relevant information there, it will very likely do a good job at it (much better than if you pass 150k or 500k tokens), but you'll have to pay for 50k input tokens! 💸
So instead of passing all the documents in a prompt, you can store the relevant information in a dedicated vector store. When a user asks a question, the system runs a vector similarity search (using embeddings) and attaches the best pieces of content to the prompt. Then the LLM can generate the answer as if you had pasted only the relevant articles from your corporate wiki instead of the entire thing, which of course takes up a lot less space in the context window and a lot less $ for input tokens. This process is commonly called Retrieval-Augmented Generation, or RAG if you like acronyms (we all do).
One thing to note is that, while the form of RAG I just described is by far the most common one, it's technically just one form of RAG among many. RAG as a whole consists of retrieving additional information (i.e. not contained in the prompt or the model's training data) and using it to generate the response. The information can be stored in a vector database and can be used via Bedrock Knowledge Bases, or it could also come from an API, a regular (i.e. non-vector) database, reading files from S3, or even the internet (yes, web search and even ChatGPT's Deep Research feature are forms of RAG).
Internal Mechanics of Bedrock Knowledge Bases
When you create a Knowledge Base in Bedrock, you specify data sources, such as an S3 bucket with text files or an existing OpenSearch cluster (tip: use OpenSearch Serverless). Bedrock will parse these documents, split them into manageable “chunks,” and generate embeddings using a specialized model like Titan Embeddings. Each chunk is stored in a vector index. Then, at query time, you do something like:
```
POST /bedrock/knowledge-bases/<knowledgeBaseId>/retrieve-and-generate
{
  "userQuery": "Explain the new corporate compliance rules for personal loans.",
  "maxDocuments": 3,
  "modelId": "your-custom-model-arn"
}
```
Bedrock looks up the top 3 documents that match the user query’s embedding, appends that text as context, and calls your custom LLM. The LLM sees the appended context, and ideally produces a grounded, accurate answer that references the compliance rules. This approach drastically reduces the classic “hallucinations” problem where the model makes stuff up when it doesn't have information. Note that the effect of this form of RAG is almost the same as if you yourself had looked up those documents and attached them to the prompt, except of course it's automated.
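In boto3 terms, that call maps roughly to retrieve_and_generate on the bedrock-agent-runtime client. A sketch with placeholder IDs; check the API reference for the exact configuration shape:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "Explain the new corporate compliance rules for personal loans."},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "ABCDEFGHIJ",  # placeholder Knowledge Base ID
            "modelArn": "arn:aws:bedrock:us-east-1:123456789012:custom-model/financial-advisor-v1",  # placeholder
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 3}
            },
        },
    },
)

print(response["output"]["text"])
# Each citation points back to the chunks that grounded the answer
for citation in response.get("citations", []):
    print(citation["retrievedReferences"])
```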
Advanced RAG Considerations
Index Size and Update Frequency: If your documents change frequently, you’ll want to configure incremental indexing or schedule updates. If your index is huge, like half a million text chunks, then you have to consider retrieval latency. That can be tuned by selecting an appropriate underlying vector engine or letting Bedrock create an Amazon OpenSearch Serverless domain behind the scenes.
Embedding Strategies: Bedrock’s default embeddings are pretty great, but in specialized domains (like medical or legal), you might consider training your own embedding model to capture domain-specific language. For advanced use, you can integrate a custom embedding pipeline in SageMaker and then push those embeddings into the Knowledge Base.
Filtering and Metadata: It helps to tag each chunk with metadata (e.g., “compliance,” “year:2023,” “region:EU”) so that queries can be further filtered if needed. This ensures the retrieval engine returns the most relevant snippets.
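For the metadata filtering in that last bullet, the retrieve API accepts a filter next to the vector search settings. A sketch with placeholder values (the full set of filter operators is documented in the Knowledge Bases API reference):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve(
    knowledgeBaseId="ABCDEFGHIJ",  # placeholder
    retrievalQuery={"text": "What changed in the 2023 EU compliance rules?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 3,
            # Only consider chunks tagged with this metadata
            "filter": {"equals": {"key": "region", "value": "EU"}},
        }
    },
)

for result in response["retrievalResults"]:
    print(result["content"]["text"][:200], result.get("metadata"))
```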
"Continuous" Example Part II: Adding Knowledge Bases to Our Financial Advisor
Continuing our scenario, you now have a custom model that’s aware of your bank’s style and partially understands your domain. But your bank’s compliance docs are too large to embed into every single prompt. So, you create a Knowledge Base that ingests those regulatory PDFs from S3. Each PDF ends up split into chunks of 500 to 1000 characters. Bedrock indexes them, storing embeddings behind the scenes. Next, you configure your “Financial Advisor” application to use the Knowledge Base’s retrieval endpoint whenever an employee asks about a policy. Now the model can reliably reference the official guidelines, giving correct and up-to-date answers.
Agents for Amazon Bedrock, Multi-Step Reasoning and External Integrations
You might have seen demos of AI chatbots that can retrieve information, call external services, or complete multi-step tasks automatically. This is typically achieved with an Agentic Architecture, and the feature Agents for Amazon Bedrock makes this possible natively in AWS. Unlike a standard “prompt in, response out” approach, an agent can be granted permission to call certain APIs or functions, chain multiple thoughts, use “tools”, call other agents, and incorporate results back into the conversation.
How Agents Work Internally
Think of an agent like a mini-orchestrator running on top of a large language model. When a user query comes in, for example “Please check the latest credit approval for customer ID 12345, then email me the result”, the agent can break that down into multiple steps:
Interpret the query.
Determine it needs to call a “CheckCreditApproval” function with parameter customerId=12345.
Fetch the result.
Format a summary email.
Unlike a classic monolithic LLM workflow, the agent can parse each step and verify or refine them as it goes. This is particularly powerful in automation contexts, where the AI’s utility depends on getting real-time data or performing external actions.
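Calling an agent from code looks like the following sketch. The IDs are placeholders, and the response arrives as an event stream that you assemble into the final answer:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.invoke_agent(
    agentId="AGENT12345",               # placeholder agent ID
    agentAliasId="ALIAS12345",          # placeholder alias ID
    sessionId="employee-42-session-1",  # reuse the same ID to keep conversation state
    inputText="Check the latest credit approval for customer ID 12345, then email me the result.",
)

# The agent streams back chunks of the final answer as it works through its steps
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")

print(answer)
```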
Security and Guardrails for Agents
One of the biggest concerns with “agentic AI” is letting the model run wild. Imagine if you grant it permissions to run shell commands on your computer, and it runs rm -rf /. That’s why you define action groups in Bedrock. An action group might contain things like:
An API to query internal customer data.
A function to format a PDF, stored in a secure internal system.
A process to update a record in a CRM.
The model can only invoke what’s in the action group, and you set up IAM policies to limit each function’s scope. For instance, you might allow read-only operations on the customer database to one agent, but never “delete” actions. Logging and monitoring are also key. You can capture every call the agent makes to ensure it’s not stepping outside compliance boundaries.
Different Agent Archetypes
A supervisor agent is the decision-maker. It might coordinate multiple sub-agents if you have a very complex workflow, like verifying user identity, retrieving data from three separate systems, and then deciding if an action is authorized. Specialized agents just handle a single specialized job, like summarizing a call log from Amazon Connect.
You don't strictly need a supervisor agent and multiple specialized agents; action groups alone are worth using Agents for Amazon Bedrock. However, in complex workflows the separation of concerns between multiple agents makes things easier to manage. I equate them to microservices: you likely don't need them, until the system grows so big that splitting it makes each piece more manageable, even if the split itself adds some overhead.
"Continuous" Example Part III: Adding an Agent to Our Financial Advisor
Our bank’s “Financial Advisor” solution becomes considerably more capable when we introduce an agent. Suppose a branch employee types: “What’s the credit limit for user ID 12345? Then send them an email with the updated policy highlights.” The agent queries the Knowledge Base for policy highlights, calls an internal “getCreditLimit” API with that user ID, merges the results, and composes an email. Because we meticulously defined an action group that includes only read rights for user data, plus an email-sending function, we can ensure it never writes or modifies sensitive data. It also logs each step, so security teams can track usage.
Amazon Bedrock Security and Compliance
Bedrock stands on the same security foundations as the rest of AWS, layering in features that handle enterprise-grade security and data isolation. I mentioned earlier that custom model artifacts remain private, but that's just the start.
AWS IAM Policies and Service Roles
At a high level, you have two sets of permissions:
Who or what can perform actions such as invoke your model, create knowledge bases, or manage them.
What data sources your model can read.
The IAM Principal you're using (e.g. an IAM User or Role) needs to have permissions to invoke a model in Bedrock. For example, you might have an IAM role that exclusively allows bedrock:InvokeModel on your custom model ARN. Another role might let you manage knowledge base ingestion jobs but not actually call the model. This separation ensures that your operations team can handle content ingestion while your application only deals with inference, limiting the blast radius of any accidental (or malicious) usage.
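As a sketch, an inference-only policy for that first role could be created like this (account ID and model name are placeholders):

```python
import json
import boto3

iam = boto3.client("iam")

# Allows invoking one specific custom model and nothing else
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
            "Resource": "arn:aws:bedrock:us-east-1:123456789012:custom-model/financial-advisor-v1",
        }
    ],
}

iam.create_policy(
    PolicyName="BedrockInvokeFinancialAdvisorOnly",
    PolicyDocument=json.dumps(policy_document),
)
```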
Additionally, some features like Bedrock Agents and batch inference require you to assign a service role. This role grants Bedrock permission to execute certain actions, such as accessing an S3 bucket or a Knowledge Base.
VPC Endpoints and PrivateLink
By default, Bedrock endpoints can be invoked over the internet, though always via TLS. For truly locked-down environments, you can enable PrivateLink. This means your requests route through AWS’s internal backbone rather than going out to the public network. You can similarly attach your knowledge base vector store behind a private endpoint, so embeddings never move over the open internet. This stuff is critical in finance, healthcare, or government settings with significant regulations.
Encryption at Rest and in Transit
Every chunk of data, be it your fine-tuning set in S3 or the index of your knowledge base in OpenSearch, stays encrypted at rest based on the configurations of that data source. Bedrock uses AWS KMS to manage those keys. If you want more control, you can supply your own KMS key. For encryption in transit, all traffic is HTTPS-only.
Logging and Monitoring
CloudWatch and CloudTrail record which principal invoked your model, how many tokens were processed, and if any errors occurred. You can funnel these logs into a central SIEM, or analyze them in Athena for usage trends. As with many serverless services, you can also choose to store raw request/response logs in S3. Just be mindful that if your data is sensitive, you’ll want to apply the correct encryption and limit S3 access accordingly.
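Capturing those raw logs is an account-level setting called model invocation logging. Here's a sketch of enabling it with boto3; the bucket, log group, and role are placeholders, and the role needs write access to both destinations:

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/invocation-logs",                       # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",   # placeholder
        },
        "s3Config": {
            "bucketName": "my-bank-bedrock-logs",  # placeholder; encrypt and restrict access
            "keyPrefix": "invocations/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```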
Performance Tuning and Cost Optimization in Amazon Bedrock
Generative AI isn’t cheap yet, even though it has gotten a lot cheaper in the past 2 years. Hitting these models constantly can burn through your bank account really quickly, especially if you’re handling hundreds of thousands of daily requests. Honestly, that's the first lesson: it isn't cheap, period. However, a little knowledge goes a long way in controlling costs.
The Pricing Model of Bedrock
AWS typically charges for each thousand input and output tokens you process, plus an optional overhead for advanced features. For example, for the Amazon Nova Pro model the price for 1,000 input tokens is $0.0008, and for 1,000 output tokens $0.0032. If you opt for provisioned throughput (described below) you would pay $18.40/hour to lock in a certain capacity at a discount.
For fine-tuning, the price to train 1,000 tokens for Nova Pro is $0.008, then to store each custom model it's $1.95 per month, and the price to infer for 1 model unit per hour is $108.15.
Using Bedrock Knowledge Bases will cost you $2.00 per 1000 queries for Structured Data Retrieval (SQL Generation).
Those are just a few numbers, to give you an idea of what you pay for and how much. Make sure to check the official AWS documentation for the current pricing because it tends to evolve over time. The complete table of prices can be found at the Amazon Bedrock pricing page.
Example Cost Calculation
Imagine our financial advisor solution receives 10,000 daily queries, each using an average of 500 input tokens and producing 150 output tokens. That's 5,000,000 input tokens and 1,500,000 output tokens per day. If you use the Amazon Nova Pro model the price for 1,000 input tokens is $0.0008, and for 1,000 output tokens $0.0032. So your daily cost would be $4 + $4.80 = $8.80. Over a month, that might run up to $264.
Of course it's going to be a bit expensive: we're using Amazon's most expensive model! If you use Amazon Nova Lite, a smaller model which costs $0.00006 per 1,000 input tokens and $0.00024 per 1,000 output tokens, our daily cost would be $0.30 + $0.36 = $0.66, and the monthly cost $19.80.
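If you want to play with the numbers yourself, the math fits in a few lines (prices hardcoded from the example above; always check the pricing page for current values):

```python
def monthly_cost(queries_per_day, in_tokens, out_tokens, price_in_per_1k, price_out_per_1k, days=30):
    """Rough monthly inference cost for a steady daily query volume."""
    daily_in = queries_per_day * in_tokens / 1000 * price_in_per_1k
    daily_out = queries_per_day * out_tokens / 1000 * price_out_per_1k
    return (daily_in + daily_out) * days

# Nova Pro vs Nova Lite, using the example's 10,000 daily queries (500 in / 150 out tokens)
print(monthly_cost(10_000, 500, 150, 0.0008, 0.0032))    # ~264.0
print(monthly_cost(10_000, 500, 150, 0.00006, 0.00024))  # ~19.8
```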
Reducing Token Usage
One of the simplest ways to reduce costs is to be mindful of your prompt and output lengths. For example, instead of stuffing the entire knowledge base text into your request, rely on RAG to fetch only the relevant chunks. Also, enforce a sensible output length limit if your scenario doesn’t require extremely verbose answers. Sometimes you can lower the model’s temperature or top_p settings to reduce “rambling” responses.
Reducing the size of prompts can, of course, have unintended side effects on the output. The only real solution for that is good prompt engineering and a lot of testing. There's a concept called Minimum Viable Tokens (MVT), which is essentially how small you can make a prompt or input in general while still producing a viable output. If you started prompting a couple of years ago, you might be used to long, descriptive and very repetitive prompts. Newer models don't benefit as much from huge prompts, and for “reasoning” models like OpenAI's o1 you don't even need to instruct it to “think in steps” (a technique called Chain of Thought, or CoT). So don't be afraid to experiment with shorter and simpler prompts.
Provisioned Throughput for Consistent Workloads
Everything we've discussed so far is for the On-Demand mode, the default mode of Bedrock. If you have a stable volume of requests, you might want to set up provisioned throughput. This approach can yield cheaper per-token rates, and it will guarantee dedicated capacity so you don’t get high-latency responses during traffic spikes on Bedrock. Once you define a certain throughput, AWS reserves the capacity for you. If your workload is unpredictable or spiky, you'll likely prefer paying a bit more per token to avoid paying for idle capacity. Basically just like Reserved Instances.
Distillation for Speed and Cost
Distillation is a process where you use a larger model to generate good question-answer pairs, and train a smaller model to output those answers based on those questions, basically mimicking the performance of the larger model. Distilling a fine-tuned model into a smaller architecture can significantly reduce token processing time and cost. Bedrock can help you automate parts of this process. The resulting “student” model might not capture the full linguistic flair of the “teacher” model, but in many specific settings the difference in quality is negligible compared to the cost savings. Distillation is still somewhat specialized, so if you want maximum control, you might want to do the distillation steps in SageMaker and import the resulting smaller model back into Bedrock. I'd leave this as the last option, mostly because it's a lot more complex.
Monitoring and Scaling
Tip: always set up usage alerts. Tools like AWS Budgets can email you if your monthly spend surpasses a threshold. Meanwhile, you can log metrics in Amazon CloudWatch: keep an eye on invocation counts, token usage, and latencies. For mission-critical apps, consider setting concurrency limits or fallback strategies so your application gracefully handles times when your Bedrock calls might be delayed or throttled. By the way, please use exponential backoff with jitter.
Integrations with Other AWS Services
A pretty typical thing to do is to store user prompts and conversation logs in Amazon DynamoDB to maintain session continuity. Then, each time a user returns, you retrieve their session context from DynamoDB and feed it into the next model invocation (or the knowledge base retrieval), so the conversation feels continuous. Alternatively, you might integrate with Amazon S3 for archiving older logs or large files that users want summarized.
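Here's a minimal sketch of that session pattern, assuming a hypothetical DynamoDB table keyed on session_id and the Converse API (table name, model ID, and attribute layout are made up for illustration):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("chat-sessions")  # hypothetical table, partition key: session_id
bedrock_runtime = boto3.client("bedrock-runtime")

def chat(session_id: str, user_text: str) -> str:
    # Load prior turns (if any) so the conversation feels continuous
    item = table.get_item(Key={"session_id": session_id}).get("Item", {})
    messages = item.get("messages", [])

    messages.append({"role": "user", "content": [{"text": user_text}]})
    response = bedrock_runtime.converse(
        modelId="amazon.nova-lite-v1:0",  # placeholder
        messages=messages,
        inferenceConfig={"maxTokens": 300},
    )
    assistant_message = response["output"]["message"]
    messages.append(assistant_message)

    # Persist the updated history for the next turn
    table.put_item(Item={"session_id": session_id, "messages": messages})
    return assistant_message["content"][0]["text"]
```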
If your app needs an API front door, Amazon API Gateway is your friend. You can set up an endpoint that receives requests from your web or mobile clients, passes them to a Lambda function that orchestrates the call to Bedrock, and then returns the response. For advanced analytics, you could feed logs or partial transcripts into Amazon Kinesis or Amazon OpenSearch for real-time searches or advanced dashboards.
SageMaker deserves a special mention here. It’s a more specialized ML service, perfect if you want granular control over the training environment (e.g., specific GPUs, custom Docker images, specialized algorithms). Bedrock is for those who prefer a pre-baked, serverless environment with minimal overhead, focusing purely on inference and light customization of large models. In some scenarios you might even combine them: train or distill a model in SageMaker, then host it on Bedrock for easy production usage.
Troubleshooting, Common Pitfalls and What to Look Out For
A few things to keep in mind with Bedrock:
1. Token Throttling and 429 Errors
If your concurrency spikes suddenly, you might see throttling errors. Check if you can either scale up provisioned throughput or spread out requests. Also, as usual, implement retry mechanisms with exponential backoff and jitter.
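boto3 can handle much of the retry logic for you through its client config; a quick sketch (the adaptive mode adds client-side rate limiting on top of exponential backoff):

```python
import boto3
from botocore.config import Config

# Retries with exponential backoff; "adaptive" also throttles the client when it keeps seeing 429s
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

bedrock_runtime = boto3.client("bedrock-runtime", config=retry_config)
```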
2. RAG Performance Issues
Large knowledge bases might slow down retrieval if not indexed properly. Monitor retrieval latency in logs. If it’s high, consider upgrading or optimizing your vector store.
3. Agent Overreach
Agents can get “creative” if given too many permissions. Always define narrow action groups. Log and review the actions your agent is taking to ensure compliance and just overall safety. Don't let agents anywhere near the nuclear launch codes, please.
4. Encrypted Data Handling
If your data in S3 is KMS-encrypted with a custom key, make sure your Bedrock role has permission to use that key for decryption. A missing permission can silently break your knowledge base ingestion process.
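The fix is usually an extra statement on the Bedrock service role, something like this sketch (role name and key ARN are placeholders):

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical: let the Bedrock ingestion role use the KMS key that encrypts your S3 data
iam.put_role_policy(
    RoleName="BedrockKnowledgeBaseRole",  # placeholder service role name
    PolicyName="AllowKmsDecryptForIngestion",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:DescribeKey"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
        }],
    }),
)
```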
5. Unexpected Model Behavior
Sometimes your model might produce odd answers, especially if you keep prompting it without context. Double-check that your knowledge base retrieval is appended to the prompt and that you’re passing the correct model version ARN. Also confirm you’re not exceeding maximum prompt length (which might truncate important instructions).
Expanding The "Continuous" Example of an End-to-End Financial Advisor
Bringing it all together, here’s how the final architecture might look in a real deployment:
User Interaction (Frontend or Chat Interface)
An employee (or even a customer) interacts with a web or mobile chat UI. This UI hits your internal API gateway.
API Gateway and Lambda Orchestration
A request flows into AWS API Gateway, which triggers a small Lambda. This Lambda checks whether the request is straightforward (just a quick question) or if it requires multi-step logic. Note: This could also be done with Intelligent Prompt Routing, but this article was getting too long.
Bedrock Knowledge Base Retrieval
If it’s a question about compliance or policy, the Lambda calls the knowledge base’s retrieve-and-generate method, referencing your custom model ARN. The system fetches relevant chunks from the vector index, appends them to the prompt, and returns a good answer.
Agent Invocation for Complex Tasks
If the user’s request includes an action like “update user’s credit limit,” the Lambda instructs a Bedrock Agent that has an action group for read-only checks plus an “updateCreditLimit” function (though maybe you can put a strict approval workflow in place). The agent calls those steps in sequence.
Response and Logging
The final answer is returned to the user interface. Meanwhile, the system logs the session context in DynamoDB, lumps extended transcripts in S3 for analytics, and you track usage metrics in CloudWatch.
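As a rough sketch of the Lambda in step 2, here's how the routing between the Knowledge Base and the agent might look in Python. All IDs and ARNs are placeholders, and the "does this need an action" check is deliberately naive:

```python
import json
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

KB_ID = "ABCDEFGHIJ"  # placeholder Knowledge Base ID
MODEL_ARN = "arn:aws:bedrock:us-east-1:123456789012:custom-model/financial-advisor-v1"  # placeholder
AGENT_ID, AGENT_ALIAS_ID = "AGENT12345", "ALIAS12345"  # placeholders

ACTION_KEYWORDS = ("update", "send", "email", "create")  # naive routing heuristic

def handler(event, context):
    body = json.loads(event["body"])
    question, session_id = body["question"], body["sessionId"]

    if any(word in question.lower() for word in ACTION_KEYWORDS):
        # Multi-step or action-taking request: hand it to the agent
        stream = agent_runtime.invoke_agent(
            agentId=AGENT_ID, agentAliasId=AGENT_ALIAS_ID,
            sessionId=session_id, inputText=question,
        )
        answer = "".join(
            e["chunk"]["bytes"].decode("utf-8") for e in stream["completion"] if "chunk" in e
        )
    else:
        # Plain question: RAG over the Knowledge Base is enough
        result = agent_runtime.retrieve_and_generate(
            input={"text": question},
            retrieveAndGenerateConfiguration={
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {"knowledgeBaseId": KB_ID, "modelArn": MODEL_ARN},
            },
        )
        answer = result["output"]["text"]

    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```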
In our imagination we've built a “Financial Advisor Copilot” that can retrieve official policy, interpret user requests, and even carry out certain tasks. By combining a fine-tuned LLM with knowledge bases, agent orchestration, strict IAM scoping, and cost monitoring, we've successfully imagined a robust enterprise AI solution.
Conclusion and Next Steps
As I mentioned at the beginning, the whole idea of Bedrock is to just let you use AI and pay per usage, without needing to worry about the underlying environments. When it was launched the situation was a bit weird: Since you could do the same thing with the model providers’ APIs, it was just a model aggregator. But as they kept launching features like Agents and Knowledge Bases, they started painting a much bigger picture.
Nowadays Bedrock can still be a model aggregator if you want, but the real value is in all the added features: Distillation, Agents, Knowledge Bases, security features, etc. One problem with it trying to do everything is that it's mostly succeeding, and that means I can't cover everything about it in a 5,000-word article. But I hope I got the main points across.
Key lessons you should walk away with:
Model Customization is straightforward but limited to high-level parameters. If you need more control use SageMaker.
Knowledge Bases and RAG make your AI more factual by integrating real data sources, drastically reducing hallucinations. Proper chunking, embedding, and indexing can make or break performance. Either way, it's always better than pasting your entire wiki into a prompt.
Agents can manage multi-step logic and external API calls. Be strict about action groups and permissions; you don't want an agent running rm -rf /.
Performance and Cost are dominated by input and output tokens. Reduce tokens, select models carefully, and always set billing alerts.
Security and Compliance are mostly handled by AWS. Pay attention to IAM permissions, and consider private networking if relevant. The default is pretty secure, but highly regulated environments need a few more steps.
One final piece of advice: AI is evolving much faster than other computer stuff. Better and/or cheaper models come out at least every month, and Bedrock has added a ton of features since its release a year and a half ago (September 28, 2023 to be precise). Get the basics right, and stay at least informed about new stuff.