
Microservices vs. Agentic AI (Part 3): Operations and Costs

In the first two parts of this series we established a clear picture of the fundamental differences between Microservice and Agentic AI architectures. Part 1 traced their origins and motivations, revealing how Microservices evolved to solve software lifecycle and scaling challenges by decomposing applications along business domain lines, while Agentic AI leverages Large Language Model (LLM) breakthroughs to automate complex tasks through autonomous reasoning and action. Part 2 explored the consequences for runtime behavior, contrasting Microservices' reliance on defined APIs, "dumb pipes," deterministic logic, and established patterns for eventual data consistency with Agentic AI's use of intelligent flows, contextual memory, cognitive reliability patterns, and inherent non-determinism.

These foundational and runtime differences inevitably lead to distinct operational challenges and realities. Now, in this third part, we're going to dive into the operations side: How do we effectively operate, scale, and ensure the resilience of these systems in production? What are the real-world complexities surrounding observability, tooling, deployment, and lifecycle management? How do their cost structures differ, and what optimization strategies are available, particularly on AWS? Our goal remains not to prescribe one architecture over the other, but to understand the operational landscape of each, learning from their contrasts to make better-informed design and operational decisions. And to have fun while thinking about complex stuff.

Scale and Resilience Implications

Operating distributed systems at scale always involves managing scalability and resilience, but the specifics differ significantly. Let's use this as the first angle of comparison in this part.

Complexities of Scaling AI Agents

Microservice scalability requires good engineering, but it follows relatively well-understood operational patterns. The primary goal is handling request load for specific services, which involves configuring horizontal scaling mechanisms. On AWS, this means setting up Auto Scaling Groups for EC2 instances, configuring Service Auto Scaling for Amazon ECS tasks or EKS deployments (with Cluster Autoscaler or Karpenter), or relying on the inherent concurrency scaling of AWS Lambda. Monitoring focuses on metrics like CPU utilization, memory usage, request counts, or queue lengths to trigger scaling events. And yeah, this is all more or less automated, but we still need to understand the fine details.
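
To make the "well-understood" part concrete, here's a minimal sketch of configuring target tracking scaling for an ECS service with boto3; the cluster and service names, capacities, and CPU target are hypothetical values you'd replace with your own.

```python
import boto3

# Application Auto Scaling is the service that scales ECS service desired counts.
autoscaling = boto3.client("application-autoscaling")

# Hypothetical cluster/service names; adjust to your environment.
resource_id = "service/orders-cluster/orders-service"

# Register the ECS service's desired count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target tracking policy: keep average CPU around 60%, scaling out and in as needed.
autoscaling.put_scaling_policy(
    PolicyName="orders-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```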

Agentic AI scaling presents a similar picture, even if the names are different. Teams must monitor and manage capacity across multiple components. The main difference is tool maturity. Here are the main parts you need to scale:

  • Foundation Model (FM) Throughput: The core reasoning engine itself can become a bottleneck. Ensuring adequate throughput and acceptable latency for LLM inference (e.g., via Amazon Bedrock or Amazon SageMaker endpoints) is critical. For example, you can use Amazon Bedrock Provisioned Throughput, where you purchase dedicated inference capacity measured in model units per hour to guarantee performance and potentially achieve lower per-token costs at scale, but this requires accurate capacity planning and commitment. This part is very similar to scaling microservices, but the tooling isn't as mature yet: you get either self-hosted models, e.g. in SageMaker (which is comparable to EC2), or a very black-box service like Bedrock or external APIs (comparable to Lambda).

  • Tool Scalability: Each external Tool an agent calls (often an AWS Lambda function or another API) must scale independently to handle the aggregated load from potentially many concurrent agent instances. For this part you can directly use the same techniques as with microservices (in fact, as we'll see later in this article, Tools can easily be microservices). The exception is when you consume external tools, but even then it's the same as when a microservice calls an external API.

  • Knowledge Base (KB) Scalability: For agents using RAG, the vector database (e.g., Amazon OpenSearch Service, Amazon Aurora with pgvector) must handle the concurrent query load generated during the retrieval step. This requires appropriate provisioning (instance sizes, shard counts for OpenSearch) or leveraging serverless options like OpenSearch Serverless or Aurora Serverless and monitoring query latency and resource utilization. You should treat this part just like scaling a regular database (after all, this is a regular database).

  • Orchestration Layer Scalability: The platform or framework managing the agent's execution flow or multi-agent coordination must also scale. For managed platforms like Bedrock Agents, AWS handles this scaling automatically, though it's always important to understand the limits. For custom frameworks deployed on compute like ECS or Lambda, standard application scaling techniques apply but need to account for the specific load patterns of agent orchestration (pay special attention to the time spent waiting for inference).

Operationally, managing agentic scaling requires a more complex baseline: you can't just track overall request and database loads; at a minimum you'll need to look at FM utilization, tool performance, KB query latency, and the health of the orchestration layer itself. Predicting bottlenecks requires understanding the entire execution graph of the agentic task, just like for microservices, but that execution graph tends to be bigger and the tooling is less mature.
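
To make that baseline a bit more concrete, here's a minimal sketch of pulling some of those signals from CloudWatch. It assumes on-demand Bedrock usage (which publishes per-model metrics such as InvocationLatency and InputTokenCount under the AWS/Bedrock namespace) plus a hypothetical custom namespace where your agent publishes tool latency; treat the metric and dimension names as assumptions to verify against your own setup.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

def hourly_average(namespace, metric, dimensions):
    """Fetch the average of a metric over the last hour (None if no data)."""
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return points[0]["Average"] if points else None

# FM utilization signals published by Bedrock for on-demand inference.
model = [{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}]
print("LLM latency (ms):", hourly_average("AWS/Bedrock", "InvocationLatency", model))
print("Avg input tokens:", hourly_average("AWS/Bedrock", "InputTokenCount", model))

# Tool latency: a hypothetical custom metric your orchestration code would publish.
tool = [{"Name": "ToolName", "Value": "lookup-order"}]
print("Tool latency (ms):", hourly_average("AgenticApp", "ToolLatency", tool))
```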

Handling Cognitive Failures Operationally

Microservice resilience focuses on surviving infrastructure failures using patterns like Retries, Timeouts, Circuit Breakers (perhaps configured via AWS App Mesh), Bulkheads, and Redundancy. Operationally, handling these involves monitoring infrastructure health, configuring resilience patterns appropriately, and having automated recovery mechanisms (like instance/container replacement via Auto Scaling). Debugging often involves analyzing logs and traces (using AWS X-Ray) to identify the failing network hop or service instance reporting an error.
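
As a reminder of how mechanical these patterns are on the microservice side, here's a minimal, framework-free sketch of retries with timeouts and exponential backoff around a downstream HTTP call; in practice you'd usually get this from a library, an SDK, or the service mesh rather than hand-rolling it.

```python
import random
import time

import requests  # any HTTP client works; used here only for illustration

def call_with_retries(url, attempts=3, timeout=2.0):
    """Call a downstream service with a timeout, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # give up; a circuit breaker or fallback handles it upstream
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep((2 ** attempt) + random.random())
```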

Agentic AI must handle these infrastructure failures (especially for tool calls) but adds the significant operational challenge of dealing with cognitive failures: the AI producing incorrect, biased, unsafe, or nonsensical outputs (hallucinations) or failing to follow instructions or plan effectively. Detecting and debugging these is operationally different and a lot more difficult:

  • Detection: While infrastructure failures throw clear error signals, cognitive failures might result in a successfully completed task with a subtly wrong or harmful outcome. Detecting this often requires implementing semantic validation checks on the agent's output (typically with Guardrails that can block the response generation), continuous monitoring against predefined quality metrics or 'golden datasets', or even incorporating human feedback loops. Simple health checks are insufficient, and LLMs don't reliably return accurate response codes (200 for OK, 500 for error, etc).

  • Debugging: Imagine an agent consistently failing to extract the correct information using a tool, and imagine you've reliably identified this failure. Debugging a microservice failure might involve checking the tool's logs for errors. Debugging the agent failure, however, requires a multi-faceted investigation: examining the exact prompt sent to the LLM, the specific context provided (including any RAG results), the LLM's generated reasoning trace or intermediate 'thoughts' (if logged, and not all providers give you this), the parameters passed to the tool, the tool's actual response, and how the agent interpreted that response. This requires significantly richer logging and specialized debugging skills focused on the interplay between prompt, context, model behavior, and tool interaction. Standard distributed tracing often lacks the semantic depth required, and again we're faced with the lack of tool maturity.

  • Mitigation Overhead: Implementing AI-specific resilience patterns like Reflection involves managing the logic for self-critique (e.g., have the agent review its plan against constraints or use a separate LLM call to evaluate its own response) and the retry mechanisms, adding latency and cost (more LLM calls). Setting up and monitoring human-in-the-loop workflows for exceptions adds significant process overhead. In the lab, these techniques work very well in most cases (no matter how much people like to criticize LLMs). In the real world, the level of work required, and the cost and especially latency introduced, often make them not really viable. Plus, you'll find a lot fewer people who know what Reflection is compared to the people who know what a Circuit Breaker is.
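
To make Reflection a bit less abstract, here's a minimal sketch using the Bedrock Converse API: a second LLM call critiques the draft answer against the task constraints before we accept it. The model ID, the prompts, and the APPROVED convention are illustrative assumptions, and notice how every revision loop adds more inference calls, i.e. latency and cost.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # example model, pick your own

def ask(prompt):
    """Single Converse API call, returning the model's text response."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def answer_with_reflection(task, constraints, max_revisions=2):
    draft = ask(f"Task: {task}\nConstraints: {constraints}\nAnswer:")
    for _ in range(max_revisions):
        # Self-critique: a separate LLM call reviews the draft against the constraints.
        critique = ask(
            "Review the answer below against the constraints. Reply APPROVED if it "
            "satisfies all of them, otherwise list the problems.\n"
            f"Constraints: {constraints}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("APPROVED"):
            return draft
        # Revise using the critique; each loop adds latency and token cost.
        draft = ask(
            f"Task: {task}\nConstraints: {constraints}\n"
            f"Previous answer: {draft}\nCritique: {critique}\nRevised answer:"
        )
    return draft  # best effort; this is where a human-in-the-loop escalation could go
```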

Observability and Tooling Maturity

I've already mentioned this a lot in the section above, but I think it's worth going beyond “the tooling is less mature” and discussing what capabilities are missing. Just note that this whole field is so new that I forgive all the tool makers, even if I keep calling out their tools.

Seeing the "Why", Not Just the "What"

The need for robust observability is amplified for agentic systems because of their non-determinism and the opacity of LLM reasoning. The standard three pillars of observability provide a baseline:

  • Metrics (Amazon CloudWatch): Useful for tracking infrastructure health (Lambda, databases), tool invocation rates/errors, and critically, LLM usage metrics (token counts, inference latency, as provided by Bedrock or SageMaker).

  • Logs (Amazon CloudWatch Logs, Amazon OpenSearch Service): Essential, but need to capture far more than typical application logs. Effective agent logs must include: the final prompt sent to the LLM (with RAG context), the model configuration used (model ID, temperature), the raw LLM response (including any intermediate reasoning or planned steps), details of tool calls (chosen tool, parameters, response), data retrieved from KBs, and the final output or action taken. Capturing this level of detail systematically is critical, but trust me, logs can get really noisy (there's a sketch of what such a log record can look like after this list).

  • Traces (AWS X-Ray): Useful for tracing requests across agent tool calls if those tools are instrumented microservices or Lambda functions. However, standard tracing tools generally cannot provide visibility inside the LLM's reasoning process or easily correlate distributed trace spans back to the specific semantic context or reasoning step within the agent that initiated them. I know for a fact that AWS is working on improving this, but we'll have to wait a bit more.
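
To illustrate the point about agent logs in the Logs bullet above, here's a minimal sketch of emitting one structured record per agent step, which you could ship to CloudWatch Logs and query with Logs Insights; the field names are my own convention, not any standard.

```python
import json
import logging
import time

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO)

def log_agent_step(session_id, step, prompt, model_config, llm_response,
                   tool_calls=None, retrieved_chunks=None, final_output=None):
    """Emit one structured JSON record per reasoning/tool step."""
    record = {
        "timestamp": time.time(),
        "session_id": session_id,
        "step": step,
        "prompt": prompt,                    # final prompt sent, including RAG context
        "model_config": model_config,        # model ID, temperature, etc.
        "llm_response": llm_response,        # raw response, incl. reasoning if available
        "tool_calls": tool_calls or [],      # chosen tool, parameters, tool response
        "retrieved_chunks": retrieved_chunks or [],  # KB data used as context
        "final_output": final_output,
    }
    # One JSON object per log line; easy to query with CloudWatch Logs Insights.
    logger.info(json.dumps(record, default=str))

# Example usage for one step of a hypothetical agent session.
log_agent_step(
    session_id="session-123",  # correlate all steps of one user request
    step=1,
    prompt="...",
    model_config={"model_id": "anthropic.claude-3-haiku-20240307-v1:0", "temperature": 0.2},
    llm_response="...",
)
```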

Observability 2.0 (which involves wide events as a single source of truth) is, to my knowledge, still absent in the Agentic AI space. I'll update this article if I hear something different.

The primary observability challenge for Agentic AI is achieving visibility into the "Why?" behind an agent's behavior. This requires specialized approaches, potentially integrating detailed logging with tracing tools or using specific features within agentic frameworks or platforms designed for visualizing execution graphs and reasoning steps. We can use standards like OpenTelemetry for LLM calls, but we're still lacking any standards specifically for representing and observing agent reasoning paths, making deep debugging and performance analysis harder than in the more standardized microservice world. Consequently, when planning an agentic system, teams must budget significant effort not just for development, but for building custom observability solutions or carefully evaluating the introspection capabilities offered by managed platforms.
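
On the OpenTelemetry side, here's a minimal sketch of wrapping an LLM call in a span and attaching the attributes you'd want to correlate later. The attribute names follow the spirit of the emerging GenAI semantic conventions, but treat them (and the call_llm function) as assumptions rather than a settled standard.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.orchestrator")

def invoke_llm_with_span(call_llm, prompt, model_id):
    """Wrap an LLM call in a span carrying semantic attributes for later correlation."""
    with tracer.start_as_current_span("llm.invoke") as span:
        # Attribute names loosely follow the draft GenAI semantic conventions.
        span.set_attribute("gen_ai.request.model", model_id)
        span.set_attribute("gen_ai.prompt.length", len(prompt))

        # call_llm is whatever function wraps your Bedrock/SageMaker invocation
        # and returns the response text plus token counts.
        result = call_llm(prompt, model_id)

        span.set_attribute("gen_ai.usage.input_tokens", result.get("input_tokens", 0))
        span.set_attribute("gen_ai.usage.output_tokens", result.get("output_tokens", 0))
        return result
```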

Tooling Maturity and Platform Dependence

This observability gap reflects a broader difference in ecosystem maturity. The microservice world benefits from years of development and standardization efforts, particularly via the Cloud Native Computing Foundation (CNCF) centered around Kubernetes. This provides a rich ecosystem of mature, often interchangeable, open-source and commercial tools for orchestration (Kubernetes itself via Amazon EKS, or alternative orchestrators like Amazon ECS), service mesh (Istio, Linkerd, AWS App Mesh), CI/CD, security scanning, monitoring, logging, and tracing. This standardization grants significant platform portability and flexibility, and more importantly, it turns microservice operations into a known and mostly solved problem (even if the solution is hard and expensive).

The operational tooling landscape for Agentic AI (MLOps/LLMOps) is much younger, more fragmented, and lacks comparable standardization. While excellent libraries exist for building agents, the tools for deploying, managing, observing, and securing complex multi-agent systems at scale are still rapidly evolving. There is no "Kubernetes for Agents" providing a universally accepted, platform-agnostic operational control plane.

This gap forces difficult operational choices:

  • DIY Approach: Teams can build agentic systems using open-source frameworks and fundamental AWS services (Lambda, Step Functions, SQS, EventBridge, SageMaker for model hosting, OpenSearch for KBs). This offers maximum flexibility and avoids lock-in but requires significant engineering effort to build and maintain the custom orchestration, state management, observability, and deployment infrastructure reliably.

  • Integrated Platform Approach: Leveraging platforms like Amazon Bedrock Agents provides pre-built, managed capabilities for agent creation, orchestration, tool integration, knowledge bases, and potentially simplified deployment and logging. This dramatically accelerates development but often results in significant platform dependence, tying the architecture and operational model to AWS-specific services and abstractions, potentially limiting future portability or requiring rework if migrating later. You're reading a newsletter called Simple AWS, so you know I'm cool with being locked in with AWS, but it's always important to call this out.

This gap in tooling was also one of my main drivers for writing this series. I was hoping to find lessons from microservices that we could apply to AI agents, but most things weren't that useful or easy to transfer to this domain.

Architects must therefore factor MLOps/LLMOps tooling maturity and the resulting platform dependence trade-offs into their risk assessment and strategic platform choices when designing agentic systems.

Deployment and Lifecycle: MLOps and LLMOps

MLOps and LLMOps aren't terms that most architects know about, but I'll give you a tip: ML engineers/architects who've been doing traditional ML since before ChatGPT know a lot of useful things about these topics; you should talk to them. I've been doing that a lot lately, and this section is a result of that.

Managing the AI Artifact Lifecycle

Microservice deployment typically follows mature Continuous Integration and Continuous Delivery (CI/CD) practices focused on code artifacts. Automated pipelines (using tools like AWS CodePipeline, CodeBuild, and CodeDeploy, or alternatives) handle building container images or function packages, running automated tests (unit, integration), deploying via strategies like Blue/Green or Canary, and monitoring the release.

Deploying and managing Agentic AI systems requires a slightly different set of practices called MLOps (Machine Learning Operations) and LLMOps, encompassing a more complex set of artifacts:

  • Models: Versioning, deploying new or fine-tuned FMs, A/B testing performance, monitoring for drift or regressions.

  • Prompts: Treating prompts as code: version control, automated testing (evaluating behavioral impact of prompt changes), safe rollout strategies. Prompt engineering is iterative and requires tight feedback loops.

  • Knowledge Bases: Pipelines for updating KB data, re-indexing vector stores, validating data quality, and managing different KB versions.

  • Tools: Standard CI/CD for the underlying code implementing agent tools.

  • Agent Configuration/Orchestration: Versioning and deploying the definitions that tie models, prompts, tools, and workflows together. With Bedrock you can treat this as infrastructure.

  • Evaluation: CI/CD pipelines must integrate rigorous automated evaluation suites that test the behavior and quality of agent responses against predefined benchmarks or criteria, going far beyond simple code compilation or unit tests for tools. I'll write more about this in another article.
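
As a small illustration of what that evaluation step can look like, here's a sketch of a pytest-style regression test that runs the agent against a golden dataset and fails the pipeline if quality drops. The run_agent entry point, the dataset file, and the crude fact-matching check are all placeholders for whatever your system actually uses (many teams use an LLM judge or embedding similarity instead).

```python
import json

import pytest

from my_agent import run_agent  # hypothetical entry point to your agent

# Golden dataset: curated inputs plus facts the answer is expected to contain.
with open("golden_dataset.json") as f:
    GOLDEN_CASES = json.load(f)

def contains_expected_facts(answer, expected_facts):
    """Crude semantic check; real suites often use an LLM judge or embeddings."""
    return all(fact.lower() in answer.lower() for fact in expected_facts)

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["id"])
def test_agent_against_golden_dataset(case):
    answer = run_agent(case["input"])
    assert contains_expected_facts(answer, case["expected_facts"]), (
        f"Agent response regressed for case {case['id']}: {answer!r}"
    )
```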

This composite lifecycle, managing interconnected changes across code, models, data, and prompts, is inherently more complex and requires specialized MLOps/LLMOps tooling and processes that are still maturing. It's not that we don't know how to manage each of these parts individually. The problem is that the behavior of the system is determined by the combination of all of these parts, so we need to manage them as a single set, not independent variables.

Deployment Independence

The fact that these artifacts are so interconnected makes independent deployments more difficult. While a microservice team can confidently deploy their updated service if the API contract holds (that's one of the core points of microservices), updating an agent's prompt might subtly change its interaction with multiple tools or its interpretation of context, requiring broader behavioral testing. Swapping an underlying FM version often requires re-evaluating all associated prompts and tool interactions. This tighter coupling, especially within integrated platforms, makes the practical reality of independent deployment for agentic systems currently less attainable than the ideal achieved in well-architected microservice systems. Again, maturity.

AI vs. Microservices Costs

Cost management is another area where the paradigms diverge significantly. In theory you could run your own models or open source ones in your own infrastructure, and the comparison becomes moot. In practice you'll often find yourself using proprietary models, and that's where you need to understand inference costs and tokenomics.

Contrasting Cost Models: GB-seconds and Tokens

Microservice costs on AWS are primarily driven by infrastructure consumption, scaling relatively predictably with usage: compute time (EC2, Fargate, Lambda), database capacity and I/O (RDS, DynamoDB), data storage (S3, EBS), network throughput (ELB, API Gateway, Data Transfer), and messaging volume (SQS, SNS, EventBridge). Costs can be tracked using standard cloud practices and tools like AWS Cost Explorer, and optimized with Savings Plans and Reserved Instances. If we use serverless compute, this is measured in GB-seconds (hence this section's title).

Agentic AI costs include these infrastructure components (for tools, KBs, orchestration compute) but are frequently dominated by a factor unique to AI: Foundation Model inference costs, typically priced per token. These are the main points you need to understand:

  • Token Calculation: Costs apply to both input tokens (the prompt, including instructions, context, RAG results, chat history) and output tokens (the generated response, reasoning steps, or tool parameters). Longer interactions, more complex reasoning, or verbose outputs directly increase costs (there's a worked example after this list).

  • Model Variance: Costs per token vary dramatically between different LLMs. Larger, state-of-the-art models like Claude 3.7 Sonnet can be orders of magnitude more expensive than smaller, faster models like Llama 4 Scout 17B.

  • Task Complexity Impact: A simple agent query might use a few hundred tokens. A complex multi-step task involving extensive reasoning, RAG lookups, and multiple tool interactions could consume tens or even hundreds of thousands of tokens for a single user request. This makes cost highly sensitive to workflow design.
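
Here's the worked example promised above; the per-1,000-token prices are placeholder values I made up for illustration, so check the current Bedrock pricing page before using anything like this for real planning.

```python
# Hypothetical per-1,000-token prices in USD; real prices vary by model and region.
PRICES = {
    "large-model": {"input": 0.003, "output": 0.015},    # a frontier-class model
    "small-model": {"input": 0.0002, "output": 0.0006},  # a small, fast model
}

def interaction_cost(model, input_tokens, output_tokens):
    """Cost of a single interaction given token counts and per-1K-token prices."""
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

# A simple query vs. a complex multi-step agent task, on both models.
print(interaction_cost("large-model", 500, 200))        # ~$0.0045
print(interaction_cost("large-model", 80_000, 20_000))  # ~$0.54
print(interaction_cost("small-model", 80_000, 20_000))  # ~$0.028
```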

This token-based pricing makes agentic system costs potentially much more variable and harder to predict than typical microservice costs, which scale more directly with request volume or provisioned infrastructure. A small change in user input could trigger a much longer reasoning path, drastically increasing the token count and cost for that single interaction. Implementing robust cost monitoring, potentially with per-user or per-task tracking and alerts specifically for token consumption, becomes more important than ever.
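
A minimal sketch of that kind of tracking: publish token counts as custom CloudWatch metrics with per-task and per-user dimensions, then alarm on them like any other metric. The namespace and dimension names are my own, not an AWS convention, and keep an eye on dimension cardinality if you have many users.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_token_usage(task_type, user_id, input_tokens, output_tokens):
    """Publish token consumption as custom metrics for per-task cost tracking."""
    dimensions = [
        {"Name": "TaskType", "Value": task_type},
        {"Name": "UserId", "Value": user_id},
    ]
    cloudwatch.put_metric_data(
        Namespace="AgenticApp/Tokens",  # custom namespace, pick your own
        MetricData=[
            {"MetricName": "InputTokens", "Dimensions": dimensions,
             "Value": input_tokens, "Unit": "Count"},
            {"MetricName": "OutputTokens", "Dimensions": dimensions,
             "Value": output_tokens, "Unit": "Count"},
        ],
    )

# Call this after each agent interaction, using the token counts the model
# provider returns (the Bedrock Converse API includes them in its usage field).
record_token_usage("order-support", "user-123", 12_450, 1_830)
```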

Optimization Strategies: Infrastructure Tuning vs. AI Efficiency

Microservice cost optimization primarily focuses on infrastructure efficiency: right-sizing compute instances and Lambda memory, using AWS Savings Plans or Reserved Instances for baseline load, leveraging cheaper Spot Instances where applicable, optimizing database performance, and implementing data caching with services like Amazon ElastiCache.

Agentic AI also requires these infrastructure optimizations for its non-AI components, but must also prioritize AI efficiency:

  • Strategic Model Selection (Cascading): This is often the most impactful optimization. Use the simplest, cheapest model suitable for each step. Design workflows that route simple tasks to cheap models and only invoke powerful, expensive models when complex reasoning is truly needed (there's a routing sketch after this list).

  • Prompt Engineering: Meticulously craft prompts to be concise, reducing input tokens. Guide the model to produce shorter, focused outputs where appropriate to reduce output tokens.

  • Managing FM Costs: For predictable high volumes, evaluate Amazon Bedrock Provisioned Throughput to potentially lower per-token costs compared to on-demand usage, although this requires accurate capacity planning.

  • Efficient Workflow Design: Minimize unnecessary LLM calls within an agent's workflow. Optimize the number of reasoning steps or tool interactions required.

  • Effective Caching: Cache LLM responses (if determinism allows and inputs repeat), knowledge base lookups, or results from idempotent tools to avoid redundant computation and inference costs.
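
Here's the routing sketch promised in the first bullet of this list: a cheap classifier call decides whether the request actually needs the expensive model. The model IDs and the classification prompt are illustrative assumptions; in production you'd also want a fallback for when the classifier's answer is ambiguous.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Example model IDs; swap in whatever cheap/expensive pair you actually use.
CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
EXPENSIVE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def converse(model_id, prompt):
    """Single-turn call via the Bedrock Converse API."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def answer(user_request):
    # Step 1: the cheap model classifies the request's complexity.
    verdict = converse(
        CHEAP_MODEL,
        "Classify this request as SIMPLE or COMPLEX. Reply with one word.\n"
        f"Request: {user_request}",
    )
    # Step 2: only pay for the expensive model when it's actually needed.
    model = EXPENSIVE_MODEL if "COMPLEX" in verdict.upper() else CHEAP_MODEL
    return converse(model, user_request)
```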

For agentic systems, tuning prompts and strategically selecting models often gets you greater cost savings than purely optimizing the underlying Lambda function memory. That's because in agentic systems, and in generative AI systems in general, costs are typically dominated by inference. So, follow the usual advice: start by optimizing the biggest thing you can find.

Part 3 Conclusion

So far in this series we've navigated the complex relationship between Microservice and Agentic AI architectures. We started by discussing their distinct origins and foundational principles, moved through the runtime dynamics regarding communication, state, and predictability, and now we've examined the practical realities of operating these systems.

While both patterns employ decomposition to manage complexity, they are fundamentally different tools designed for different primary purposes. Microservices offer a mature, robust paradigm for structuring large applications around business domains, optimizing for engineering lifecycle agility, operational scalability, and reliability through well-understood patterns and tooling. Agentic AI provides a powerful, rapidly evolving paradigm for automating complex tasks, enabling autonomous reasoning and action, and creating intelligent interactions by orchestrating LLMs, tools, and knowledge.

We've gone through all the comparisons I considered worth exploring. In the next part, part 4 of this series, we'll explore hybrid architectures and how we can apply both patterns at the same time. Stay tuned!
