Event-Driven Architecture patterns

It's hard to talk about Event-Driven Architecture without resorting to patterns. But that usually results in a lot of confusion: slapping the name of a pattern on something that is definitely not that pattern, and, above all, overcomplicating architectures. So, let's talk about some patterns, what they are, what they're good for, and how much they will cost you. That last part is important, because you don't always enjoy the benefits of patterns, but you always pay the cost.

Event Sourcing

Event Sourcing is a shift in how we persist state. Instead of recording just the current state, you capture every change to an application's state as a sequence of events. The current state can then be obtained by replaying the events in order.

Without Event Sourcing we'd store the current state of our data, and that's it. For example, in a banking application, we might store an account balance as a single value. When a transaction occurs, we update this value directly. Event Sourcing takes a different approach:

  1. Instead of storing the current state, we store a sequence of events that led to that state.

  2. The current state is derived by replaying these events.

  3. New changes are recorded as new events, appended to the event stream.

I think this one is easier to explain with code, so let's look at an implementation in Python:

from datetime import datetime
from typing import List, Dict, Any

class Event:
    def __init__(self, event_type: str, data: Dict[str, Any]):
        self.event_type = event_type
        self.data = data
        self.timestamp = datetime.now()

class Account:
    def __init__(self, account_id: str):
        self.account_id = account_id
        self.balance = 0
        self.events: List[Event] = []

    def apply_event(self, event: Event):
        if event.event_type == 'DEPOSIT':
            self.balance += event.data['amount']
        elif event.event_type == 'WITHDRAW':
            self.balance -= event.data['amount']
        self.events.append(event)

    def deposit(self, amount: float):
        event = Event('DEPOSIT', {'amount': amount})
        self.apply_event(event)

    def withdraw(self, amount: float):
        if amount > self.balance:
            raise ValueError("Insufficient funds")
        event = Event('WITHDRAW', {'amount': amount})
        self.apply_event(event)

    def get_balance(self) -> float:
        return self.balance

    def reconstruct_state(self, point_in_time: datetime = None) -> 'Account':
        reconstructed_account = Account(self.account_id)
        for event in self.events:
            if point_in_time and event.timestamp > point_in_time:
                break
            reconstructed_account.apply_event(event)
        return reconstructed_account

class EventStore:
    def __init__(self):
        self.events: Dict[str, List[Event]] = {}

    def save_events(self, account_id: str, events: List[Event]):
        if account_id not in self.events:
            self.events[account_id] = []
        self.events[account_id].extend(events)

    def get_events(self, account_id: str) -> List[Event]:
        return self.events.get(account_id, [])

# Usage
event_store = EventStore()

account = Account("123")
account.deposit(100)
account.withdraw(30)

event_store.save_events(account.account_id, account.events)

# Reconstruct account state
reconstructed_account = Account("123")
for event in event_store.get_events("123"):
    reconstructed_account.apply_event(event)

print(f"Reconstructed balance: {reconstructed_account.get_balance()}")  # Output: 70

The EventStore class persists events in a generic way that's independent of the domain model. It's not that hard to implement, right? The key thing is what it means for the entire architecture.

Implications and Complexities

While the concept of Event Sourcing is straightforward, its implications and implementation details can be quite complex:

Event Schema Evolution

As the system evolves, you may need to change the structure of your events. This can be challenging because you have a historical log of events in potentially outdated formats. Solutions include:

  • Versioning your events

  • Implementing event upcasters to transform old event formats to new ones

  • Using schema-less event storage (e.g., storing events as JSON blobs)

Here's a simple implementation of event upcasting:

class EventUpcaster:
    @staticmethod
    def upcast(event: Event) -> Event:
        if event.event_type == 'DEPOSIT' and 'version' not in event.data:
            # Old version of the DEPOSIT event
            upcasted_data = event.data.copy()
            upcasted_data['version'] = 1
            upcasted_data['currency'] = 'USD'  # Assume USD for old events
            return Event('DEPOSIT', upcasted_data)
        return event

# In the Account class:
def apply_event(self, event: Event):
    event = EventUpcaster.upcast(event)
    # Rest of the code

Snapshot Management

For long-lived entities with many events, reconstructing the current state can be time-consuming. Implementing a snapshotting mechanism is a good way to reduce these delays:

class AccountSnapshot:
    def __init__(self, account_id: str, balance: float, last_event_id: int):
        self.account_id = account_id
        self.balance = balance
        self.last_event_id = last_event_id

class SnapshotStore:
    def __init__(self):
        self.snapshots: Dict[str, AccountSnapshot] = {}

    def save_snapshot(self, snapshot: AccountSnapshot):
        self.snapshots[snapshot.account_id] = snapshot

    def get_latest_snapshot(self, account_id: str) -> AccountSnapshot:
        return self.snapshots.get(account_id)

# In the Account class:
def create_snapshot(self) -> AccountSnapshot:
    # Capture the current balance and the index of the last event it reflects
    return AccountSnapshot(self.account_id, self.balance, len(self.events) - 1)

def apply_snapshot(self, snapshot: AccountSnapshot):
    # Restore state to the snapshot point; only events after it need to be replayed
    self.balance = snapshot.balance
    self.events = self.events[:snapshot.last_event_id + 1]
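
To show why this helps, here's a minimal usage sketch that rebuilds an account from the latest snapshot plus only the events recorded after it. The wiring below (setting the balance directly and slicing the event list) is an assumption for illustration, not the only way to do it:

# Usage (reusing the Account, EventStore and SnapshotStore classes above)
event_store = EventStore()
snapshot_store = SnapshotStore()

account = Account("123")
account.deposit(100)
account.withdraw(30)
snapshot_store.save_snapshot(account.create_snapshot())  # snapshot at balance 70

account.deposit(50)  # one more event after the snapshot
event_store.save_events(account.account_id, account.events)

# Later: restore from the snapshot and replay only the newer events
snapshot = snapshot_store.get_latest_snapshot("123")
rebuilt = Account("123")
rebuilt.balance = snapshot.balance
for event in event_store.get_events("123")[snapshot.last_event_id + 1:]:
    rebuilt.apply_event(event)

print(rebuilt.get_balance())  # 120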

Concurrency and Conflict Resolution

When multiple processes are appending events to the same stream, you need to handle concurrent modifications. Optimistic concurrency control is often used to resolve conflicts.

class ConcurrencyException(Exception):
    pass

class EventStore:
    # ...

    def save_events(self, account_id: str, events: List[Event], expected_version: int):
        current_events = self.events.get(account_id, [])
        if len(current_events) != expected_version:
            raise ConcurrencyException("Concurrent modification detected")
        self.events.setdefault(account_id, []).extend(events)

# Usage
try:
    event_store.save_events(account.account_id, account.events, len(event_store.get_events(account.account_id)))
except ConcurrencyException:
    # Handle the conflict, e.g., by retrying the operation
    pass

Performance Considerations

Like any other complex solution, Event Sourcing can introduce a noticeable performance overhead:

  • Reading the current state requires replaying events, which can be slow for entities with long histories.

  • Appending events is generally fast, but can slow down as the event store grows.

  • Querying can be challenging and may require separate read models (often leading to CQRS, which we'll discuss a bit later in this article).

Event Store Scalability

As your system grows, your event store needs to scale accordingly. Consider partitioning your event store based on aggregate roots or event types, and/or using technology built to handle large event volumes, such as Apache Kafka, for large-scale systems. DynamoDB can also work well.
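
As a rough illustration of partitioning by aggregate root, here's a minimal sketch. The hash function and the fixed partition count are assumptions; in practice you'd lean on the partitioning built into whatever store you pick:

import hashlib

NUM_PARTITIONS = 8  # assumed fixed partition count for this sketch

def partition_for(account_id: str) -> int:
    # All events for the same aggregate land in the same partition,
    # preserving per-aggregate ordering while spreading load across partitions.
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("123"), partition_for("456"))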

Eventual Consistency

Event Sourcing often leads to eventual consistency, especially when combined with CQRS. This can be challenging for developers and users accustomed to immediate consistency.

Debugging and Auditing

While Event Sourcing provides excellent auditing capabilities, debugging can become more complex, since you need tools to inspect and understand the event stream. Additionally, reproducing issues might require replaying a specific sequence of events.

Use Cases and Trade-offs

Event Sourcing is meant to address the following needs:

  • Comprehensive auditing and regulatory compliance

  • Temporal querying and historical state reconstruction

  • Complex domain models where the sequence of changes is crucial

  • Debugging and system reconciliation

However, it comes with significant disadvantages:

  • Increased storage requirements and potential performance issues

  • Added complexity in system design and querying

  • Learning curve for developers unfamiliar with the pattern

  • Potential eventual consistency issues

Strangler Fig Pattern

The Strangler Fig pattern, inspired by the strangler fig vines that gradually overtake and replace their host trees, provides a strategy for incrementally migrating a monolithic application to a different architecture pattern, typically microservices.

The core idea of the Strangler Fig pattern is to gradually create and deploy new applications and services around the existing monolithic application, slowly replacing its functionality until the monolith can be decommissioned. This approach allows for a gradual transition, reducing risk and allowing the system to evolve over time.

Implementing the Strangler Fig pattern typically involves these steps:

  1. Identify: Choose a bounded context or module within the monolith to migrate.

  2. Build: Implement the new functionality as a separate service.

  3. Redirect: Use a façade or proxy to route requests to either the monolith or the new service (see the sketch after these steps). API Gateway works pretty well for this.

  4. Replace: Gradually increase traffic to the new service and remove the old functionality from the monolith.
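
To make the Redirect step concrete, here's a minimal sketch of a routing façade. The path prefixes and service URLs are made up for illustration, and in practice an API Gateway or reverse proxy usually plays this role rather than hand-written code:

# Minimal routing façade sketch: route by URL path prefix.
# The prefixes and base URLs below are illustrative assumptions.
MONOLITH_URL = "https://monolith.internal.example.com"

MIGRATED_PREFIXES = {
    "/orders": "https://orders.internal.example.com",     # already extracted
    "/invoices": "https://billing.internal.example.com",
}

def route(path: str) -> str:
    # Return the base URL that should handle this request path
    for prefix, service_url in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return service_url
    return MONOLITH_URL  # everything else still goes to the monolith

print(route("/orders/42"))    # routed to the new orders service
print(route("/customers/7"))  # still handled by the monolith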

Implications and Complexities

The Strangler Fig pattern provides a pragmatic approach to migrating monoliths, and it's most often used on legacy systems. It gives you a much easier and less risky path than an all-at-once migration, but don't think for a second it's easy. Here are some of the core challenges:

Routing Complexity: As you migrate more functionality, your routing logic can become increasingly complex (a small sketch of header- and user-based routing follows the list). You may need to consider:

  • URL-based routing

  • Header-based routing

  • User-based routing (e.g., for A/B testing)

  • Content-based routing
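
For example, a gradual rollout might combine header-based and user-based routing, as in this minimal sketch. The header name and rollout percentage are assumptions, and a feature-flag service would often handle this instead:

import hashlib
from typing import Dict

ROLLOUT_PERCENTAGE = 20  # assumption: send 20% of users to the new service

def use_new_service(headers: Dict[str, str], user_id: str) -> bool:
    # Header-based override, e.g. for internal testers
    if headers.get("X-Use-New-Service") == "true":
        return True
    # User-based rollout: each user gets a stable bucket, so their experience doesn't flip-flop
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENTAGE

print(use_new_service({}, "user-42"))
print(use_new_service({"X-Use-New-Service": "true"}, "user-99"))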

Data Synchronization: As you migrate functionality, you'll often need to migrate or replicate data. This can lead to challenges in maintaining data consistency between the old and new systems. Strategies to handle this include:

  • Dual writes: Write data to both old and new systems

  • Change Data Capture (CDC): Capture changes in the source system and replicate them to the target system. You can do this with DynamoDB Streams (see the sketch after this list).

  • ETL processes: Periodically sync data between systems
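
As a rough sketch of the CDC approach, here's a handler that applies change records from the old system to the new one. The record shape is a simplified assumption, not the exact DynamoDB Streams event format, and apply_to_new_system is a hypothetical stand-in for a write into the new system:

# Simplified CDC handler sketch. The record shape below is an assumption,
# not the exact DynamoDB Streams event format.
from typing import Any, Dict, List

def apply_to_new_system(operation: str, key: str, data: Dict[str, Any]) -> None:
    # Hypothetical stand-in for a write into the new service's datastore
    print(f"{operation} {key}: {data}")

def handle_change_records(records: List[Dict[str, Any]]) -> None:
    for record in records:
        operation = record["operation"]       # e.g. "INSERT", "MODIFY", "REMOVE"
        key = record["key"]
        data = record.get("new_image", {})
        apply_to_new_system(operation, key, data)

handle_change_records([
    {"operation": "INSERT", "key": "order#42", "new_image": {"total": 99.5}},
])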

Testing Challenges: Integration testing becomes more complex as you need to test interactions between old and new systems. Above all you need to ensure that both systems produce consistent results for the same input. Additionally, performance testing needs to account for the additional routing layer.

Monitoring and Observability: As you split your monolith, you need to adapt your monitoring and observability practices:

  • Implement distributed tracing to track requests across old and new systems.

  • Set up comprehensive logging that covers both systems.

  • Create dashboards that give you a complete view of the entire system.

Feature Parity: Feature parity means making sure both the old and the new system work the same and offer the same features. A good tip to achieve this is to write comprehensive acceptance tests. Implementing feature flags to easily switch between old and new implementations also helps significantly, especially so you don't go crazy. And I suggest you plan for a period of parallel running where both systems are active (a bit like the multi-site active-active disaster recovery strategy).

Performance Overhead: The routing layer introduces some performance overhead. Consider caching strategies, and as always monitor and tune performance regularly.

Organizational Challenges: The Strangler Fig pattern itself doesn't really change your organization, but the migration and the new system architecture might. You may need to restructure teams to align with the new service boundaries, and there's often a period where teams need to be familiar with both the old and new systems. Plus, keep in mind you need to manage the cultural shift from monolithic to microservices thinking. Remember Conway's Law!

Use Cases and Trade-offs

The Strangler Fig pattern is particularly useful when:

  • Migrating large, complex monoliths that can't be replaced all at once

  • Needing to maintain system stability during migration

  • Wanting to incrementally adopt new technologies or architectural patterns

  • Reducing the risk associated with big-bang rewrites

But keep in mind these challenges:

  • Increased complexity during the migration period

  • Potential performance overhead from the routing layer

  • Need for careful planning and coordination

  • Possible temporary duplication of code and data

Command Query Responsibility Segregation (CQRS)

Command Query Responsibility Segregation (CQRS) is an architectural pattern that separates read and write operations for a data store. The idea doesn't seem complex if you think about it conceptually. But it's not just different lines of code: CQRS proposes separating reads from writes at all levels, even at the infrastructure level. That's when you realize how complex it can get.

The Essence of CQRS

At its core, CQRS is about recognizing that as systems grow in complexity, the requirements and patterns for reading data can become significantly different from those for writing data. This recognition leads to a fundamental split in how we model and process these operations.

  1. Commands: These are operations that change the state of the system. They represent the "write" side of the equation and include creates, updates, and deletes. Commands are named with imperative verbs and typically include all the information needed to perform the operation. For example, "CreateOrder", "UpdateCustomerAddress", or "CancelSubscription".

  2. Queries: These are operations that read data from the system without causing any state changes. They represent the "read" side. Queries are often named with words like "Get" or "Find", such as "GetOrderDetails" or "FindCustomersByRegion".

In a CQRS system, these two types of operations are handled by separate models:

  1. Command Model: This model is responsible for processing commands. It encapsulates all the business logic required to validate commands, ensure business rules are followed, and apply the resulting state changes. The command model is optimized for writing data and maintaining data integrity.

  2. Query Model: This model is designed to efficiently handle read operations. It's often denormalized and structured in a way that makes it easy to retrieve and present data in the formats required by the user interface or API consumers. The query model is optimized for reading data and can be tailored to specific use cases.

The separation of these models allows each to be optimized for its specific purpose. The command model can focus on consistency and integrity, while the query model can be structured for performance and ease of use.

One of the key aspects of CQRS is that the command and query models don't have to use the same data schema or even the same database technology. For example, the command model might use a database that supports ACID transactions (it doesn't have to be relational, DynamoDB supports transactions), and the query model could use an in-memory data grid for fast reads.

The flow of data in a CQRS system typically follows this pattern (a minimal code sketch follows the list):

  1. A command is received and validated.

  2. The command handler processes the command, applying business logic and making necessary state changes.

  3. These changes are persisted in the write database.

  4. An event is published indicating what change occurred (e.g. with DynamoDB Streams).

  5. One or more event handlers receive this event and update the read model accordingly.

  6. Queries are served from the read model.
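
Here's a minimal, in-memory sketch of that flow. The command, event, and handler names are assumptions for illustration, and everything runs in one process; a real system would put the write store, the event publishing, and the read model on separate infrastructure:

from dataclasses import dataclass
from typing import Callable, Dict, List

# --- Write side (command model) ---
@dataclass
class CreateOrder:           # a command: imperative, carries everything needed
    order_id: str
    total: float

@dataclass
class OrderCreated:          # the event published after the command succeeds
    order_id: str
    total: float

write_store: Dict[str, dict] = {}        # stand-in for the write database
event_handlers: List[Callable] = []      # stand-in for the event bus

def handle_create_order(cmd: CreateOrder) -> None:
    if cmd.total <= 0:
        raise ValueError("Order total must be positive")  # business rule
    write_store[cmd.order_id] = {"total": cmd.total}      # persist the change
    event = OrderCreated(cmd.order_id, cmd.total)
    for handler in event_handlers:                        # publish the event
        handler(event)

# --- Read side (query model) ---
read_model: Dict[str, dict] = {}         # denormalized view, optimized for reads

def project_order_created(event: OrderCreated) -> None:
    read_model[event.order_id] = {"order_id": event.order_id,
                                  "total": event.total,
                                  "status": "NEW"}

event_handlers.append(project_order_created)

def get_order_details(order_id: str) -> dict:
    return read_model[order_id]          # queries never touch the write store

# Usage
handle_create_order(CreateOrder("order-1", 49.90))
print(get_order_details("order-1"))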

This separation and flow lead to several important characteristics of CQRS systems:

  • Flexibility: The read and write models can evolve independently, allowing for greater flexibility in system design and evolution.

  • Scalability: Read and write operations can be scaled independently, which is particularly useful in read-heavy systems.

  • Optimization: Each model can be optimized for its specific use case, potentially leading to better performance.

  • Complexity Management: By separating concerns, CQRS can help manage complexity in large, complex domains.

I hope the first thing that came to mind when reading that was "wow, complex!". Let's get a bit more specific about that complexity, and the downsides.

Implications and Complexities

CQRS inherently adds complexity to a system. Instead of a single model, you're now dealing with separate read and write models, along with the infrastructure to keep them synchronized. This definitely makes the system more difficult to understand, develop, and maintain, even if you are familiar with the CQRS pattern (and I'd bet most devs aren't).

Eventual Consistency is another problem. In many CQRS implementations, especially those using event sourcing, the read model is updated asynchronously after the write model is changed. This means you're dealing with eventual consistency, where there's a delay between when data is written and when it becomes available for reading. Not an unheard-of problem, but it's yet another thing to take into account.

And that's if the data synchronization doesn't break! Keeping the read and write models synchronized is not trivial. You need to ensure that all changes in the write model are correctly and promptly reflected in the read model, and account for the possibility that the write to the write model may succeed but the subsequent write to the read model may fail. You need a way to detect those failures, retry those operations, and log everything, so your database isn't left in an inconsistent state.
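
Here's a minimal sketch of that failure handling, assuming a hypothetical update_read_model projection function and a simple bounded retry with logging; a real system would more likely push failed events to a queue with dead-lettering:

import logging
import time

logger = logging.getLogger("read_model_sync")

def update_read_model(event: dict) -> None:
    # Hypothetical projection call; may raise on transient failures
    ...

def project_with_retry(event: dict, max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            update_read_model(event)
            return
        except Exception:
            logger.exception("Read model update failed (attempt %d/%d)", attempt, max_attempts)
            time.sleep(2 ** attempt)  # simple exponential backoff
    # After exhausting retries, park the event somewhere for inspection and replay
    logger.error("Giving up on event %r, sending it to dead-letter storage", event)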

Versioning and Schema Evolution is also made more fun than usual (in a bad way). Updating the database schema is more or less a solved problem, for which the solution is known but not trivial. But now you're dealing with two separate schemas, which are compatible (they need to store the exact same data, after all), but likely somewhat different (if they weren't you couldn't get different performance out of them).

Querying can also get a bit more complex, which is very counter-intuitive because you're supposed to be implementing CQRS to improve querying. The problem is that most read models are optimized for some specific access patterns, and not for general analytics. So if you have multiple ways to read data from the application (e.g. simple reads from easily cacheable data, complex queries that are repetitive, and unrestricted analytics) you'll need multiple read models. Not all of those read models need all of the data (in fact it's preferable if you don't replicate all the data), but if you ever need to do cross-model reads because the data is spread out, my heart goes out to you.

We're making reads and writes distributed and eventually consistent, so of course Testing will be harder. An important insight is that you've separated reads and writes at the conceptual level of your application. Writing data is one feature solved by one part of the application, and reading data is a completely separate feature solved by a completely separate part of the application, so you need to test each on its own and then test that they eventually converge on the same data.

Operational complexity also goes up significantly in a CQRS system. You suddenly need to monitor and manage multiple data stores, build redundancy and ensure failures don't cascade, and handle errors like event processing failures or data inconsistencies. If you're already running multiple distributed microservices, this may not be that hard, though it does add some complexity. If you're adding CQRS to a monolith, well... just don't.

Performance should improve, but it may not be worth it. Queries for which you optimize read models will nearly always be faster, but all the increased costs (especially engineering costs for managing all that complexity) may make a 5 ms improvement not worth it. Besides, the time between writing data and that data being available for reading will likely increase (i.e. performance in that area will decrease).

Composite operations are now a much bigger problem. Consider a simple scenario where you read a value and write a value that depends on the value you read (e.g. you increase it by 1). That's not too difficult without CQRS, you just perform both operations inside a transaction to make sure they happen atomically, which behind the scenes gets a lock on the database so you can perform both operations without worrying about concurrency. With CQRS that moderately simple action now means acquiring a lock across two separate databases.

Debugging and traceability also become a lot less fun (and it's not like they were fun to begin with). CQRS means at least two distributed components that scale independently. So anything that requires you to debug both reads and writes will need good distributed tracing.

A higher AWS bill. CQRS is just less optimal, resource-wise. You need more nodes to keep reading and writing separate, plus additional processing for synchronization, redundant data stores, and more infrastructure to run all your error handling code.

When should I use CQRS?

So, yeah, CQRS is pretty damn complex. It's great when you need reads and writes to scale separately and to be managed separately, but other than that it's rarely worth the complexity. Having multiple read models is usually not reason enough to go for CQRS; most of the time you can solve that with a separate analytics database.

A key tip is that you don't need to implement CQRS for every data model you have. If you find a specific part of your application (a domain, bounded context, data model, microservice, etc) where you're confident CQRS will be worth it, then make sure to separate that from the rest of the data, and implement CQRS only there. No need to make everything else more complex than it needs to be.

Conclusion: Applying Patterns in the Real World

Now you can say you know Event Sourcing, the Strangler Fig Pattern and CQRS. Or at least you know what they are. However, the critical thing to know is that in most situations you should not apply any of these patterns. They don't provide general benefits! They solve specific problems. Only use them if you have those problems.

Examples of problems they solve:

  1. Strangler Fig is very useful to gradually migrate a monolith to microservices.

  2. Event Sourcing will allow you to get a complete audit trail for critical business processes.

  3. CQRS can separate reading and writing the same data as two separate features deployed and scaled independently, and will let you optimize for multiple conflicting read patterns.

But they will always increase complexity, and they will increase costs significantly (especially engineering costs, for dealing with that complexity).

Think of these patterns as a boat: Boats are very useful, but only if you live near a navigable body of water (lake, river, sea). And it won't solve everything, it will only allow you to navigate on that body of water, but you'll still need a car. And it will cost you, always, inevitably. Especially in maintenance costs. So, would you buy a boat?
