Beyond Total Ordering in Event-Sourced Systems

In Event-Sourced systems, relying on the total order of events on all Event Store nodes is comforting, and provides ease during the design and implementation. A total order of events on all nodes is usually achieved via single-leader Consensus Protocols, which, unfortunately, have a bottleneck - the leader (I explained this in more detail in my previous article).

The main reason for writing this article is to provide a solution for the potential issue of having a single leader as a bottleneck. To be clear, this is the issue that not so many systems meet. However, it is not satisfactory to leave it open, and that's why I am going to dive into a possible solution - Leaderless Consensus Protocols.

Leaderless Consensus Protocols

There is this mystical land of leaderless consensus protocols - a land where there is no bottleneck of a single leader, but each node can be the leader for an append. In other words, such a cluster can take appends via multiple Event Store nodes. One representative of Leaderless Consensus Protocols would be Egalitarian Paxos. Leaderless protocols base their logic on detecting conflicts (append conflicts in Event Store) and executing non-conflicting operations without designated order. If Event Store executes append operations in a different order, the order of events on different nodes in the cluster may differ (see Diagram 1).

Is this really an issue? For appending event transactions, it is not - if two appends do not conflict, they may be executed in a different order on different nodes.

Diagram 1

For this example, let's say that events in the Event Store cluster are bank account events. Each color represents a different bank account. In this cluster, we see that events are stored in a different order on different Event Store nodes. However, events belonging to the same account are stored in the same order on all nodes. The reason for this is that we marked events coming from the same bank account as conflicting and Leaderless Consensus Protocol took care that they are always stored in the same order on all nodes. This is not true for events assigned to different bank accounts. We may also say that instead of total ordering, the Event Store provides partial ordering.

Event Replay

Let's address the elephant in the room - Could we deterministically replay events? By deterministically replay events I mean if a client starts replaying events from one Event Store node, and during this replay, it reconnects to another node, it would continue to replay where it left off (ease and comfort of total event order). When all nodes had the same order of events, continuing the replay was easy, the client needed to remember the replay position in the stream and send it to the node it reconnected to. With partial ordering, this is not possible.

What if the client keeps a replay position for each bank account (and possibly other domain concepts)? Unfortunately, depending on the system, this could mean billions of positions, and it would not work out. It is not feasible to send billions of positions for each bank account just to continue replaying when the client reconnects. The problem with this approach is granularity, keeping progress on this level is too fine-grained.

We could proclaim all appends as conflicting and achieve total order. The problem with this approach is the performance - although we still have appends spread across multiple nodes, all of them are conflicting, and as a consequence, the latency will dramatically increase, and the throughput will drop. Again, the problem is granularity, forcing all appends to be conflicting is too coarse-grained.

The solution could be something in the middle. If we can group events into a few streams, we could exploit the benefits of Leaderless Consensus Protocols much better. The number of streams is arbitrary, and it depends on the domain. Events should be as much as possible evenly distributed across streams (in order to have an optimal number of conflicts). Events are stored in the same order inside the stream, but outside the stream, they can be stored in a different order. The client will remember the replay position per stream and when it reconnects to a different node it will send the position for each stream - Event Store can deterministically replay events.

While events are stored in a different order internally in the Event Store, to the outside world they are stored in the very same order since the event replay is deterministic.

Recap

Leaderless Consensus Protocols solve the problem of a bottleneck that single-leader Consensus Protocols have by adding the capability to each node to accept append operations. As a consequence, the throughput of the whole system is increased. However, there is a cost that needs to be paid - there is no designated execution order for non-conflicting appends. This challenge is not visible during the append phase but it is during the read (replay) phase.

Grouping events into a number of non-conflicting streams will help during the replay phase - the client remembers the replay position per stream. Determining the number of streams, and assigning events to streams are not trivial operations. Also, reconfiguring the streams once events are stored is difficult - it would require changing the order of events on different Event Store nodes.

As I mentioned in the beginning, not many systems hit the problem of having a single leader. Then and only then You should reach for other solutions such as Leaderless Consensus Protocols. Usually, these solutions come with a price that should be carefully taken into account.

Search This Blog

Milan's event thoughts