Disaster Recovery in the eyes of Virtual Consensus

Let’s explore the idea of recovering from a disaster in a Virtual Consensus-shaped implementation of a consensus protocol. First, we’ll dive into what Virtual Consensus brings to the table - what’s different when a consensus protocol is implemented using the Virtual Consensus approach. Then, we’ll challenge certain characteristics of Virtual Consensus and see why they are important. Lastly, we’ll see in which scenarios implementing the consensus protocol with the Virtual Consensus approach helps with Disaster Recovery.

In this article, a disaster is any failure (node, network, disk, etc.) that prevents a system from providing its services - we consider the system unavailable. Disaster Recovery is the process of bringing it back. One example of a disaster would be a majority of nodes being down in a Raft-based system.

Virtual Consensus

The idea of Virtual Consensus originated from Delos, built by Facebook. I'm going to cover Virtual Consensus only at a high level. Jack Vanlightly did a great job explaining the paper in more detail - definitely worth reading before continuing if you are not already familiar with the topic.

Let’s rewind to the beginning; we start with the shared log. A lot of replicated systems are, underneath, just an append-only log: you append a command, read the log back in order, and apply each command to your state machine. Same commands, same order, same state, everywhere.
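To make the "same commands, same order, same state" idea concrete, here is a minimal sketch. The `SharedLog` and `KVReplica` classes are hypothetical toys (a single in-memory list, no replication), not anything from Delos:

```python
from dataclasses import dataclass, field


@dataclass
class SharedLog:
    """A toy in-memory append-only log (hypothetical, for illustration only)."""
    entries: list = field(default_factory=list)

    def append(self, command):
        self.entries.append(command)
        return len(self.entries) - 1  # position of the new entry

    def read_from(self, position):
        return self.entries[position:]


class KVReplica:
    """A replica that builds its state by replaying the log in order."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # next log position to apply

    def catch_up(self, log):
        for op, key, value in log.read_from(self.applied):
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.applied += 1


log = SharedLog()
log.append(("set", "x", 1))
log.append(("set", "y", 2))
log.append(("delete", "x", None))

replica_a, replica_b = KVReplica(), KVReplica()
replica_a.catch_up(log)
replica_b.catch_up(log)
```

Both replicas end up with identical state because they applied the same commands in the same order.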

Virtual Consensus splits that log into two layers.

The bottom layer is the Loglet - the Data Plane. The Loglet durably stores commands in a total order on disk. The only thing it needs to implement is replication. It does not need leader elections or cluster membership changes. It barely needs to be fault-tolerant. However, it must implement the seal - stop appending commands once told to. Also, it must provide a way to read the tail of its log.
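A rough sketch of how small the Loglet contract is. The method names (`append`, `seal`, `check_tail`) are my own choice for illustration, and the in-memory list stands in for whatever replication the real Loglet would do:

```python
class SealedError(Exception):
    """Raised when appending to a sealed Loglet."""


class InMemoryLoglet:
    """Toy single-process Loglet: ordered storage, a seal, and a tail check."""

    def __init__(self):
        self._entries = []
        self._sealed = False

    def append(self, command):
        if self._sealed:
            raise SealedError("loglet is sealed")
        self._entries.append(command)
        return len(self._entries) - 1  # position inside this Loglet

    def seal(self):
        # Idempotent: sealing twice is harmless.
        self._sealed = True

    def check_tail(self):
        """Return (next free position, whether the Loglet is sealed)."""
        return len(self._entries), self._sealed


loglet = InMemoryLoglet()
loglet.append("cmd-0")
loglet.append("cmd-1")
loglet.seal()

try:
    loglet.append("cmd-2")
    rejected = False
except SealedError:
    rejected = True
```

Notice what is missing: no leader election, no membership change. Once sealed, the tail is final.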

The top layer is the VirtualLog - the Control Plane. It chains a sequence of Loglets into one continuous logical log: positions [0, 100) on Loglet A, [100, 250) on Loglet B, [250, ∞) on Loglet C. To a client, the log doesn’t appear to be split into different Loglets. The chain map records which positions belong to which Loglet. It is stored in a small, separate, genuinely fault-tolerant register called the MetaStore: a single versioned value with a compare-and-swap. This is the only place real consensus lives (one could use Raft to implement the MetaStore).
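Here is a minimal sketch of the chain lookup and the versioned register, using the A/B/C example from above. `Segment`, `Chain`, and `MetaStore` are illustrative names, and the `MetaStore` here is a plain in-process object, so the consensus that would make it fault-tolerant is assumed away:

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int      # first virtual position served by this Loglet
    loglet_id: str


class Chain:
    def __init__(self, segments):
        self.segments = sorted(segments, key=lambda s: s.start)
        self._starts = [s.start for s in self.segments]

    def locate(self, position):
        """Map a virtual position to (loglet_id, offset inside that Loglet)."""
        if not self.segments or position < self._starts[0]:
            raise ValueError(f"position {position} is not covered by the chain")
        index = bisect.bisect_right(self._starts, position) - 1
        segment = self.segments[index]
        return segment.loglet_id, position - segment.start


class MetaStore:
    """A single versioned value with compare-and-swap."""

    def __init__(self, value):
        self.version = 0
        self.value = value

    def get(self):
        return self.version, self.value

    def compare_and_swap(self, expected_version, new_value):
        if expected_version != self.version:
            return False  # somebody else changed the chain first
        self.value = new_value
        self.version += 1
        return True


chain = Chain([Segment(0, "A"), Segment(100, "B"), Segment(250, "C")])
store = MetaStore(chain)
```

For example, `chain.locate(100)` lands on the first entry of Loglet B, and a compare-and-swap with a stale version is rejected.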

Appends do not go through the VirtualLog. Clients append directly to the active Loglet. The VirtualLog sits off the critical path and is consulted only when the chain needs to change. We change the chain when we want to reconfigure Loglets (due to maintenance or, depending on the replication protocol, even when the Loglet leader changes). Reconfiguration is done in three steps:

Seal the active Loglet, so it does not accept new appends
Install a new chain in the MetaStore with compare-and-swap
Fetch the new chain on the clients
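The three steps above can be sketched as follows. Everything here is a simplified, single-process assumption: `Loglet`, `MetaStore`, and `reconfigure` are hypothetical names, and the chain is a plain list of (start position, Loglet name) pairs:

```python
class Loglet:
    def __init__(self, name):
        self.name, self.entries, self.sealed = name, [], False

    def append(self, command):
        if self.sealed:
            raise RuntimeError(f"{self.name} is sealed")
        self.entries.append(command)

    def seal(self):
        self.sealed = True
        return len(self.entries)  # the tail is final from now on


class MetaStore:
    def __init__(self, chain):
        self.version, self.chain = 0, chain

    def read(self):
        return self.version, list(self.chain)

    def compare_and_swap(self, expected_version, new_chain):
        if expected_version != self.version:
            return False
        self.version, self.chain = self.version + 1, new_chain
        return True


def reconfigure(metastore, loglets, new_loglet):
    version, chain = metastore.read()
    start, active_name = chain[-1]
    # Step 1: seal the active Loglet so its tail can no longer move.
    tail = loglets[active_name].seal()
    # Step 2: install the new chain; the CAS fails if someone else won the race.
    new_chain = chain + [(start + tail, new_loglet.name)]
    installed = metastore.compare_and_swap(version, new_chain)
    if installed:
        loglets[new_loglet.name] = new_loglet
    return installed


loglets = {"A": Loglet("A")}
store = MetaStore([(0, "A")])
for i in range(3):
    loglets["A"].append(f"cmd-{i}")

ok = reconfigure(store, loglets, Loglet("B"))
# Step 3: clients fetch the new chain and append to the new active Loglet.
version, chain = store.read()
loglets[chain[-1][1]].append("cmd-3")
```

The compare-and-swap is what keeps two concurrent reconfigurations from both succeeding: the loser sees a version mismatch and has to re-read the chain.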

Because reconfiguration can install any new Loglet, leadership and failover are no longer the Loglet’s problem. When a Loglet's leader dies, you don't elect a new one inside it - you seal the dead one and install a fresh one. Leader election becomes a configuration change. And the seal, which at first looks like a mere "stop" button, is actually the recovery primitive: it's the operation that lets you safely walk away from a Loglet that's in trouble. Keep that in mind - it's the hero of the data-plane story.

One disaster becomes two contained ones

The first thing the split buys you is a blast radius cut in two. The data lives in the Loglets. The map lives in the MetaStore. And appends bypass the control plane entirely. Those three facts mean a disaster in one plane is not a disaster in the other.

A control-plane outage is not a data outage. If the MetaStore loses its majority, a client with a cached chain keeps appending and reading on the current Loglets without noticing - the data path does not involve the control plane. All we lose is the ability to reconfigure. The system degrades to a frozen configuration, not an outage. However, if we make the Loglet’s replication protocol depend on the VirtualLog to elect its leader, we are one step closer to disaster: with the control plane down, a leader failure in the Loglet makes the whole system unavailable.
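A small sketch of the frozen-configuration behavior, assuming a client that caches the chain. `ToyMetaStore`, `Client`, and the `available` flag are made up for illustration, and Loglets are reduced to plain lists:

```python
class MetaStoreUnavailable(Exception):
    """Raised when the (toy) MetaStore has lost its majority."""


class ToyMetaStore:
    def __init__(self, chain):
        self.chain = chain
        self.available = True

    def read_chain(self):
        if not self.available:
            raise MetaStoreUnavailable("no majority")
        return list(self.chain)


class Client:
    def __init__(self, metastore, loglets):
        self.metastore = metastore
        self.loglets = loglets  # Loglet id -> list of commands
        self.cached_chain = metastore.read_chain()

    def append(self, command):
        # Data path: only the cached chain and the active Loglet are touched.
        _start, active = self.cached_chain[-1]
        self.loglets[active].append(command)

    def refresh_chain(self):
        # Control path: the only call that needs the MetaStore.
        try:
            self.cached_chain = self.metastore.read_chain()
            return True
        except MetaStoreUnavailable:
            return False


loglets = {"A": [], "B": []}
store = ToyMetaStore([(0, "A"), (100, "B")])
client = Client(store, loglets)

store.available = False   # control-plane outage
client.append("cmd-1")    # appends still succeed
client.append("cmd-2")
refreshed = client.refresh_chain()
```

Appends keep landing on the active Loglet while the refresh fails, which is exactly the "frozen configuration" state.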

And a Loglet failure is not a whole-system failure. You seal the broken Loglet, reconfigure around it, and the map and every other Loglet are untouched. The damage stays inside one segment. Remember that the seal is quite important here - if we are unable to seal, we are in a disaster scenario.

Recovering the Data Plane

I want to focus here on quorum-based replication protocols. In a quorum-based replication protocol, losing a minority of nodes is not a big deal, and Virtual Consensus helps here a lot - we rely on the Control Plane to seal the current Loglet and start a new one with a new leader. The real disaster happens when we cannot seal - we've lost a majority of nodes. In other words, Data Plane recovery is the art of staying sealable.

With ordinary majority quorums, there's a hard symmetry: you need a majority for every operation that requires a quorum (the seal quorum and the commit quorum in our case). When a majority is lost, you can perform neither of them (no commit, no seal), and once a Loglet cannot be sealed, the whole system stalls - you can neither append to it nor read its tail, and you can't fence it against returning servers that might fork the log.

Can we do something about this symmetry? As a matter of fact, we can. The tool is Flexible Paxos. Flexible Paxos observes that the quorums used by different operations do not each have to be a majority - they only have to overlap. In practice, this means that the sum of the two quorum sizes must be greater than the total number of servers in the cluster. On three nodes, all we need is commit + seal > 3.
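The overlap rule is easy to check by brute force. The sketch below (function names are mine) enumerates every possible commit quorum and seal quorum of the given sizes and verifies that they always share a node exactly when commit + seal > n:

```python
from itertools import combinations


def quorums_always_intersect(n, commit, seal):
    """True if every commit quorum shares at least one node with every seal quorum."""
    nodes = range(n)
    return all(
        set(commit_quorum) & set(seal_quorum)
        for commit_quorum in combinations(nodes, commit)
        for seal_quorum in combinations(nodes, seal)
    )


def valid_pairs(n):
    """All (commit, seal) quorum sizes that satisfy commit + seal > n."""
    return [
        (commit, seal)
        for commit in range(1, n + 1)
        for seal in range(1, n + 1)
        if commit + seal > n
    ]


pairs = valid_pairs(3)
rule_matches_brute_force = all(
    quorums_always_intersect(3, commit, seal) == (commit + seal > 3)
    for commit in range(1, 4)
    for seal in range(1, 4)
)
```

On three nodes, the familiar majority setup is the pair (2, 2). The interesting corner is (3, 1): commit on every node, seal on any single node.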

Now pick commit = 3 and seal = 1, and watch what happens to the data. Because a commit requires every node, a committed entry lives on every node. So any single survivor holds the entire committed prefix. The Loglet is sealable, and its boundary recoverable, as long as a single node is alive. The quorum requirement didn't vanish; it moved - from "majority lost" all the way out to "everything lost." The exact disaster that was hopeless a moment ago is now survivable, right up until the final node dies. In a scenario in which all nodes fail, the only thing we can do is restore a backup (hoping we took regular ones :)).

The cost we are paying is that a Loglet must have all nodes up and running to continue serving appends. The Loglet becomes fragile to losing any node, but it's almost always sealable. This is where Virtual Consensus shines. The Loglet was never the source of availability; the Control Plane is, through reconfiguration. So if anything goes wrong, you let the Loglet fail, seal it, and install a fresh Loglet.
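A toy model of this trade-off, assuming a commit quorum of all nodes and a seal quorum of one. `Node` and `WriteAllLoglet` are hypothetical, and appends are atomic here, so the sketch ignores entries that reached only some nodes before a crash:

```python
class Node:
    def __init__(self):
        self.entries, self.sealed, self.alive = [], False, True


class WriteAllLoglet:
    """Commit quorum = every node, seal quorum = any single node."""

    def __init__(self, n=3):
        self.nodes = [Node() for _ in range(n)]

    def append(self, command):
        # A commit needs every node to be alive and unsealed.
        if not all(node.alive and not node.sealed for node in self.nodes):
            return False
        for node in self.nodes:
            node.entries.append(command)
        return True

    def seal(self):
        survivors = [node for node in self.nodes if node.alive]
        if not survivors:
            raise RuntimeError("no node left to seal; restore from a backup")
        for node in survivors:
            node.sealed = True
        # Any survivor holds the whole committed prefix, so its length is the tail.
        return len(survivors[0].entries)


loglet = WriteAllLoglet(n=3)
loglet.append("cmd-0")
loglet.append("cmd-1")

loglet.nodes[0].alive = False   # majority lost
loglet.nodes[1].alive = False
append_during_outage = loglet.append("cmd-2")
tail = loglet.seal()            # one survivor is enough

loglet.nodes[0].alive = True    # the lost nodes return...
loglet.nodes[1].alive = True
append_after_return = loglet.append("cmd-3")  # ...but cannot fork the log
```

The returning nodes were never sealed themselves, yet they cannot commit anything: every commit needs the sealed survivor, which is the quorum overlap doing its job.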

Recovering the Control Plane

Nothing sits above the Control Plane to fence it, so unlike a Loglet it can't be reconfigured - it has to recover itself. Although it sounds scary, it doesn't have to be. The data the Control Plane holds is small - it consists merely of its own configuration and that of the Loglets. There are two approaches we could take here.

The first one is simply to spread the Control Plane nodes geographically. The Control Plane doesn't have to be fast, so spreading it across 5 (or 7) regions will not affect our Data Plane or the speed of appends (the most important benefit of Virtual Consensus). This puts the Control Plane in more independent failure domains, so a single regional failure is less likely to take out a majority. However, in certain deployments, this may not be a viable option.

The second approach would be to back up the state of the Control Plane on each reconfiguration. Reconfigurations aren't frequent anyway, so the overhead should be minimal. Now, if a majority of Control Plane nodes fail, we must force a new cluster (etcd has --force-new-cluster, for example) and recover from a backup.

Let's think more about this restore procedure in a scenario in which the majority of nodes in the Control Plane failed, and Loglets are available. Even if we don't have a valid backup, we could restore the Control Plane from the state of Loglets! If we traverse through the chain of loglets and ask for the tail (remember that any node in any Loglet has up-to-date information on the tail due to our Flexible Paxos (mis)use), we could restore the chain completely! So, we need at least one node from each Loglet to be able to restore the state of the Control Plane. Nice!
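A sketch of that reconstruction, under one assumption the text above leaves open: we can tell in which order the Loglets were chained. Here I assume each Loglet carries a hypothetical sequence number `seq` assigned when it was created. The tails match the earlier A/B/C example:

```python
def rebuild_chain(reports):
    """Rebuild the chain from (seq, loglet_id, tail) reports.

    `seq` is an assumed per-Loglet creation counter; `tail` is the number of
    entries a surviving node of that Loglet reports.
    """
    chain, start = [], 0
    for _seq, loglet_id, tail in sorted(reports):
        chain.append((start, loglet_id))
        start += tail  # the next Loglet begins where this one ends
    return chain


# One surviving node per Loglet answers the tail query, in any order.
reports = [
    (2, "C", 40),    # active Loglet, 40 entries so far
    (0, "A", 100),   # sealed at 100 entries
    (1, "B", 150),   # sealed at 150 entries
]
restored = rebuild_chain(reports)
```

The restored chain puts A at position 0, B at 100, and C at 250. The tail of the last Loglet is not needed for the map itself, only to know where the log currently ends.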

Recap

In a convergent (monolithic) distributed consensus system, the same nodes hold both the Control and Data parts of it. Failure is centralized and affects both the Control and Data planes, which more often than not requires manual intervention. In a disaggregated system, we split the responsibility into different roles - Control and Data. If responsibilities are split correctly, recovery from a loss may be easier than with convergent systems.

Virtual Consensus does not abandon the quorum; nothing can. It splits the problem into recovery of the Data and Control planes. With flexible quorums on the Data Plane, geo-distribution and backups on the Control Plane, and cross-plane reconciliation, we have more options for handling a disaster.

Milan's event thoughts