Failing over a Node

Failing over a Node

Failover is the process in which a node of a Couchbase cluster is removed quickly as opposed to a regular removal and rebalancing.

There are three types of failovers: graceful, hard, and automatic.

Note: To fully understand how failover works, see Distributed Data Management to learn how Couchbase partitions data and how database services are distributed across a Couchbase cluster.
Graceful failover
Graceful failover is the proactive ability to remove a Data service node from the cluster in an orderly and controlled fashion. It is an online operation with zero downtime that is achieved by promoting replica vBuckets on the remaining cluster nodes to active and the active vBuckets on the affected node to dead. This type of failover is primarily used for planned maintenance of the cluster.
Hard failover
Hard failover is the ability to drop a node quickly from the cluster when it has become unavailable or unstable. Dropping a node is achieved by promoting replica vBuckets on the remaining cluster nodes to active. Hard failover is primarily used when there is an unplanned outage of a node.
Automatic failover
Automatic failover is the built-in ability to have the Cluster Manager detect and determine when a node is unavailable and then initiate a hard failover.

Keep in mind that each of the four Couchbase services; Index, Query, Data and Full Text Search (FTS) react slightly differently due to their roles. For example, if you have the services separated on different nodes using Multi-Dimensional Scaling (MDS), you can failover and recover nodes separately. If you are mixing the services on the same nodes, it is slightly more complex, but not much. Failover in regards to all four services will be discussed in this section.

Choosing a Failover Solution when Nodes are Unhealthy

As a node failover has the potential to reduce the performance of your cluster, you should consider how best to handle a failed node situation and also size your cluster to plan for failover.

Manual or monitored failover
Performing manual failover through monitoring can take two forms, either by human monitoring or by using a system external to the Couchbase Server cluster. An external monitoring system can monitor both the cluster and the node environment so that you can make a more data-driven decision. Although automated failover has inherent issues, choosing whether to use manual or monitored failover is not without potential problems.
Human intervention
One option is to have a human operator respond to alerts and make a decision. Humans are uniquely capable of considering a wide range of data, observations, and experiences to resolve a situation in a best possible way. Many organizations disallow automated failover because they want a human to consider the implications. The drawback is that human intervention is slower than using a computer-based monitoring system.
External monitoring
Another option is to have a system monitoring the cluster via the Couchbase REST API. Such an external system can failover nodes successfully because it can take into account system components that are outside the scope of Couchbase Server.

For example, monitoring software can observe that a network switch is failing and that there is a dependency on that switch by the Couchbase cluster. The system can determine that failing Couchbase Server nodes will not help the situation and will, therefore, not failover the node. The monitoring system can also determine that components around Couchbase Server are functioning and that various nodes in the cluster are healthy.

If the monitoring system determines the problem is only with a single node and remaining nodes in the cluster can support aggregate traffic, then the system may safely failover the node using the REST API or command-line tools.

Automatic failover
With the automatic failover, the Cluster Manager handles the detection, determination of, and initiation of the processes to failover a node without user intervention and without identification of the issue that caused the node failure. Once the problem has been identified and fixed, it still requires you to initiate a rebalance to return the cluster to a healthy state.

If you do not use automatic failover and a node becomes unhealthy, it will be up to the administrators to intervene and initiate a hard failover of the node.

The Difference Between Graceful and Hard Failover

A hard failover is reactionary and usually comes when a node of the cluster is in an unhealthy or unstable state (such as when a node is down). It would be used during an unplanned outage. With a node failed, the cluster is without the full accompaniment of active vBuckets. A hard failover will eject the unhealthy node from the cluster and promote the necessary replica vBuckets on the remaining nodes to bring the active vBucket count back to 1024 for each bucket

A graceful failover, which is a coordinated synchronization and removal of the node, is a proactive action initiated from a stable state of the cluster. It might be used, for example, during a software or OS upgrade. The cluster always has the full 1024 active vBuckets for each Bucket throughout the process.