Using Hard Failover

Hard failover is the process of ejecting an unavailable or unstable node from a cluster immediately.

Note: To fully understand how Hard Failover works, you must first understand how Couchbase partitions data across a cluster.

Hard failover is the act of ejecting a node from a cluster when that node is unavailable or unstable (for example, in case of a server or instance failure). When hard failover is initiated, it tells the Cluster Manager to remove the node from active use, regardless of what is occurring, and then to start the procedures necessary for the other nodes in the cluster to take over for that ejected node. The Cluster Manager performs slightly different actions for each service (Data, Index, Query, Full Text Search) running on the node being failed over. We will discuss these service-specific actions later in this section.

You can perform hard failover on any node in the cluster.

Note: Do not use hard failover for regular planned maintenance. For that purpose, see Using Graceful Failover or Removing a Node.

Hard failover can be initiated in multiple ways:

  • Manually via the Couchbase Web Console
  • Using the CLI or RESTful API from a script/program/ops platform
  • Using the automatic failover functionality performed by the Couchbase Cluster Manager

Hard failover is also the failover type used by automatic failover: when the Cluster Manager determines that a node is unavailable and eligible for automatic failover, it initiates a hard failover. To read more about this, see Using Automatic Failover.
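For illustration, the following is a minimal sketch of initiating a hard failover through the Cluster Manager's REST interface using Python. The cluster address, credentials, and node hostname are placeholders; in practice you may prefer the Web Console or the couchbase-cli failover command.

    import requests

    CLUSTER = "http://cluster-node:8091"      # any healthy node in the cluster (placeholder)
    AUTH = ("Administrator", "password")      # administrator credentials (placeholder)

    # Look up the internal (otpNode) name of the node to fail over.
    nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
    failed = next(n["otpNode"] for n in nodes if n["hostname"].startswith("node4"))

    # Initiate a hard failover of that node.
    resp = requests.post(f"{CLUSTER}/controller/failOver",
                         auth=AUTH,
                         data={"otpNode": failed})
    resp.raise_for_status()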

Hard failover affects each of the services slightly differently, depending on whether the services are colocated on the same server or instance, or deployed on separate servers or instances using Multi-Dimensional Scaling.

Hard Failover Effect on Nodes Running the Data Service

If the failed-over node is running the Data service, the cluster initiates multiple operations to make a full set of active vBuckets available as soon as possible. Perform a hard failover only when there is a full complement of replica vBuckets available to promote for each bucket. For example, if you have four nodes running the Data service and one node has failed, there are 256 active and 256 replica vBuckets missing from the cluster.

Assuming the default of 1024 vBuckets per bucket and one replica, each of the four nodes holds 256 active and 256 replica vBuckets before the failure.
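As a quick illustration of that arithmetic, the following sketch computes the per-node distribution and what goes missing when one node fails, using the counts assumed above:

    ACTIVE_VBUCKETS = 1024          # vBuckets per bucket in Couchbase Server
    REPLICAS = 1                    # one replica configured for the bucket
    NODES = 4                       # Data service nodes before the failure

    active_per_node = ACTIVE_VBUCKETS // NODES                    # 256
    replica_per_node = (ACTIVE_VBUCKETS * REPLICAS) // NODES      # 256

    # With one node down, its share of both active and replica vBuckets is missing.
    print(f"Missing active vBuckets:  {active_per_node}")
    print(f"Missing replica vBuckets: {replica_per_node}")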

The moment you initiate a hard failover (using the Couchbase Web Console, CLI, REST API, or automatic failover), the Couchbase Cluster Manager communicates with the remaining nodes in the Data service to begin the process. The Cluster Manager knows the location of every active and replica vBucket across the cluster, as well as the physical topology of the cluster: it knows which vBuckets were active on the failed node and which of the remaining nodes hold their corresponding replica vBuckets.

  • The Cluster Manager directs the remaining nodes to promote the corresponding replica vBuckets to active status. There is no movement or copying of data: the replicas already exist on the remaining nodes, so it is simply a matter of flipping their status from replica to active. Promotion is therefore nearly instantaneous for the affected vBuckets.
  • Since the replica status changes represent a cluster topology change, the Cluster Manager communicates the new cluster map to the client SDKs, other clusters replicating to it via XDCR, and so on.

The entire process, from initiating the node failover to completion, is usually in the single-digit milliseconds timeframe. The timing will vary slightly depending on the number of buckets and vBuckets residing on the failed node.

When the process is complete, no vBuckets remain on the failed-over node, but the node has not yet been fully ejected from the cluster. The cluster is considered to be in a degraded state, both because a node is missing and in terms of High Availability: the corresponding replica vBuckets on the remaining nodes have been sacrificed (promoted to active) to keep 100% of the data available. Unless additional replica data sets were already configured for each bucket, the cluster could not withstand another node going down. The cluster does not auto-regenerate missing vBuckets, and once degraded it has narrower recovery options; it must also avoid introducing a cascading failure that could turn a minor problem into a severe one. Therefore, return the cluster to a stable state as soon as possible.
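One way to observe this degraded state is to check node status and cluster membership through the REST API. Below is a minimal sketch, assuming the same placeholder address and credentials as above; a node that has been failed over but not yet ejected typically reports a clusterMembership of inactiveFailed.

    import requests

    CLUSTER = "http://cluster-node:8091"      # placeholder address of a healthy node
    AUTH = ("Administrator", "password")      # placeholder credentials

    info = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()

    for node in info["nodes"]:
        # clusterMembership is "active" for healthy members and
        # "inactiveFailed" for a node that has been failed over but not yet ejected.
        print(node["hostname"], node["status"], node["clusterMembership"])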

Hard Failover Effect on Nodes Running the Index Service

When Global Secondary Indexes (GSI) are defined, each is created on only one node running the Index service unless otherwise specified. If that node fails, the indexes and index definitions on that node are unavailable. If the node is repaired and added back via Delta Node Recovery, the indexes on that node are updated and become available again. If the node is irreparable and must be replaced, the indexes need to be created again. To rebalance a new node into the cluster, you must first eject the failed node using hard failover.

Best practice is to create secondary indexes on at least two nodes running the Index service, to mitigate the impact of indexes becoming unavailable. For more information, see the documentation on Indexing.
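For example, a secondary index can be created with a replica so that a copy exists on a second Index service node. Below is a minimal sketch using the Couchbase Python SDK (4.x import paths assumed); the connection details are placeholders, the bucket and field come from the travel-sample example data set, and num_replica requires enough Index service nodes to host each copy.

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions

    cluster = Cluster("couchbase://cluster-node",
                      ClusterOptions(PasswordAuthenticator("Administrator", "password")))

    # Create a secondary index with one replica, so a copy lives on a second Index node.
    result = cluster.query(
        'CREATE INDEX idx_type ON `travel-sample`(type) WITH {"num_replica": 1}'
    )
    list(result)   # iterate the result to ensure the statement has executed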

Hard Failover Effect on Nodes Running the Query Service

The Query service is stateless, so its nodes can be added to and removed from the cluster with no consequence to data. However, any in-flight queries on the removed node(s) will error. As long as the node being removed is not the last node running the Query service, the cluster continues to function. As one might expect, reduced node capacity places a higher load on the remaining node(s), which can lead to query performance issues.
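Applications can guard against such errors by retrying the query; once the cluster map is updated, the retry is routed to a surviving Query node. Below is a minimal, hedged sketch using the Python SDK with placeholder connection details and a hypothetical statement:

    import time

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.exceptions import CouchbaseException
    from couchbase.options import ClusterOptions

    cluster = Cluster("couchbase://cluster-node",
                      ClusterOptions(PasswordAuthenticator("Administrator", "password")))

    def query_with_retry(statement, attempts=3, delay=1.0):
        """Run a N1QL statement, retrying if a Query node fails mid-request."""
        for attempt in range(1, attempts + 1):
            try:
                return list(cluster.query(statement))
            except CouchbaseException:
                if attempt == attempts:
                    raise
                time.sleep(delay)   # brief pause before retrying on a surviving node

    rows = query_with_retry('SELECT COUNT(*) AS n FROM `travel-sample`')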

Hard Failover Effect on Nodes Running the Full Text Search (FTS) Service

Full text indexes are partitioned and automatically distributed across nodes if multiple nodes are running the FTS service. When a node running the FTS service is failed over, it stops taking traffic. If there are no other nodes running FTS, all full text index building stops and full text queries fail. If there is at least one other node running FTS, other nodes continue responding to queries and return partial results, which your application may choose to accept or not, depending on your requirements.

When the administrator rebalances the cluster, the FTS service performs different operations depending on the level of redundancy you have designed for it by configuring replicas for the full text index:

  • If configured, the FTS service promotes the replicas to active on the remaining nodes of the cluster.
  • If not configured, the FTS service rebuilds the indexes on the remaining nodes of the cluster using stored index definitions.

The FTS service does not perform Delta Node Recovery.
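Full text index replicas are configured in the index definition itself (the planParams section). Below is a minimal sketch that sets one replica through the Search service's REST port; the host, credentials, index name, and source bucket are placeholders, and the definition is deliberately minimal:

    import requests

    FTS = "http://cluster-node:8094"          # Search service REST port (placeholder host)
    AUTH = ("Administrator", "password")      # placeholder credentials

    index_def = {
        "type": "fulltext-index",
        "name": "hotels-fts",                 # hypothetical index name
        "sourceType": "couchbase",
        "sourceName": "travel-sample",        # hypothetical source bucket
        "planParams": {"numReplicas": 1},     # one replica of each index partition
        "params": {},
    }

    resp = requests.put(f"{FTS}/api/index/hotels-fts", auth=AUTH, json=index_def)
    resp.raise_for_status()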

Returning the Cluster to a Stable State

If or when the failed node is repaired and ready, it can be added back to the cluster via Delta Node Recovery or Full Recovery; alternatively, an entirely new node can be added.

  • If Delta Node Recovery is an option, the Cluster Manager recognizes the node as a previous member of the cluster. The Couchbase Web Console prompts the administrator to perform Delta Node Recovery; with the CLI, the recovery either proceeds or fails with a message that a full recovery is required. (A REST sketch of this flow appears after this list.)

    When a node is added back to the cluster using Delta Node Recovery, the replica vBuckets on the failed-over node are considered trusted but behind on data. The Cluster Manager coordinates their resynchronization, catching the vBuckets up from where they left off until they are current. When synchronization is complete, each vBucket is promoted back to active status and the cluster map is updated, since this is a topology change.

  • If the node is added back using Full Recovery, it is treated as an entirely new node added to the cluster: its data must be reloaded from scratch, and a rebalance is required.
  • The other option is to add a node and rebalance the cluster.
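As a rough illustration of the recovery flow described in the first bullet above, the following sketch marks a previously failed-over node for Delta Node Recovery over the REST API and then starts a rebalance. The address, credentials, and hostname prefix are placeholders; use "full" as the recoveryType for Full Recovery.

    import requests

    CLUSTER = "http://cluster-node:8091"      # placeholder address of a healthy node
    AUTH = ("Administrator", "password")      # placeholder credentials

    nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
    known = ",".join(n["otpNode"] for n in nodes)
    recovered = next(n["otpNode"] for n in nodes if n["hostname"].startswith("node4"))

    # Mark the repaired node for Delta Node Recovery ("full" would request Full Recovery).
    requests.post(f"{CLUSTER}/controller/setRecoveryType", auth=AUTH,
                  data={"otpNode": recovered, "recoveryType": "delta"}).raise_for_status()

    # Rebalance to bring the recovered node back into active service.
    requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH,
                  data={"knownNodes": known, "ejectedNodes": ""}).raise_for_status()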

Whenever possible, return the cluster to a properly sized topology before rebalancing. If you rebalance before adding the node back in, you can no longer perform Delta Node Recovery.

A Hypothetical Scenario

Imagine you have a Couchbase bucket distributed across four Data service nodes in a cluster, and a node needs to be removed right this moment. The server operations on-call engineer calls you at 11 PM on a Friday night to say that node #4 in the cluster is down. The ops team has been unable to get the server back up for the last 10 minutes.

You have followed best practices and have auto-failover configured. For the Data service, with a four-node cluster and one replica for each bucket, there are 256 active and 256 replica vBuckets on each of the four nodes, totaling 1024 active and 1024 replica vBuckets. This example follows one specific vBucket, #762, but the process is repeated for every vBucket on the node being failed over:

  1. A hard failover is initiated (automatically or manually) to remove the node where the active vBucket 762 resides, node #4 in this example.
  2. The Cluster Manager promotes the replica vBucket 762 to active status on node #2.
    Note: After the vBucket's promotion to active status, the cluster has no replica for vBucket 762 until a rebalance or Delta Node Recovery takes place, unless more replicas are configured for this bucket.
  3. As this is a cluster topology change, the cluster map is updated so subsequent reads and writes by the Couchbase client SDKs will go to the correct location for data in vBucket 762, now node #2.

This entire process happens in fractions of a second. It is then repeated for the remaining 255 vBuckets of the bucket, one vBucket at a time. If there were more buckets, the process would proceed to the next bucket and repeat there until complete.

What is happening in the application during this process, one might ask? Until the down node is failed over (either automatically or manually) and the replica vBuckets are promoted to active, the application receives errors or timeouts for the roughly one-quarter of reads and writes destined for the now-down node. We had four nodes; now we have three. If there were ten nodes in the Data service, the application would be unable to reach one tenth of the data until failover is initiated. If the application needs to read data before failover happens, the application developer may want to use Replica Reads (see the SDK-specific documentation), which are intended for exactly such circumstances.
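Below is a minimal sketch of such a fallback with the Couchbase Python SDK; the connection details and document key are placeholders, and the exact replica-read API varies by SDK and version, so consult the SDK-specific documentation.

    from couchbase.auth import PasswordAuthenticator
    from couchbase.cluster import Cluster
    from couchbase.exceptions import CouchbaseException
    from couchbase.options import ClusterOptions

    cluster = Cluster("couchbase://cluster-node",
                      ClusterOptions(PasswordAuthenticator("Administrator", "password")))
    collection = cluster.bucket("travel-sample").default_collection()

    try:
        result = collection.get("airline_10")          # normal read from the active vBucket
    except CouchbaseException:
        # The active vBucket lives on the down (not yet failed-over) node;
        # fall back to reading whichever replica responds first.
        result = collection.get_any_replica("airline_10")

    print(result.content_as[dict])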

Why Use Hard Failover instead of Graceful Failover?

Hard failover is a reactive action to deal with an unhealthy node in the cluster; graceful failover is meant for planned maintenance. Use hard failover when an unhealthy node needs to be ejected from the cluster right away, so that 100% of the data is available again as soon as possible.

Hard failover and multiple nodes
Fail over multiple nodes at a time only when there are enough replicas across all buckets of the Data service, and enough servers remain for the cluster to continue operating.

Normally, you can fail over one node per replica configured in the bucket/cluster. For example, if you require the ability to fail over two nodes, you must configure two replicas for each bucket; failure to do so will result in data loss. Simply put, do not fail over more nodes than there are replicas configured for all buckets.

The exception to the above rule is when the Rack/Zone Awareness (RZA) feature is configured. RZA lets you designate which nodes reside in which server rack of a data center, on which VM hosts, or in which availability zones of a cloud hosting provider. It ensures that the replica vBuckets for the nodes residing in Rack A are never placed in Rack A. When using RZA, it is safe to fail over an entire rack's worth of Couchbase nodes without data loss or interruption to your application, because the other racks contain the nodes holding the replicas. For more information, see Rack Zone Awareness.
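Server groups, the mechanism behind RZA, can be defined in the Web Console, with the CLI, or through the REST API. The following is a minimal sketch of creating and listing groups over REST with placeholder names; nodes are then assigned to groups and a rebalance applies the new layout.

    import requests

    CLUSTER = "http://cluster-node:8091"      # placeholder address of a healthy node
    AUTH = ("Administrator", "password")      # placeholder credentials

    # Create a server group representing one rack / availability zone.
    requests.post(f"{CLUSTER}/pools/default/serverGroups",
                  auth=AUTH,
                  data={"name": "Rack A"}).raise_for_status()

    # List the groups and the nodes currently assigned to each.
    groups = requests.get(f"{CLUSTER}/pools/default/serverGroups", auth=AUTH).json()
    for group in groups["groups"]:
        print(group["name"], [n["hostname"] for n in group["nodes"]])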

Hard failover when the cluster has not recognized that the node is down
In rare cases, the Cluster Manager might fail to recognize that an unhealthy node is down. If this occurs and a graceful failover is not successful, a hard failover can be the answer. To initiate a hard failover for a node in this state, use the Fail Over button in the Couchbase Web Console or the CLI.

If the node’s health issue can be resolved, the node can be added back to the cluster. Delta recovery is presented as an option if the Cluster Manager detects that it is possible; otherwise, a full recovery must be used. If the issue cannot be resolved, a replacement node should be added and the cluster rebalanced. It is important to always restore the cluster to a properly sized topology before rebalancing; otherwise, you might cause additional failures as nodes become overloaded.