Failure Considerations
Edit this article in GitHub
Version 2.4

Failure Considerations

Networks and nodes may fail from time to time. Often, Couchbase will automatically handle the failure transparently and not require any user intervention. In some cases however, Couchbase may be unable to automatically handle the failure condition, which will result in errors reported to the application. This page discusses how to develop Couchbase applications to function in times of node or network failure.

Failover is the means by which a node is removed from the cluster and all its owned vBuckets are immediately:
  • Transferred to a replica node (if there are replica nodes)
  • Marked offline (if there are no replica nodes).

What to expect in an application during failure

When a node fails (as opposed to fails over) the client library will be unable to contact the node. Operations may either time out (if the node has become unresponsive; for example if the OS kernel has crashed) or fail immediately with a network error (if the node is actively rejecting connections; for example if the Couchbase service was shut down).

Your life can be made much simpler by simply receiving a code of SERVER_IS_DOWN the moment a node became unavailable. This, however, is usually not the case. The most common type of failure is an unresponsive node or broken network - which results in often-cryptic Timeout errors (or Queue Full). Unfortunately timeout errors can also be received when a network (or server) is genuinely slow (as opposed to crashed).

The client library will generally not retry timed-out operations: by definition, a timed out operation is an operation which could not be completed within a given amount of time. Likewise the application should not retry a timed-out operation. The recommended action is to report the failure up the stack.

What to expect during Failover

Couchbase offers a feature called Failover which allows an offending node to be removed from the cluster, immediately transferring all its vBucket resources (as opposed to rebalance, which transfers these resources incrementally). Failover can either be done manually, or automatically (more on this later).

After the offending node has been failed over the client will detect the node down within a matter of seconds. When the SDK detects a node has been failed over, it will remove it internally from its list of cluster nodes.

Further operations will either succeed or fail depending on the status of the cluster after the failover:
  • If there are remaining replicas the operations will succeed as normal - just as before the server failed
  • If there are no remaining replicas, operations destined for the failed node will now fail with a "NO NODE AVAILABLE" error. This error will continue to be returned for operations belonging to the relevant vBuckets until either:
    • The failed node is re-added to the cluster
    • The cluster is rebalanced (and thus any data on the old node is lost)

Handling missing nodes

A Missing Node error is reported when the cluster is in a degraded state. One of the cluster nodes has been failed over and no replica exists to take its place. As a result, a portion of the cluster’s dataset becomes unavailable.

Recovering from this scenario is possible only through administrative action. The administrator must either re-add the node or rebalance the cluster in order to make the rest of the cluster’s data accessible again.

The SDK will periodically check to see if the cluster is no longer in a degraded state; thus even when the cluster’s state is corrected it may still take some time for this information to be propagated across the clients.

Missing nodes may also affect client bootstrap, as the IP or hostname present in the connection parameters may no longer be up or part of the cluster. You may specify multiple hostnames or IP addresses in the connection parameters. When multiple hosts are supplied, the SDK will attempt to connect to each host in the list until a bootstrap attempt has succeeded, or until a timeout has been reached.

Timeouts

Timeouts are returned by the client when an operation is waiting too long for an acknowledgment from the server. The length of time a client waits for an a response from the server is configurable (see your language’s Couchbase SDK reference for how to configure it).

Timeouts are caused by an unresponsive network or software system. Possible causes are:
  1. Congested network.
  2. Heavy CPU load on the server.
  3. Heavy CPU load on the client.
  4. Physical disruption between client and server
Applications should generally not retry an operation if it resulted in a timeout before performing corrective action and/or analysis to determine the cause of these timeouts.
If your application cannot simply return an error but must ensure that the operation completes, it is suggested that you wait a while until retrying (similar to the mechanism used when receiving a temporary failure).
Warning: Timeout does not mean the operation failed.

It is important to note that receipt of a timeout error does not mean the operation did not complete. It means the server did not respond in a timely fashion. It is possible that the operation completed on the server, but the acknowledgment response was not received by the client. If the CAS of the item is known, you may use it to guard against performing the same operation twice.

Network Failures

Network errors are returned by the client if it cannot establish a network connection to the server, or if an existing network connection was terminated. The cause of a network error may be included in the error object itself (for example, couldn’t resolve host name, connection refused, no route to host).

Like timeout errors, network errors may be transient (indicative of a bad network connection). They may also be a result of a node being failed over.

Applications may retry operations which failed with network errors after waiting for a while (as in timeout errors]. Retrying operations too soon may result in creating massive amounts of TIME_WAIT connections on the client, and in extreme cases, even cause system crashes or rendering it inaccessible via the network.
Warning: Network error does not indicate the failure of the operation.

As in timeout errors, it is not possible to determine if an operation was actually completed if it resulted in a network error.

For cases where it is of the utmost importance to retrieve the item, a read from a replica can be performed. This will retrieve a potentially stale item.

Reading from replicas

High-availability applications can read documents from replicas, exchanging consistency for availability.

If your bucket is configured for replication, then multiple replicas of each item exist within the cluster. By default the client will attempt to access an item using its computed master or active node. This returns the current and authoritative version of the item as it is stored within Couchbase.

In conditions where access to the active node is unavailable (for example, it is disconnected from the network), an application may be able to access a replica version of the item using the get-from-replica operation which queries a replica node for a copy of the item.
Note: The item received from a replica node may be an older version. It is possible a newer version exists in the active node, but did not manage to get replicated before the active node went offline.
function getOrGetReplica(key, callback) {
    bucket.get(key, function(err, res) {
        if (err) {
            bucket.getReplica(key, callback);
            return;
        }
        callback(null, res);
  });
}