Networks and nodes may fail from time to time. Often, Couchbase will automatically handle the failure transparently and not require any user intervention. In some cases however, Couchbase may be unable to automatically handle the failure condition, which will result in errors reported to the application. This page discusses how to develop Couchbase applications to function in times of node or network failure.
- Transferred to a replica node (if there are replica nodes)
- Marked offline (if there are no replica nodes).
What to expect in an application during failure
When a node fails (as opposed to fails over) the client library will be unable to contact the node. Operations may either time out (if the node has become unresponsive; for example if the OS kernel has crashed) or fail immediately with a network error (if the node is actively rejecting connections; for example if the Couchbase service was shut down).
Your life can be made much simpler by simply receiving a code of SERVER_IS_DOWN the moment a node became unavailable. This, however, is usually not the case. The most common type of failure is an unresponsive node or broken network - which results in often-cryptic Timeout errors (or Queue Full). Unfortunately timeout errors can also be received when a network (or server) is genuinely slow (as opposed to crashed).
The client library will generally not retry timed-out operations: by definition, a timed out operation is an operation which could not be completed within a given amount of time. Likewise the application should not retry a timed-out operation. The recommended action is to report the failure up the stack.
What to expect during Failover
Couchbase offers a feature called Failover which allows an offending node to be removed from the cluster, immediately transferring all its vBucket resources (as opposed to rebalance, which transfers these resources incrementally). Failover can either be done manually, or automatically (more on this later).
After the offending node has been failed over the client will detect the node down within a matter of seconds. When the SDK detects a node has been failed over, it will remove it internally from its list of cluster nodes.
- If there are remaining replicas the operations will succeed as normal - just as before the server failed
- If there are no remaining replicas, operations destined for the failed node will now fail with a "NO NODE AVAILABLE" error. This error will continue to be returned for operations belonging to the relevant vBuckets until either:
- The failed node is re-added to the cluster
- The cluster is rebalanced (and thus any data on the old node is lost)
Handling missing nodes
A Missing Node error is reported when the cluster is in a degraded state. One of the cluster nodes has been failed over and no replica exists to take its place. As a result, a portion of the cluster’s dataset becomes unavailable.
Recovering from this scenario is possible only through administrative action. The administrator must either re-add the node or rebalance the cluster in order to make the rest of the cluster’s data accessible again.
The SDK will periodically check to see if the cluster is no longer in a degraded state; thus even when the cluster’s state is corrected it may still take some time for this information to be propagated across the clients.
Missing nodes may also affect client bootstrap, as the IP or hostname present in the connection parameters may no longer be up or part of the cluster. You may specify multiple hostnames or IP addresses in the connection parameters. When multiple hosts are supplied, the SDK will attempt to connect to each host in the list until a bootstrap attempt has succeeded, or until a timeout has been reached.
Timeouts are returned by the client when an operation is waiting too long for an acknowledgment from the server. The length of time a client waits for an a response from the server is configurable (see your language’s Couchbase SDK reference for how to configure it).
- Congested network.
- Heavy CPU load on the server.
- Heavy CPU load on the client.
- Physical disruption between client and server
It is important to note that receipt of a timeout error does not mean the operation did not complete. It means the server did not respond in a timely fashion. It is possible that the operation completed on the server, but the acknowledgment response was not received by the client. If the CAS of the item is known, you may use it to guard against performing the same operation twice.
Network errors are returned by the client if it cannot establish a network connection to the server, or if an existing network connection was terminated. The cause of a network error may be included in the error object itself (for example, couldn’t resolve host name, connection refused, no route to host).
Like timeout errors, network errors may be transient (indicative of a bad network connection). They may also be a result of a node being failed over.
As in timeout errors, it is not possible to determine if an operation was actually completed if it resulted in a network error.
Reading from replicas
High-availability applications can read documents from replicas, exchanging consistency for availability.
If your bucket is configured for replication, then multiple replicas of each item exist within the cluster. By default the client will attempt to access an item using its computed master or active node. This returns the current and authoritative version of the item as it is stored within Couchbase.
try: result = cb.get("docid") except CouchbaseNetworkError as e: print "Got error. Fetching from replica!" result = cb.get(‘docid’, replica=True)