Cluster Manager

The Cluster Manager runs on every node of the cluster and orchestrates cluster-wide operations.

The Cluster Manager is responsible for the following operations:
  • Cluster topology and node membership
    • Managing node membership, adding and removing nodes
    • Discovery of cluster topology by internal and external connections
    • Service layout for data, index, and query services across nodes
    • Rebalancing the load as cluster topology changes
    • Node health, failure, and service monitoring
  • Data placement
    • Smart distribution of primary and secondary replicas with node and rack failure-domain awareness for the best failure protection
  • Central statistics and logging
    • Operational statistics gathering and aggregation to cluster level statistics
    • Logging services for cluster supportability
  • Authentication
    • Authentication of connections to the cluster
Figure 1. Cluster Manager Architecture

The Cluster Manager consists of the following modules to perform the tasks above:
  • REST API and Auth modules: Cluster Manager communication and authentication happen through the REST API and Auth modules. All administrative operations performed through the CLI tools or the Admin Portal are executed through the admin REST API (see the sketch after this list).
  • Master Services module manages global, cluster-level operations such as master and replica vBucket placement, auto-failover, and rebalance.
  • Bucket Services module manages bucket-level operations such as establishing or handing off replication for replica maintenance, or bucket-level stats collection.
  • Per-node Services module manages node health as well as process/service monitoring and restart.
  • Cluster Manager generic local and distributed facilities manage local and distributed configuration, cluster-wide logging, and more.
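
As a minimal sketch of how the admin REST API can be used to inspect topology, the Python snippet below reads the cluster configuration from one node. The host, port, and credentials are placeholders; /pools/default is the standard endpoint that reports node membership and service layout.

  import requests

  # Read cluster topology from the admin REST API (port 8091 by default).
  # Host and credentials are placeholders; substitute your own.
  resp = requests.get(
      "http://127.0.0.1:8091/pools/default",
      auth=("Administrator", "password"),
      timeout=10,
  )
  resp.raise_for_status()
  cluster = resp.json()

  # Each entry in "nodes" reports its hostname, health status, and the
  # services (data, index, query, ...) that it runs.
  for node in cluster["nodes"]:
      print(node["hostname"], node["status"], node.get("services"))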

Node Membership: Adding and Removing Nodes Without Downtime

The Cluster Manager is responsible for cluster membership. When the topology of a cluster changes, the Cluster Manager walks through a set of carefully orchestrated operations to redistribute the load while keeping the existing workload running without a hiccup.

The following workflow describes the high-level operations to add a new node to the data service:
  1. The Cluster Manager ensures the new nodes inherit the cluster configuration.
  2. In order to redistribute the data to the new nodes, the Cluster Manager initiates rebalance and recalculates the vBucket map.
  3. The nodes which are to receive data initiate DCP replication streams from the existing nodes for each vBucket and begin building new copies of those vBuckets. This occurs for both active and replica vBuckets depending on the new vBucket map layout.
  4. Incrementally, as each new vBucket is populated, its data replicated, and its indexes optionally updated, an atomic switchover takes place from the old vBucket to the new vBucket.
  5. As the new vBuckets on the new nodes become active, the Cluster Manager ensures that the new vBucket map and cluster topology are communicated to all the existing nodes and clients. This process is repeated until the rebalance operation completes.

Removal of one or more nodes from the data service follows a similar process by creating new vBuckets within the remaining nodes of the cluster and transitioning them off of the nodes to be removed. When there are no more vBuckets assigned to a node, the node is removed from the cluster.

When adding or removing nodes from the indexing and query services, no data is moved and so their membership is simply added or removed from the cluster map. The client SDKs automatically begin load balancing across those services using the new cluster map.
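
As a hedged sketch of this workflow, the calls below add a node and then start a rebalance through the admin REST API. The hostnames, credentials, and service list are assumptions for illustration, and the node names passed in knownNodes (typically of the form ns_1@host) are taken from /pools/default rather than hard-coded.

  import requests

  AUTH = ("Administrator", "password")   # placeholder credentials
  BASE = "http://existing-node:8091"     # placeholder cluster entry point

  # 1. Add the new node to the cluster (data service in this example).
  requests.post(
      f"{BASE}/controller/addNode",
      auth=AUTH,
      data={
          "hostname": "new-node.example.com",
          "user": "Administrator",
          "password": "password",
          "services": "kv",
      },
      timeout=30,
  ).raise_for_status()

  # 2. Discover the full node list, then trigger the rebalance that
  #    recalculates the vBucket map and streams vBuckets to the new node.
  nodes = requests.get(f"{BASE}/pools/default", auth=AUTH, timeout=10).json()["nodes"]
  known = ",".join(n["otpNode"] for n in nodes)
  requests.post(
      f"{BASE}/controller/rebalance",
      auth=AUTH,
      data={"knownNodes": known, "ejectedNodes": ""},
      timeout=30,
  ).raise_for_status()

Removing a node follows the same pattern: its node name is listed in ejectedNodes instead, and the rebalance drains its vBuckets before the node leaves the cluster.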

Node Failure Detection

The nodes within a Couchbase Server cluster communicate with one another via a heartbeat mechanism. Traditionally, these heartbeats are sent between the cluster managers on all nodes at regular intervals. The heartbeats contain basic statistics about the node, which are used to assess its health.

For the purposes of automatic failover, one node is designated as the orchestrator, based on an election by the other nodes. The orchestrator is then responsible for keeping track of the heartbeats received from the other nodes. If automatic failover is enabled and no heartbeats have been received from a node within the automatic failover timeout period, the orchestrator automatically fails over that node. This also depends on other criteria being met, as detailed in Using Automatic Failover, such as only a single node being unhealthy.
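
A minimal sketch of enabling automatic failover and setting its timeout through the admin REST API is shown below; the host, credentials, and the 120-second timeout are placeholder values.

  import requests

  # Enable automatic failover with a 120-second timeout (placeholder values).
  requests.post(
      "http://127.0.0.1:8091/settings/autoFailover",
      auth=("Administrator", "password"),
      data={"enabled": "true", "timeout": 120},
      timeout=10,
  ).raise_for_status()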

This approach is fairly brittle and susceptible to 'false positives': a node can be identified as offline even though the services on that node (for example, Query or Data) continue to serve traffic without issue. Additionally, because the orchestrator is solely responsible for deciding which nodes to fail over, it is vulnerable to issues such as network partitions.

To mitigate these issues, node health is monitored by a node monitor that runs outside the cluster management process, rather than by the cluster manager itself. The node monitor observes the services running on the node and includes this information in its heartbeats. These heartbeats are similar to the cluster manager heartbeats in that they are exchanged between all nodes, but they contain much more detailed information. Each node then decides whether a particular node is healthy based on the heartbeats it receives from that node. A node is only considered for automatic failover if every other node reports that it is unhealthy.
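
The unanimity rule can be sketched as follows; this is purely illustrative logic under an assumed report format, not Couchbase code.

  # Illustrative only: a node becomes a failover candidate when every other
  # node's health report marks it as unhealthy.
  def failover_candidates(reports: dict[str, dict[str, bool]]) -> list[str]:
      """reports maps observer node -> {observed node: is_healthy}."""
      nodes = set(reports)
      candidates = []
      for target in nodes:
          observers = nodes - {target}
          if observers and all(not reports[o].get(target, True) for o in observers):
              candidates.append(target)
      return candidates

  # Example: node "c" is reported unhealthy by both "a" and "b".
  reports = {
      "a": {"b": True, "c": False},
      "b": {"a": True, "c": False},
      "c": {"a": True, "b": True},
  }
  print(failover_candidates(reports))  # ['c']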

Currently the following services are monitored:

  • Cluster Manager - The cluster management process' health is assessed by monitoring the basic heartbeating that the cluster manager performs. If the cluster manager is not sending out its basic heartbeats to other nodes then it is considered unhealthy.

  • Data Service - The health of the Data service is monitored by inspecting the DCP traffic coming out of the node. If the memcached process is not sending data or no-ops (internal keep-alives) along its connections then it is considered unhealthy.

This monitoring is then used to more accurately inform the decision about whether to automatically failover a node. By using specific service monitoring, the chance for 'false positives' is greatly reduced, although it is still vulnerable to an unstable network.

Smart Data Placement with Rack and Zone Awareness

Couchbase Server buckets physically contain 1024 master vBuckets and zero or more replica vBuckets. The Cluster Manager master services module governs the placement of these vBuckets to maximize availability and rebalance performance.

The module calculates the vBucket map with heuristics and recalculates it whenever the cluster topology changes. The following rules govern the vBucket map calculation:
  • Master and replica vBuckets are placed on separate nodes to protect against node failures.
  • If a bucket is configured with more than one replica, each additional replica vBucket is placed on a separate node to provide better protection against node failures.
  • If server groups are defined (the rack and zone awareness capability), replica vBuckets are placed in a separate server group from their master vBuckets for better protection against rack or availability-zone failures.
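
To make the vBucket map concrete, the sketch below shows the general idea of how a client SDK maps a key to a vBucket and then to a node. It is a simplified illustration: the exact CRC32 bit manipulation and the map format vary by SDK and server version, and the three-node map here is invented for the example.

  import zlib

  NUM_VBUCKETS = 1024

  def vbucket_id(key: bytes) -> int:
      # Simplified: hash the key with CRC32 and reduce it to one of the 1024
      # vBuckets. Real SDKs use an equivalent CRC32-based mapping.
      crc = zlib.crc32(key) & 0xFFFFFFFF
      return (crc >> 16) & (NUM_VBUCKETS - 1)

  # A toy vBucket map: for each vBucket, the first entry is the index of the
  # node holding the master copy and the rest are replica node indexes.
  servers = ["node1:11210", "node2:11210", "node3:11210"]
  vbucket_map = [[i % 3, (i + 1) % 3] for i in range(NUM_VBUCKETS)]

  vb = vbucket_id(b"customer::1234")
  master, replica = vbucket_map[vb]
  print(f"key -> vBucket {vb}, master on {servers[master]}, replica on {servers[replica]}")

Note that each row of the toy map keeps the master and replica on different nodes, which is what the placement rules above guarantee at cluster scale.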

Centralized Management, Statistics, and Logging

The Cluster Manager simplifies administration by centralizing configuration management, statistics gathering, and logging services. All configuration changes are managed by the orchestrator and pushed out to the other nodes to avoid configuration conflicts.

To help you understand what your cluster is doing and how it is performing, Couchbase Server incorporates a complete set of statistical and monitoring information. The statistics are accessible through all the administration interfaces: the CLI (cbstats tool), the REST API, and the Couchbase Web Console.
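
As a hedged example of the REST interface, the snippet below pulls recent per-bucket statistics samples; the bucket name, host, and credentials are placeholders.

  import requests

  # Fetch recent statistics samples for a bucket (placeholder name "travel-sample").
  resp = requests.get(
      "http://127.0.0.1:8091/pools/default/buckets/travel-sample/stats",
      auth=("Administrator", "password"),
      timeout=10,
  )
  resp.raise_for_status()
  samples = resp.json()["op"]["samples"]

  # Each stat is a short time series of recent samples; print the latest ops/sec.
  print("ops:", samples["ops"][-1])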

The Couchbase Web Console provides a complete suite of statistics, including built-in real-time graphing and performance data. It gives administrators great flexibility: statistics can be aggregated per bucket and viewed for the whole cluster or per node.

The statistics information is grouped into the following categories, allowing you to identify different states and performance information within the cluster:
  • Statistics on hardware resources: Node statistics show CPU, RAM, and I/O numbers on each of the servers and across your cluster as a whole. This information is useful for identifying performance and loading issues on a single server.
  • Statistics on vBuckets: The vBucket statistics show the usage and performance numbers for the vBuckets. This is useful for determining whether you need to reconfigure your buckets or add servers to improve performance.
  • Statistics on views and indexes: View statistics display information about individual views in your system, such as the number of reads from the index or view and its disk usage, so that you can monitor the effects and loading of a view on the Couchbase nodes. This information can indicate that your views need optimization, or that you need to consider defining views across multiple design documents.
  • Statistics on replication (DCP, TAP, and XDCR): The Database Change Protocol (DCP) interface is used to monitor changes and updates to the database. DCP is widely used internally to replicate data between nodes, for backups with cbbackup, to maintain views and indexes, and to integrate with external products through connectors such as the Elasticsearch, Kafka, and Sqoop connectors. XDCR replicates data between clusters and uses DCP in conjunction with an agent tuned to replicate data under higher WAN latencies.

TAP is similar to DCP but is a deprecated protocol. Legacy tools may still use it, and its statistics remain available through the console.

Given the central role of replication in a distributed system like Couchbase Server, visibility into replication statistics is critical. Replication statistics help visualize the health of replication and expose bottlenecks by showing replication latency and the number of items pending in replication streams.
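
As a hedged illustration, the snippet below reuses the bucket statistics endpoint from earlier and filters for DCP-related series; exact stat names vary by server version, so they are discovered from the response rather than hard-coded.

  import requests

  resp = requests.get(
      "http://127.0.0.1:8091/pools/default/buckets/travel-sample/stats",
      auth=("Administrator", "password"),
      timeout=10,
  )
  resp.raise_for_status()
  samples = resp.json()["op"]["samples"]

  # Print the latest value of every DCP-related series the server reports,
  # for example items remaining in replication streams.
  for name, series in sorted(samples.items()):
      if "dcp" in name and series:
          print(f"{name}: {series[-1]}")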