Monitoring a rebalance operation

Monitoring a rebalance operation

Monitoring of the system during and immediately after rebalancing is needed until replication is completed successfully.

As Couchbase Server moves vBuckets within the cluster, Couchbase Web Console provides a detailed rebalancing report. You can view the same statistics via a REST API call. If you click on the drop-down list next to each node, you can view the detailed rebalance status:

The section Data being transferred out shows that a node sends data to other nodes during rebalance.

The section Data being transferred in shows that a node receives data from other nodes during rebalance.

A node can be a source, a destination, or both the source and the destination for data. The progress report displays the following information:

  • Bucket : Name of bucket undergoing rebalance. Number of buckets transferred during rebalancing out of total buckets in a cluster.
  • Total number of keys : Total number of keys to be transferred during rebalancing.
  • Estimated number of keys : Number of keys transferred during rebalancing.
  • Number of Active# vBuckets and Replica# vBuckets : Number of active vBuckets and replica vBuckets to be transferred as part of rebalancing.

You can also use cbstats to see underlying rebalance statistics.

Backfilling

The first stage of replication reads all data for a given active vBucket and sends it to the server that is responsible for the replica. This can put increased load on the disk as well as network bandwidth, but it is not designed to impact any client activity. You can monitor the progress of this task by watching for ongoing TAP disk fetches. You can also watch cbstats tap , for example:

cbstats <node_IP>:11210 -b bucket_name -p bucket_password tap | grep backfill

This returns a list of TAP backfill processes and whether they are still running ( true ) or done ( false ). During the backfill process for a particular TAP stream, the output is as follows:

eq_tapq:replication_building_485_'n_1@127.0.0.1':backfill_completed: false eq_tapq:replication_building_485_'n_1@127.0.0.1':backfill_start_timestamp: 1371675343 eq_tapq:replication_building_485_'n_1@127.0.0.1':flags: 85 (ack,backfill,vblist,checkpoints) eq_tapq:replication_building_485_'n_1@127.0.0.1':pending_backfill: true eq_tapq:replication_building_485_'n_1@127.0.0.1':pending_disk_backfill: true eq_tapq:replication_building_485_'n_1@127.0.0.1':queue_backfillremaining: 202

When all have completed, you should see the Total Item count ( curr_items_tot ) be equal to the number of active items multiplied by replica count. The output you see for a TAP stream after backfill completes is as follows:

eq_tapq:replication_building_485_'n_1@127.0.0.1':backfill_completed: true eq_tapq:replication_building_485_'n_1@127.0.0.1':backfill_start_timestamp: 1371675343 eq_tapq:replication_building_485_'n_1@127.0.0.1':flags: 85 (ack,backfill,vblist,checkpoints) eq_tapq:replication_building_485_'n_1@127.0.0.1':pending_backfill: false eq_tapq:replication_building_485_'n_1@127.0.0.1':pending_disk_backfill: false eq_tapq:replication_building_485_'n_1@127.0.0.1':queue_backfillremaining: 0

If you are continuously adding data to the system, these values may not correspond exactly at a given instant in time. However you should be able to determine whether there is a significant difference between the two figures.

Draining

After the backfill process is complete, all nodes that had replicas materialized on them have to persist these items to disk. It is important to continue monitoring the disk write queue and memory usage until the rebalancing operation has been completed, to ensure that the cluster is able to keep up with the write load and required disk I/O.