Just as monitoring Couchbase and the individual services, buckets, indexes, etc. is extremely important to have a solid understanding of overall cluster health, it is also important to monitor the operating system and various stats for each node in the cluster. Each operating system has varying means of retrieving these metrics and many monitoring solutions collect them out of the box.
OS Metric | Response |
---|---|
Free RAM | Free + cache memory should always be at least 20% of total system memory. If free + cache memory falls below 20%, scale the cluster. |
Swap usage | Swap usage should always be zero. If swap is used, it means the OS is under very high memory pressure and unable to purge dirty pages fast enough and the cluster should be scaled. |
Memcached process RAM usage | Create a baseline for this value as "normal" will be dependent upon your working set. Alert if this value exceeds 150% of baseline. This may indicate an unusual increase in write traffic, reading of typically cold data, or possible malloc fragmentation. Confirm the Couchbase resident ratios are still correct. Add memory or scale the cluster if necessary. |
Beam.smp process RAM usage | Create a baseline for this value as "normal" will be dependent upon your cluster size and API activity levels. Alert if this value exceeds 120% of baseline. This may indicate a memory leak in the beam process. Contact Couchbase Support if larger than a few gigabytes. |
IO utilization (iostat) | Create a baseline for this value as "normal" will be dependent upon your workload and available disk IO. Overall sustained IO utilization should not exceed 90% of total IO capacity. |
Total CPU utilization | Create a baseline for this value as "normal" will be dependent upon your workload. Sustained CPU utilization >90% indicates a need to scale the cluster. |
Couchbase service CPU utilization | Create a baseline for these values as "normal" will be dependent upon your workload. Alert if this value exceeds 2x of baseline. |
Beam.smp CPU utilization | Create a baseline for this value as "normal" will be dependent upon your workload. Alert if this value exceeds 2x of baseline |
%steal CPU | This value should always be zero. Anything above zero indicates the VM hypervisor is oversubscribed. Additional physical hosts should be added or collocated VMs should be migrated to other hosts. |
Network utilization | Create a baseline for this value as "normal" will be dependent upon your workload. Alert if this value exceeds 120% of baseline. If the sustained utilization is above 80% of the total available bandwidth, it indicates the need to scale the cluster. |
Presence of beam.smp process | Alert if beam.smp is not present. This indicates Couchbase is offline and needs to be restarted. |
Presence of service processes | Alert if data/index/query/fts/eventing/analytics processes are not present. This indicates Couchbase is either offline, starting up, or services may have crashed and need to be restarted. Below are the processes by service:
|
NTP clock skew | Couchbase requires all cluster nodes (and any replicated clusters) to have their system clocks synchronized to a common clock source. Monitor clock skew on each server and alert if it is more than 1 minute out of sync. |
The following Operating System stats are available via the Cluster-Wide or Per-Node Endpoints listed below.
Stat name | Description |
---|---|
allocstall | Number of allocations stalled when reclaiming |
cpu_cores_available | Number of CPU cores available in the cluster or the node |
cpu_irq_rate | The CPU interrupt request rate |
cpu_stolen_rate | CPU steal rate |
cpu_idle_ms | The amount of time the CPU has been idle |
cpu_local_ms | |
cpu_utilization_rate | Max CPU utilization % |
hibernated_requests | Idle streaming requests |
hibernated_waked | Streaming wakeups/sec |
mem_actual_free | Amount of RAM available on this server |
mem_actual_used | Amount of RAM used on this server |
mem_free | Amount of RAM available on this server |
mem_limit | The limit for RAM |
mem_total | Amount of RAM used on this server |
mem_used_sys | Amount of RAM available to the OS |
odp_report_failed | |
rest_requests | Management port reqs/sec |
swap_total | Amount of swap space available on this server |
swap_used | Amount of swap space in use on this server |
GET
Cluster System StatsThese endpoints are informational and should not be used for monitoring as they are an aggregate for the entire cluster and the best practice is to monitor each node individually.
curl \
--user Administrator:password \
--silent \
--request GET \
--data zoom=minute \
http://localhost:8091/pools/default/buckets/@query/stats | \
jq -r '.op.samples | to_entries[] | select(.key != "timestamp") |
.key + ": " + (.value | add / length | tostring)'
GET
Node-Level OS StatsEach node in the cluster should be monitoring individually using the endpoint listed below.
The following example demonstrates how to retrieve the system stats for the cluster.
NODE="172.17.0.2:8091"
curl \
--user Administrator:password \
--silent \
--request GET \
--data zoom=minute \
http://localhost:8091/pools/default/buckets/@system/nodes/$NODE/stats | \
jq -r -c '.op.samples |
" cpu_idle_ms: " + (.cpu_idle_ms | add / length | tostring) +
"\n cpu_local_ms: " + (.cpu_local_ms | add / length | tostring) +
"\n cpu_utilization_rate: " + (.cpu_utilization_rate | add / length | tostring) +
"\n hibernated_requests: " + (.hibernated_requests | add / length | tostring) +
"\n hibernated_waked: " + (.hibernated_waked | add / length | tostring) +
"\n mem_actual_free: " + (.mem_actual_free | add / length | tostring) +
"\n mem_actual_used: " + (.mem_actual_used | add / length | tostring) +
"\n mem_free: " + (.mem_free | add / length | tostring) +
"\n mem_total: " + (.mem_total | add / length | tostring) +
"\n mem_used_sys: " + (.mem_used_sys | add / length | tostring) +
"\n rest_requests: " + (.rest_requests | add / length | tostring) +
"\n swap_total: " + (.swap_total | add / length | tostring) +
"\n swap_used: " + (.swap_used | add / length | tostring)'
# loop over each of the buckets
for node in $(curl \
--user Administrator:password \
--silent \
--request GET \
http://localhost:8091/pools/nodes | \
jq -r '.nodes[] |
.hostname'
)
do
echo "$node OS Stats"
echo "-------------------------------------------------------"
# get the system stats for the specific node
curl \
--user Administrator:password \
--silent \
--request GET \
--data zoom=minute \
http://localhost:8091/pools/default/buckets/@system/nodes/$node/stats | \
jq -r '.op.samples | to_entries[] | select(.key != "timestamp") |
.key + ": " + (.value | add / length | tostring)'
done