Ongoing monitoring and maintenance

Couchbase Server provides a number of statistics for monitoring a cluster and for identifying and diagnosing problems.

To understand how the cluster is working and whether it is working effectively, use the following statistics:

  • Memory Used (mem_used) - The amount of memory currently in use. If mem_used reaches the RAM quota, the server returns OOM_ERROR. Keep mem_used below ep_mem_high_wat, the mark at which data starts to be ejected from RAM.
  • Disk Write Queue Size (ep_queue_size) - The amount of data waiting to be written to disk.
  • Cache Hits (get_hits) - As a rule of thumb, this should be at least 90% of total requests.
  • Cache Misses (ep_bg_fetched) - Ideally this is low, and certainly lower than get_hits. High or increasing values indicate that data your application expects to find in memory is not there.
  • No Document Available (get_misses) - The number of requests for documents that Couchbase Server does not have.
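
The rules of thumb above can be turned into a quick check. The following is a minimal sketch, not an official tool: the function name cache_report is hypothetical, the input is assumed to be "name: value" lines as printed by cbstats, and "total requests" is assumed to mean get_hits + get_misses, which the text above does not define.

```shell
#!/bin/sh
# Sketch: judge cache effectiveness from "name: value" statistic lines on stdin.
# Assumption (not from the text): total requests = get_hits + get_misses.
cache_report() {
    awk '
        # Store each statistic under its name, minus the trailing colon.
        { key = $1; sub(/:$/, "", key); stat[key] = $2 }
        END {
            hits    = stat["get_hits"] + 0
            misses  = stat["get_misses"] + 0
            fetched = stat["ep_bg_fetched"] + 0
            total   = hits + misses
            if (total == 0) { print "hit_ratio=n/a (no requests yet)"; exit }
            pct = 100 * hits / total
            # Rule of thumb: hits should be at least 90% of requests.
            printf "hit_ratio=%.1f%% %s\n", pct, (pct >= 90 ? "OK" : "LOW")
            # Background fetches should stay below the number of hits.
            printf "bg_fetched=%d %s\n", fetched, (fetched < hits ? "OK" : "HIGH")
        }'
}
```

A possible invocation is `cbstats IP:11210 all | cache_report`, where IP is the address of the node to inspect.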

Another key indicator of cluster performance is the water mark, which determines when the server must start freeing memory. Two statistics relate to water marks:

  • High Water Mark (ep_mem_high_wat) - The system starts ejecting data from memory when this water mark is reached. An ejected value must be fetched from disk on access before it can be returned to the client.
  • Low Water Mark (ep_mem_low_wat) - Reaching the low water mark indicates that memory usage is moving toward a critical point, and administrative action should be taken before the high water mark is reached.
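
As an illustration of how the two water marks partition memory usage, the sketch below classifies mem_used against them. It is only a sketch under stated assumptions: the function name mem_state and the OK/WARNING/CRITICAL labels are hypothetical, and the input is assumed to be "name: value" lines as printed by cbstats.

```shell
#!/bin/sh
# Sketch: compare mem_used with the water marks read from stdin.
# The three labels are illustrative, not Couchbase terminology.
mem_state() {
    awk '
        # Store each statistic under its name, minus the trailing colon.
        { key = $1; sub(/:$/, "", key); stat[key] = $2 }
        END {
            used = stat["mem_used"] + 0
            low  = stat["ep_mem_low_wat"] + 0
            high = stat["ep_mem_high_wat"] + 0
            if (high == 0)
                print "UNKNOWN: water mark statistics missing"
            else if (used >= high)
                print "CRITICAL: mem_used has reached the high water mark"
            else if (used >= low)
                print "WARNING: mem_used has reached the low water mark"
            else
                print "OK: mem_used is below the low water mark"
        }'
}
```

For example, `cbstats IP:11210 all | mem_state` would print a single status line for the node at IP.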

Tip: Use the following command to retrieve these statistics:
shell> cbstats IP:11210 all | \
    egrep "todo|ep_queue_size|_eject|mem|max_data|hits|misses"

The output includes the following statistics:

ep_flusher_todo:
ep_max_data_size:
ep_mem_high_wat:
ep_mem_low_wat:
ep_num_eject_failures:
ep_num_value_ejects:
ep_queue_size:
mem_used:
ep_bg_fetched:
get_hits:
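
Because the egrep pattern matches substrings, it may return more statistics than the ten listed above. The following sketch narrows the output to exactly those names. The function name select_stats and the name=value output format are hypothetical, and the input is assumed to be "name: value" lines.

```shell
#!/bin/sh
# Statistic names taken from the list above, joined onto one line.
WANTED="ep_flusher_todo ep_max_data_size ep_mem_high_wat ep_mem_low_wat \
ep_num_eject_failures ep_num_value_ejects ep_queue_size mem_used \
ep_bg_fetched get_hits"

# Sketch: print only the wanted statistics from stdin as name=value pairs.
select_stats() {
    awk -v wanted="$WANTED" '
        # Build a lookup table of the names to keep.
        BEGIN { n = split(wanted, names, " "); for (i = 1; i <= n; i++) keep[names[i]] = 1 }
        # Strip the trailing colon and print exact matches only.
        { key = $1; sub(/:$/, "", key); if (key in keep) print key "=" $2 }'
}
```

A possible invocation is `cbstats IP:11210 all | select_stats`.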

Tip: Monitor disk space, CPU usage, and swapping on all nodes using standard system monitoring tools.