Sizing guidelines

The primary considerations for sizing a Couchbase Server cluster are the number of nodes and node size.

When sizing your Couchbase Server cluster, ask the following questions:

  • How many nodes do I need?
  • How large (RAM, CPU, disk space) must those nodes be?

To determine the number of nodes needed for a cluster, consider the following:

  • RAM
  • Disk throughput and sizing
  • Network bandwidth
  • Data distribution and safety

Due to the in-memory nature of Couchbase Server, RAM is usually the determining factor for sizing. However, the primary sizing factor depends on the actual data set and the information being stored. For example:

  • If you have a very small data set with a very high load, you must calculate the size based more on the network bandwidth than RAM.
  • If you have a very high write rate, you’ll need more nodes to support the disk throughput that is needed to persist all that data (and likely more RAM to buffer the incoming writes).
  • Even with a very small dataset under low load, you might want three nodes for proper distribution and safety.

You can increase the capacity of your cluster (RAM, disk, CPU, or network) by increasing the number of nodes within the cluster. Each of these capacities increases linearly as the cluster grows.

RAM sizing

RAM is usually the most critical sizing parameter. It is also the one that can have the most impact on performance and stability.

Working set

The working set is the data that the client application actively uses at any point in time. Ideally, all of the working set should live in memory, and its size drives how much memory you need.

Memory quota

It is very important that your Couchbase cluster’s size corresponds to the working set size and the total amount of data you expect. The goal is to size the available RAM so that all your document IDs, the document ID metadata, and the working set values fit, and so that memory usage stays just below the point at which Couchbase Server starts evicting values to disk (the High Water Mark).

Important: You will not be able to allocate all of a machine's RAM to a Couchbase Server node (the per_node_ram_quota parameter) because other programs might be running on the machine.

How much memory and disk space per node you will need depends on several different variables.

Important: The maximum memory quota that Couchbase Server allows is computed from the total amount of RAM in your machines. For optimal performance, it is recommended to leave a buffer of at least 20% between the quota size and the physical RAM size. For workloads with views, it is advised to leave even more memory for the operating system's page cache; for such configurations, it is currently recommended to set the memory quota to 60% of the physical RAM size.
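
As a quick illustration of those two rules of thumb, here is a minimal Python sketch; the 80% and 60% figures restate the recommendations above, and the 16 GB node size is just an example:

    # Rough per-node memory quota ceiling (sketch; the percentages restate the
    # guidance above, they are not limits enforced by the server).
    def max_per_node_ram_quota_gb(physical_ram_gb, uses_views=False):
        if uses_views:
            return physical_ram_gb * 0.60  # leave extra room for the OS page cache
        return physical_ram_gb * 0.80      # keep at least a 20% buffer

    print(max_per_node_ram_quota_gb(16))                   # 12.8 GB on a 16 GB node
    print(max_per_node_ram_quota_gb(16, uses_views=True))  # 9.6 GB when views are used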

The following are per-bucket calculations and must be summed up across all buckets. If all the buckets have the same configuration, treat the total data as a single bucket. There is no per-bucket overhead that needs to be considered.

The calculation uses the following variables:

  • documents_num: The total number of documents you expect in your working set
  • ID_size: The average size of document IDs
  • value_size: The average size of values
  • number_of_replicas: The number of copies of the original data you want to keep
  • working_set_percentage: The percentage of your data you want in memory
  • per_node_ram_quota: How much RAM can be assigned to Couchbase

Use the following items to calculate how much memory you need:

  • Metadata per document (metadata_per_document): The amount of memory that Couchbase Server needs to store metadata per document. Metadata uses 56 bytes per document, and all of it needs to live in memory while a node is running and serving data.
  • Type of storage (SSD or spinning disk): SSDs give better I/O performance than spinning disks.
  • headroom: The cluster needs additional overhead to store metadata; this space is called headroom and requires approximately 25–30% more space than the raw RAM requirements for your dataset. Because SSDs are faster than spinning (traditional) hard disks, set aside 25% of memory as headroom for SSDs and 30% for spinning hard disks.
  • High Water Mark (high_water_mark): By default, the high water mark for a node’s RAM is set at 85%.

The rough guidelines to size your cluster are as follows:

  • no_of_copies = 1 + number_of_replicas
  • total_metadata = (documents_num) * (metadata_per_document + ID_size) * (no_of_copies). All of the metadata needs to live in memory.
  • total_dataset = (documents_num) * (value_size) * (no_of_copies)
  • working_set = total_dataset * (working_set_percentage)
  • Cluster RAM quota required = (total_metadata + working_set) * (1 + headroom) / (high_water_mark)
  • number of nodes = Cluster RAM quota required / per_node_ram_quota

Important: You will need at least the number of replicas + 1 nodes regardless of your data size.

The following is a sample sizing calculation:

Input variables:
  • documents_num = 1,000,000
  • ID_size = 100
  • value_size = 10,000
  • number_of_replicas = 1
  • working_set_percentage = 20%

Constants:
  • Type of storage = SSD
  • overhead_percentage = 25%
  • metadata_per_document = 56 for 2.1 and higher, 64 for 2.0.x
  • high_water_mark = 85%

Calculation:
  • no_of_copies = 1 + 1 = 2 (one original and one replica)
  • total_metadata = 1,000,000 * (100 + 56) * (2) = 312,000,000
  • total_dataset = 1,000,000 * (10,000) * (2) = 20,000,000,000
  • working_set = 20,000,000,000 * (0.2) = 4,000,000,000
  • Cluster RAM quota required = (312,000,000 + 4,000,000,000) * (1 + 0.25) / (0.85) = 6,341,176,470

For example, if you have 8 GB machines and you want to use 6 GB for Couchbase:

number of nodes =
    Cluster RAM quota required / per_node_ram_quota =
    6.3 GB / 6 GB = 1.06, rounded up to 2 nodes
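
Pulling the formulas above together, the following is a minimal Python sketch of the same calculation. The function name, parameter names, and defaults (56-byte metadata, 25% headroom for SSDs, 85% high water mark) simply restate the values above and are illustrative, not part of Couchbase Server; the inputs are taken from the sample calculation.

    # Rough RAM-sizing sketch using the formulas above (SSD storage assumed,
    # so headroom = 25%; metadata_per_document = 56 bytes, as for 2.1 and higher).
    import math

    def cluster_ram_quota(documents_num, id_size, value_size, number_of_replicas,
                          working_set_percentage, metadata_per_document=56,
                          headroom=0.25, high_water_mark=0.85):
        no_of_copies = 1 + number_of_replicas
        total_metadata = documents_num * (metadata_per_document + id_size) * no_of_copies
        total_dataset = documents_num * value_size * no_of_copies
        working_set = total_dataset * working_set_percentage
        return (total_metadata + working_set) * (1 + headroom) / high_water_mark

    quota = cluster_ram_quota(1_000_000, 100, 10_000, 1, 0.20)
    print(quota)                           # ~6,341,176,470 bytes, as in the example above
    print(math.ceil(quota / (6 * 10**9)))  # 2 nodes with a 6 GB per-node quota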

Network bandwidth

Network bandwidth is not normally a significant factor to consider for cluster sizing. However, clients require network bandwidth to access information in the cluster. Nodes also need network bandwidth to exchange information (node to node).

In general, calculate your network bandwidth requirements using the following formula:

Bandwidth = (operations per second * item size) + overhead for rebalancing 

Calculate the operations per second with the following formula:

Operations per second = Application reads + (Application writes * Replica copies) 
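
As a hypothetical illustration of these two formulas, here is a short Python sketch; all of the workload numbers below are made up for the example, and the 10% rebalance allowance is an assumption rather than a Couchbase recommendation:

    # Illustrative bandwidth estimate; every input number here is hypothetical.
    application_reads = 10_000   # reads per second
    application_writes = 5_000   # writes per second
    replica_copies = 2           # copies each write is sent to (assumed: original + 1 replica)
    item_size = 10_000           # average item size in bytes

    ops_per_second = application_reads + (application_writes * replica_copies)
    rebalance_overhead = 0.10 * ops_per_second * item_size  # assumed 10% allowance for rebalancing
    bandwidth = (ops_per_second * item_size) + rebalance_overhead

    print(bandwidth / 10**6, "MB/s")  # 220.0 MB/s for this made-up workload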

Data safety

Make sure you have enough nodes (and the right configuration) in your cluster to keep your data safe. There are two areas to keep in mind: how you distribute data across nodes and how many replicas you store across your cluster.

Data distribution

More nodes are better than fewer. If you only have two nodes, your data is split across the two nodes, and half of your dataset is impacted if one goes away. On the other hand, with ten nodes, only 10% of the dataset is impacted if one node goes away. Even with automatic failover, there is still a period when data is unavailable if a node fails, which is mitigated by having more nodes.

After a failover, the cluster takes on an extra load. The question is: how heavy is that extra load, and are you prepared for it? Again, with only two nodes, each one needs to be ready to handle the entire load. With ten, each node only needs to be able to take on an extra tenth of the workload should one fail.

While two nodes do provide a minimal level of redundancy, we recommend that you always use at least three nodes.

Replication

Couchbase Server allows you to configure up to three replicas (creating four copies of the dataset). In the event of a failure, you can only “fail over” (either manually or automatically) as many nodes as you have replicas. For example:

  • In a five node cluster with one replica, if one node goes down, you can fail it over. If a second node goes down, you no longer have enough replica copies to fail over to and will have to go through a slower process to recover.
  • In a five node cluster with two replicas, if one node goes down, you can fail it over. If a second node goes down, you can fail it over as well. Should a third node go down, you no longer have replica copies to fail over to.

After a node goes down and is failed over, try to replace that node as soon as possible and rebalance. The rebalance recreates the replica copies (if you still have enough nodes to do so).

Tip: As a rule of thumb, configure the following:
  • One replica for up to five nodes.
  • One or two replicas for five to ten nodes.
  • One, two, or three replicas for over ten nodes.
While there can be variations to this, there are diminishing returns from having more replicas in smaller clusters.

Hardware requirements

In general, Couchbase Server has very low hardware requirements and is designed to be run on commodity or virtualized systems. However, as a rough guide to the primary concerns for your servers, the following is recommended:

  • RAM: This is your primary consideration. We use RAM to store active items, and that is the key reason Couchbase Server has such low latency.
  • CPU: Couchbase Server has very low CPU requirements. The server is multi-threaded and, therefore, benefits from a multi-core system. We recommend machines with at least four, and preferably eight, physical cores.
  • Disk: By decoupling the RAM from the I/O layer, Couchbase Server can support low-performance disks better than other databases. As a best practice, have separate devices for server install, data directories, and index directories.
  • Network: Most configurations work with Gigabit Ethernet interfaces. Faster options such as 10 Gigabit Ethernet and InfiniBand provide spare capacity.

Known working configurations include SAN, SAS, SATA, SSD, and EBS with the following recommendations:

  • SSDs have been shown to provide a great performance boost both in terms of draining the write queue and also in restoring data from disk (either on cold-boot or for purposes of rebalancing).
  • RAID provides better throughput and reliability.
  • Striping across EBS volumes (in Amazon EC2) has been shown to increase throughput.

Considerations for cloud environments (such as Amazon EC2)

Due to the unreliability and general lack of consistent I/O performance in cloud environments, you should lower the per-node RAM footprint and increase the number of nodes. Doing so gives you better disk throughput and improves rebalancing, since each node has to store (and transmit) less data. Distributing the data across more nodes also diminishes the impact of losing any single node.