Database Engine Architecture

The memory-first architecture of Couchbase Server enables it to maintain sub-millisecond latencies for core data access.

Couchbase Server depends on the following key components:
  • A highly efficient listener that manages networking and authentication.
  • A bucket engine that stores and retrieves information at the speed of memory access.
For Couchbase buckets, data is eventually persisted to disk through the storage engine, which enables the server to efficiently manage data much larger than the available memory.
Figure 1. Database engine architecture

Listeners

When client connection requests arrive at the database engine, the listener service receives the requests and authenticates the client. Upon successful authentication, the listener service assigns a worker thread to the connection to service its request. A single worker thread can handle multiple client connections using a non-blocking event loop.

The number of worker threads is determined automatically based on the number of CPU threads present on the node. By default, the number of worker threads is 0.75 x the number of CPU threads; for example, a node with 16 CPU threads runs 12 worker threads.

vBucket Manager and Managed Cache

After executing mutation and read requests, the server uses the managed cache to hold updated and newly created values. However, with a high flow of incoming operations, the system can run out of memory quickly. In order to reuse the memory, mutations are also queued for disk persistence. Once the mutated items are persisted, the server frees up the memory consumed by these items, making space for newer operations. This operation is called cache eviction. With a highly concurrent set of operations consuming memory and a high throughput disk subsystem persisting data to disk, there can be many pages eligible for reuse. The server uses the Least Recently Used (LRU) algorithm to identify the memory pages that can be reused.

It is important to size the RAM capacity appropriately for your working set: the portion of data that your application is working with at any given point in time and needs very low latency and high throughput access. In some applications, the working set is the entire data set, while in others it is a smaller subset.
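
One way to gauge whether the working set fits in memory is to check the resident-ratio statistics reported by the engine. A minimal sketch with cbstats; the statistic names (for example, vb_active_perc_mem_resident) are assumptions that can vary between versions:
# cbstats 10.5.2.54:11210 -b default all | grep resident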

Initialization and Warmup

Whenever you restart Couchbase Server or restore data, the node goes through a warmup process before it starts handling data requests again. During warmup, Couchbase Server loads data persisted on disk into RAM.

Couchbase Server provides an optimized warmup process that loads data sequentially from disk into RAM. It divides the data to be loaded and handles it in multiple phases. After the warmup process completes, the data is available for clients to read and write. The time needed for a node warmup depends on the system size, system configuration, the amount of data persisted in the node, and the ejection policy configured for the buckets.
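
You can monitor warmup progress from the command line with cbstats. A minimal sketch; the warmup statistics group reports values such as ep_warmup_state and ep_warmup_thread, though the exact statistic names can vary between versions:
# cbstats 10.5.2.54:11210 -b default warmup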

Couchbase Server identifies items that are frequently used, prioritizes them, and loads them before sequentially loading the remaining data. The frequently used items are tracked in an access log. The server performs a prefetch to get a list of the most frequently accessed keys and then fetches these keys before fetching any other items from disk.

The server runs a scanner process that determines which keys are most frequently used. The scanner runs on a preset schedule that you can change: use the command-line tool, cbepctl flush_param, to set the initial run time and the interval for the scanner process (see the sketch below). For example, you can configure the scanner process to run during a specific time period when a given list of keys needs to be identified and made available sooner.
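
A sketch of adjusting the access scanner with cbepctl, assuming the alog_task_time (initial run hour, UTC) and alog_sleep_time (interval in minutes) parameters; your version may also require administrator credentials:
# cbepctl 10.5.2.54:11210 -b default set flush_param alog_task_time 2
# cbepctl 10.5.2.54:11210 -b default set flush_param alog_sleep_time 1440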

The server can also switch into ready mode before it has loaded all of the stored documents into RAM, thereby enabling data to be served before all the stored items are loaded. Switching into ready mode is controlled by configurable settings that enable you to adjust the server warmup time.
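
A sketch of the thresholds that influence when warmup hands off to normal traffic, assuming the warmup_min_memory_threshold and warmup_min_items_threshold parameters (both percentages of the memory or items to load before serving requests) apply to your version:
# cbepctl 10.5.2.54:11210 -b default set flush_param warmup_min_memory_threshold 50
# cbepctl 10.5.2.54:11210 -b default set flush_param warmup_min_items_threshold 50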

Tunable Memory with Ejection Policy

Tunable memory enables you to configure the ejection policy for a bucket as one of the following (a configuration example follows this list):
  • Value-only ejection (default) removes data from the cache but keeps all keys and metadata fields for non-resident items. When a value-only ejection occurs, the item's value is removed from the cache while its key and metadata remain. Value-only ejection, also referred to as value ejection, is well suited for cases where low latency access is critical to the application and the keys and metadata for the bucket can easily fit in the allocated Data RAM quota.
  • Full metadata ejection removes all data, including keys, metadata, and values, from the cache for non-resident items. Full ejection is well suited for cases where the application has cold data that is not accessed frequently, or where the total data size is too large to fit in memory and higher-latency access to the data is acceptable. The performance of full-eviction cache management is significantly improved by Bloom filters. Bloom filters are enabled by default and cannot be disabled.
    Note: Full ejection may involve additional disk I/O per operation. For example, when a request arrives for a key that does not exist (a get miss), Couchbase Server checks for the key on disk even if the bucket is 100% resident.
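
You can select the ejection policy when you create or edit a bucket. A minimal command-line sketch, assuming the couchbase-cli bucket-edit option --bucket-eviction-policy (accepting valueOnly or fullEviction) is available in your version; note that changing the policy may restart the bucket:
# couchbase-cli bucket-edit -c 10.5.2.54:8091 -u Administrator -p password \
       --bucket default --bucket-eviction-policy fullEviction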

Working Set Management and Ejection

Couchbase Server actively manages the data stored in the caching layer; this includes the information that is frequently accessed by clients and that needs to be available for rapid reads and writes. When there are too many items in RAM, Couchbase Server removes certain data to create free space and to maintain system performance. This process is called "working set management" and the set of data in RAM is referred to as the "working set". In general, the working set consists of all the keys, metadata, and associated documents that are frequently used and require fast access. The process the server performs to remove data from RAM is known as ejection.

Couchbase Server performs ejections automatically. When ejecting information, it works in conjunction with the disk persistence system to ensure that data in RAM is persisted to disk and can be safely retrieved back into RAM whenever the item is requested.

In addition to the Data RAM quota for the caching layer, the engine uses two watermarks, mem_low_wat and mem_high_wat, to determine when it needs to start persisting more data to disk.

As more and more data is held in the caching layer, at some point in time it passes the mem_low_wat value. At this point, no action is taken. As data continues to load, it eventually reaches the mem_high_wat value. At this point, the Couchbase Server schedules a background job called item pager which ensures that items are migrated to disk and memory is freed up for other Couchbase Server items. This job runs until measured memory reaches mem_low_wat. If the rate of incoming items is faster than the migration of items to disk, the system returns errors indicating there is not enough space until there is sufficient memory available. The process of migrating data from the cache to make way for actively used information is called ejection and is controlled automatically through thresholds set on each configured bucket in the Couchbase Server cluster.
Figure 2. Working set management and ejection
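
You can inspect the current watermarks with cbstats and adjust them per bucket with cbepctl. A sketch, assuming the values are expressed as percentages of the bucket's RAM quota; Couchbase recommends keeping the default settings, and newer versions may require administrator credentials:
# cbstats 10.5.2.54:11210 -b default all | grep wat
# cbepctl 10.5.2.54:11210 -b default set flush_param mem_high_wat 85
# cbepctl 10.5.2.54:11210 -b default set flush_param mem_low_wat 75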

Depending on the ejection policy set for the bucket, the vBucket Manager removes just the document, or the document together with its key and metadata, for the item being ejected. Keeping an active working set with keys and metadata in RAM serves three important purposes in a system:
  • Couchbase Server uses the remaining key and metadata in RAM if a client requests that key; the node then fetches the item from disk and returns it into RAM.
  • The node can also use the keys and metadata in RAM for miss access. This means that it can quickly determine whether an item is missing and, if so, perform some action, such as adding it.
  • The expiration process in Couchbase Server uses the metadata in RAM to quickly scan for items that have expired and later removes them from disk. This process is known as the expiry pager and runs every 60 minutes by default; you can change the interval, as shown in the sketch after this list.
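
A sketch of changing the expiry pager interval with cbepctl; the exp_pager_stime parameter is expressed in seconds, so the 60-minute default corresponds to 3600:
# cbepctl 10.5.2.54:11210 -b default set flush_param exp_pager_stime 3600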

Not Recently Used (NRU) Items

All items in the server contain metadata indicating whether the item has been recently accessed or not. This metadata is known as not-recently-used (NRU). If an item has not been recently used, then the item is a candidate for ejection. When data in the cache exceeds the high water mark (mem_high_wat), the server evicts items from RAM.

Couchbase Server provides two NRU bits per item and also provides a replication protocol that can propagate items that are frequently read, but not mutated often.

Server processes decrement or increment an item's NRU to indicate whether the item is more or less frequently used. The following table lists the bit values with the corresponding scores and statuses:
Table 1. Scoring for NRU bit values
Binary NRU | Score | Access pattern                                                       | Description
00         | 0     | Set by write access to 00; decremented by read access or no access  | Most heavily used item
01         | 1     | Decremented by read access                                           | Frequently accessed item
10         | 2     | Initial value, or decremented by read access                         | Default value for new items
11         | 3     | Incremented by the item pager for eviction                           | Less frequently used item
There are two processes that change the NRU for an item:
  • When a client reads or writes an item, the server decrements NRU and lowers the item's score.
  • A daily process creates a list of frequently used items in RAM. After this process completes, the server increments one of the NRU bits.
Because these two processes change NRUs, they play an important role in identifying the candidate items for ejection.

You can configure the Couchbase Server settings to change the behavior during ejection. For example, you can specify the percentage of RAM to be consumed before items are ejected, or specify whether ejection should occur more frequently on replicated data than on original data. Couchbase recommends that the default settings be used.
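
For example, the split between active and replica ejections is governed by the pager_active_vb_pcnt parameter (the default of 40 produces the 40%/60% split described in the next section). A sketch with cbepctl, assuming this parameter name applies to your version; Couchbase recommends keeping the default:
# cbepctl 10.5.2.54:11210 -b default set flush_param pager_active_vb_pcnt 50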

Understanding the Item Pager

The item pager process runs periodically to remove documents from RAM. When the amount of RAM used by items reaches the high water mark (upper threshold), both active and replica data are ejected until the amount of RAM consumed (memory usage) reaches the low water mark (lower threshold). Using the default settings, active documents have a 40% chance of eviction while replica documents have a 60% chance of eviction. Both the high water mark and low water mark are expressed as a percentage amount of RAM, such as 80%.

You can change the high water mark and low water mark settings for a node by specifying a percentage amount of RAM, for example, 80%. Couchbase recommends that you use the following default settings:
Table 2. Default setting for RAM water marks
Version          | High water mark | Low water mark
2.0              | 75%             | 60%
2.0.1 and higher | 85%             | 75%
The item pager ejects items from RAM in two phases:
  1. Eject items based on NRU: The item pager scans the NRU of items, creates a list of items with an NRU score of 3, and ejects all the identified items. It then checks the RAM usage and repeats the process if the usage is still above the low water mark.
  2. Eject items based on algorithm: The item pager increments the NRU of all items by 1. For every item whose NRU is equal to 3, it generates a random number. If the random number for an item is greater than a specified probability, it ejects the item from RAM. The probability is based on the current memory usage, low water mark, and whether a vBucket is in an active or replica state. If a vBucket is in an active state, the probability of ejection is lower than if the vBucket is in a replica state.
    Table 3. Probability of ejection based on active vBuckets versus replica vBuckets (using default settings)
    Active vBucket | Replica vBucket
    40%            | 60%

Active Memory Defragmenter

Over time, the memory used by the managed cache of a running Couchbase Server can become fragmented. The storage engine includes an Active Memory Defragmenter task to defragment cache memory.

Cache fragmentation is a side-effect of how Couchbase Server organizes cache memory to maximize performance. Each page in the cache is typically responsible for holding documents of a specific size range. Over time, if memory pages assigned to a specific size range become sparsely populated (due to documents of that size being ejected or items changing in size), then the unused space in those pages cannot be used for documents of other sizes until a complete page is free and that page is re-assigned to a new size. Such effects are highly workload dependent and can result in memory that cannot be used efficiently by the managed cache.

The Active Memory Defragmenter attempts to address any fragmentation by periodically scanning the cache to identify pages which are sparsely used, and repacking the items stored on those pages to free up whole pages.
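
To see whether the defragmenter is doing useful work, you can inspect the engine statistics. A minimal sketch with cbstats; the counter names (such as ep_defragmenter_num_moved) are assumptions that may differ between versions:
# cbstats 10.5.2.54:11210 -b default all | grep defragmenter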

High Performance Storage

The scheduler and the shared thread pool provide high performance storage to Couchbase Server.

Scheduler
The scheduler is responsible for managing the shared thread pool and providing a fair allocation of resources to the jobs waiting to execute in the vBucket engine. The shared thread pool services requests across all buckets.

As an administrator, you can govern the allocation of resources by configuring a bucket’s disk I/O prioritization setting to be either high or low.

Shared thread pool
A shared thread pool is a collection of threads which are shared across multiple buckets for long running operations such as disk I/O. Each node in the cluster has a thread pool that is shared across multiple vBuckets on the node. Based on the number of CPU cores on a node, the database engine spawns and allocates threads when a node instance starts up.
Using a shared thread pool provides the following benefits:
  • Better parallelism for worker threads with more efficient I/O resource management.
  • Better system scalability with more buckets being serviced with fewer worker threads.
  • Availability of task prioritization when the disk I/O priority setting is configured for a bucket.

Disk I/O priority

Disk I/O priority enables workload priorities to be set at the bucket level.

You can configure the bucket priority settings at the bucket level and set the value to be either high or low. Bucket priority settings determine whether I/O tasks for a bucket must be queued in the low or high priority task queues. Threads in the global pool poll the high priority task queues more often than the low priority task queues. When a bucket has a high priority, its I/O tasks are picked up at a higher frequency and thus, processed faster than the I/O tasks belonging to a low priority bucket.

You can configure the bucket I/O priority settings during initial setup and change the settings later, if needed. However, changing a bucket's I/O priority after the initial setup results in a restart of the bucket, and client connections are reset.
Figure 3. Create bucket settings
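
In addition to the UI settings shown above, you can change the priority from the command line. A sketch with couchbase-cli, assuming the bucket-edit option --bucket-priority (accepting low or high) is available in your version; remember that the change restarts the bucket and resets client connections:
# couchbase-cli bucket-edit -c 10.5.2.54:8091 -u Administrator -p password \
       --bucket default --bucket-priority high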

Versions of Couchbase Server earlier than 3.0 required the I/O thread allocation per bucket to be configured manually. When you upgrade from a 2.x version to a 3.x or higher version, Couchbase Server converts the existing thread value to either a high or low priority based on the following criteria:
  • Buckets allocated six to eight (6-8) threads in Couchbase Server 2.x are marked high priority in bucket settings after the upgrade to 3.x or later.
  • Buckets allocated three to five (3-5) threads in Couchbase Server 2.x are marked low priority in bucket settings after the upgrade to 3.x or later.

Monitoring Scheduler

You can use the cbstats command with the raw workload option to view the status of the threads as shown in the following example.
# cbstats 10.5.2.54:11210 -b default raw workload
     
     ep_workload:LowPrioQ_AuxIO:InQsize:   3
     ep_workload:LowPrioQ_AuxIO:OutQsize:  0
     ep_workload:LowPrioQ_NonIO:InQsize:   33
     ep_workload:LowPrioQ_NonIO:OutQsize:  0
     ep_workload:LowPrioQ_Reader:InQsize:  12
     ep_workload:LowPrioQ_Reader:OutQsize: 0
     ep_workload:LowPrioQ_Writer:InQsize:  15
     ep_workload:LowPrioQ_Writer:OutQsize: 0
     ep_workload:num_auxio:                1
     ep_workload:num_nonio:                1
     ep_workload:num_readers:              1
     ep_workload:num_shards:               4
     ep_workload:num_sleepers:             4
     ep_workload:num_writers:              1
     ep_workload:ready_tasks:              0
     ep_workload:shard0_locked:            false
     ep_workload:shard0_pendingTasks:      0
     ep_workload:shard1_locked:            false
     ep_workload:shard1_pendingTasks:      0
     ep_workload:shard2_locked:            false
     ep_workload:shard2_pendingTasks:      0
     ep_workload:shard3_locked:            false
     ep_workload:shard3_pendingTasks:      0