High Availability and Replication Architecture
Couchbase Server provides high availability for reading and for writing of data through a variety of features. For writing, the ability to get data off of a single node as quickly as possible is paramount to avoid any data loss due to a failure of that individual node.
Database Change Protocol (DCP)
Database Change Protocol (DCP) is the protocol used to stream bucket level mutations. Given the distributed nature of Couchbase Server, DCP sits at the heart of Couchbase Server architecture. DCP is used for high speed replication of mutations to maintain replica vBuckets, incremental MapReduce views and spatial views, Global Secondary Indexes (GSIs), cross datacenter replication (XDCR), backups, and many other external connectors.
DCP is a memory based replication protocol that is ordering, resumable, and consistent. DCP immediately streams any changes made to documents in memory to the destination. The memory based communication reduces latency and greatly boosts availability, prevents data loss, improves freshness of indexes, and more.
- Application client
- A normal client that transmits read, write, update, delete, and query requests to the server cluster, usually for an interactive web application.
- DCP client
- A special client that streams data from one or more Couchbase server nodes, for purposes of intra-cluster replication (to be a backup in case the master server fails), indexing (to answer queries in aggregate about the data in the whole cluster), XDCR (to replicate data from one cluster to another cluster, usually located in a separate data center), incremental backup, and any 3rd party component that wants to index, monitor, or analyze Couchbase data in near real time, or in batch mode on a schedule.
- Failover log
- A list of previously known vBucket versions for a vBucket. If a client connects to a server and was previously connected to a different version of a vBucket than that server is currently working with, the failure log is used to find a rollback point.
- History branch
- Whenever a node becomes the master node for a vBucket in the event of a failover or uncontrolled shutdown and restart, if it was not the farthest ahead of all processes watching events on that partition and starts taking mutations, it might reuse sequence numbers that other processes have already seen on this partition. This can be a history branch, and the new master must assign the vBucket a new vBucket version so that DCP clients in the distributed system can recognize that they are ahead of the new master and roll back changes at the point this happened in the stream. During a controlled handover from an old master to a new master, the sequence history cannot have branches, so there is no need to assign a new version to the vBucket being handed off. Controlled handovers occur in the case of a rebalance for elasticity (such as adding or removing a node) or a swap rebalance in the case of an upgrade (such as adding a new version of Couchbase Server to a cluster or removing an old version of Couchbase Server).
- A mutation is an event that deletes a key or changes the value a key points to. Mutations occur when transactions such as create, update, delete or expire are executed.
- Rollback point
- The server uses the failover log to find the first possible history branch between the last time a client was receiving mutations for a vBucket and now. The sequence number of that history branch is the rollback point that is sent to the client.
- Sequence number
- Each mutation that occurs on a vBucket is assigned a number, which strictly increases as events are assigned numbers (there is no harm in skipping numbers, but they must increase), that can be used to order that event against other mutations within the same vBucket. This does not give a cluster-wide ordering of events, but it does enable processes watching events on a vBucket to resume where they left off after a disconnect.
- A master or replica node that serves as the network storage component of a cluster. For a given partition, only one node can be master in the cluster. If that node fails or becomes unresponsive, the cluster selects a replica node to become the new master.
- To send a client a consistent picture of the data it has, the server takes a snapshot of the state of its disk write queue or the state of its storage, depending on where it needs to read from to satisfy the client’s current requests. This snapshot represents the exact state of the mutations it contains at the time it was taken. Using this snapshot, the server can send the items that existed at the point in time the snapshot was taken, and only those items, in the state they were in when the snapshot was taken. Snapshots do not imply that everything is locked or copied into a new structure. In the current Couchbase storage subsystem, snapshots are essentially "free." The only cost is when a file is copy compacted to remove garbage and wasted space, the old file cannot be freed until all snapshot holders have released the old file. It’s also possible to "kick" a snapshot holder if the system determines the holder of the snapshot is taking too long. DCP clients that are kicked can reconnect and a new snapshot will be obtained, allowing it to restart from where it left off.
- Couchbase splits the key space into a fixed amount of vBuckets, usually 1024. Keys are deterministically assigned to a vBucket, and vBuckets are assigned to nodes to balance the load across the cluster.
- vBucket stream
- A grouping of messages related to receiving mutations for a specific vBucket. This includes mutation, deletion, and expiration messages and snapshot marker messages. The transport layer provides a way to separate and multiplex multiple streams of information for different vBuckets. All messages between snapshot marker messages are considered to be one snapshot. A snapshot contains only the recent update for any given key within the snapshot window. It might require several complete snapshots to get the current version of the document.
- vBucket version
- A universally unique identifier (UUID) and sequence number pair associated with a vBucket. A new version is assigned to a vBucket by the new master node any time there might have been a history branch. The UUID is a randomly generated number, and the sequence number is the sequence number that vBucket last processed at the time the version was created.
Intra-cluster replication involves replicas that are placed on another node in the same cluster.
Replicas are copies of data that are placed on another node in a cluster. The source of the replicated vBucket data is called the active vBucket. Active vBuckets perform read and write operations on individual documents. The destination vBucket is called the replica vBucket. Replica vBuckets receive a continuous stream of mutations from the active vBucket through the Database Change Protocol (DCP). Although replica vBuckets are not accessed typically, they can respond to read requests.
Cross Datacenter Replication
Using the cross datacenter replication (XDCR) capability you can set up replication of data between clusters. XDCR helps protect against data center failures and also helps maintain data locality in globally distributed mission critical applications.
As an administrator, you can use XDCR to create replication relationships that replicate data from a source cluster’s bucket to a destination cluster’s bucket. You can also set up complex topologies across many clusters such as bidirectional topologies, ring topologies, tree structured topologies, and more.
In XDCR, each replication stream is set up between a source and destination bucket on separate clusters. Each bucket on each cluster can be a source or a destination for many replication definitions in XDCR. XDCR is a "push-based" replication and so each source node runs the XDCR agent and pushes mutations to the destination bucket.
The XDCR agent on the source node uses direct access communication (XMem) protocol to propagate mutations from the source vBucket to the matching vBucket on the destination cluster. Since there are equal number of vBuckets (default is 1024) on both the source and the destination clusters, there is a one-to-one match for each source and destination vBucket.
It is important to note that XDCR does not require source and destination clusters to have identical topology. XDCR agents are topology aware and match the destination vBucket with the local vBucket, propagating mutations directly from vBucket to vBucket.
Conflict Resolution in XDCR
When the same dataset is being mutated on both ends of an XDCR replication (source and remote), then there is a high probability that conflicts could occur. XDCR automatically performs conflict resolution for different document versions on source and destination clusters.
The algorithm is designed to consistently select the same document on either a source or destination cluster. For each stored document, XDCR perform checks of metadata to resolve conflicts. It checks the following:
Revision ID, a numerical sequence that is incremented on each mutation
Expiration (TTL) value
XDCR conflict resolution uses revision ID as the first field to resolve conflicts between two writes across clusters. Revision IDs are maintained per key and are incremented with every update to the key. Revision IDs keep track of number of mutations to a key, thus XDCR conflict resolution can be best characterized as "the most updates wins".
If a document does not have the highest revision number, changes to this document will not be stored or replicated; instead the document with the highest score will take precedence on both clusters. Conflict resolution is automatic and does not require any manual correction or selection of documents.
By default XDCR fetches metadata once for every document that it replicates, before it replicates the document to the destination cluster. XDCR fetches metadata on the source cluster and looks at the number of revisions for a document. It compares this number with the number of revisions on the destination cluster and the document with more revisions is considered the ‘winner.’
If XDCR determines a document from a source cluster will win conflict resolution, it puts the document into the replication queue. If the document will lose conflict resolution because it has a lower number of mutations, XDCR will not put it into the replication queue. Once the replicated document reaches the destination, conflict resolution will be performed again by the remote cluster to ensure that the correct document 'wins'. This is to ensure that the correct version of the document 'wins' even if the document on the remote cluster has changed since the initial replication from the source cluster. If the document from the source cluster is still the ‘winner’ it will be persisted onto disk at the destination. The destination cluster will discard the document version with the lowest number of mutations.
The key point is that the number of document mutations is the main factor that determines whether XDCR keeps a document version or not. This means that the document that has the most recent mutation may not be necessarily the one that wins conflict resolution. If both documents have the same number of mutations, XDCR selects a winner based on other document metadata. Precisely determining which document is the most recently changed is often difficult in a distributed system. The algorithm Couchbase Server uses does ensure that each cluster can independently reach a consistent decision on which document wins.