High Availability and Disaster Recovery

Couchbase Server supports a range of high availability and disaster recovery (HA/DR) strategies depending on an organization’s needs, standard operating procedures, and budget.

Although Couchbase Server supports true zero-downtime deployments, less stringent downtime objectives can often be met more economically and without serious impact to business continuity, even for mission-critical applications. The right tools and configurations depend on an organization's recovery point objective (RPO), which defines how much data can be lost in case of failure, and recovery time objective (RTO), which defines how much time the technical team has to restore the system and its data. Most HA/DR strategies rely on a multi-pronged approach: maximizing availability, increasing redundancy within and across data centers, and performing regular backups.

Availability

Couchbase Server delivers key high availability features such as zero downtime administration and maintenance, built-in data redundancy, and automatic failover.

Factors that increase system uptime and availability include:
  • Number of replicas
  • Number of racks or availability zones
  • Number of clusters
Rack/zone awareness helps protect against multi-node failure events by separating active data and its replicas into "groups", which can then be mapped so that they occupy different racks, availability zones, or VM hosts. This is still within the context of a single Couchbase Server cluster, and the entire active data set remains evenly distributed across all nodes and groups.
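
As an illustration, the sketch below creates two server groups through Couchbase's REST interface (the /pools/default/serverGroups endpoint); the host, credentials, and group names are placeholders, and exact endpoints and parameters may differ between server versions.

    # Sketch: create two server groups ("rack" definitions) via the REST API.
    # Host, credentials, and group names below are illustrative placeholders.
    import requests

    BASE = "http://cb-node1.example.com:8091"
    AUTH = ("Administrator", "password")

    for group in ("Rack-A", "Rack-B"):
        resp = requests.post(f"{BASE}/pools/default/serverGroups",
                             auth=AUTH, data={"name": group})
        resp.raise_for_status()

    # Nodes are then assigned to a group when they are added to the cluster
    # (or moved between groups), so that active data and its replicas land
    # on different racks, zones, or VM hosts.
    print(requests.get(f"{BASE}/pools/default/serverGroups", auth=AUTH).json())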

Geographic Distribution and XDCR

Cross Data Center Replication (XDCR) replicates data between two or more autonomous Couchbase Server clusters. It is typically used to replicate data between clusters in different data centers and geographic regions, and can also be used to sync a second Couchbase Server cluster within the same data center.

XDCR serves an important role in high availability / disaster recovery, performance, and load distribution.
  • For disaster recovery, one or more clusters can act as hot standbys, enabling cluster-level failover by taking over load as soon as a cluster stops responding.
  • In case of serious failures, XDCR can also be used to recover data from a remote cluster. The result is similar to recovery using a backup but often faster.
  • In geographically distributed data centers, XDCR can improve performance by placing data close to end users. Users can be routed to the Couchbase Server cluster that serves their request with the lowest latency, typically the one that is physically closest to them. XDCR with bi-directional replication can create a truly global Couchbase Server deployment by allowing full read and write access at every endpoint.
    Note: XDCR with bi-directional replication does not create a global "cluster" of Couchbase Servers. Rather, each cluster is autonomous, is managed completely independently, and data is replicated to and/or from each cluster.
XDCR is simple.
It takes just two clicks to associate two clusters and two more clicks to create a replication stream from one bucket to another. Additional replication streams can be created with the same simple steps to support bi-directional replication and more complex topologies. There is no restriction on the size of any cluster involved in XDCR replication and the streams are automatically maintained through cluster topology changes and failures.
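
As a rough equivalent outside the UI, the sketch below performs the same two steps against Couchbase's XDCR REST endpoints (/pools/default/remoteClusters and /controller/createReplication); hosts, credentials, and bucket names are placeholders, and parameter names may vary by server version.

    # Sketch: the "two clicks" expressed as REST calls.
    # Hosts, credentials, and bucket names are illustrative placeholders.
    import requests

    SOURCE = "http://source-cb.example.com:8091"
    AUTH = ("Administrator", "password")

    # 1. Register the destination cluster as a remote cluster reference.
    requests.post(f"{SOURCE}/pools/default/remoteClusters", auth=AUTH, data={
        "name": "dr-cluster",
        "hostname": "dest-cb.example.com:8091",
        "username": "Administrator",
        "password": "password",
    }).raise_for_status()

    # 2. Create a continuous replication stream from one bucket to another.
    requests.post(f"{SOURCE}/controller/createReplication", auth=AUTH, data={
        "fromBucket": "orders",
        "toCluster": "dr-cluster",
        "toBucket": "orders",
        "replicationType": "continuous",
    }).raise_for_status()
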
XDCR is efficient.
XDCR replicates data directly from memory to memory between the nodes of the source and destination clusters of a stream. There is no central gateway, which greatly reduces complexity and allows throughput to be highly parallelized. Mutations are automatically de-duplicated to cut down on unnecessary transmissions, and XDCR can be paused and resumed administratively. To recover from situations such as a temporary disconnection between clusters, XDCR uses internal checkpoints to keep track of exactly which mutations it has already streamed, and it resumes streaming from the point where it left off. With regular-expression filtering on key names, users can limit what is transferred between clusters, for example to partially replicate regional data from a central, global database.
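
The following is a conceptual sketch, not Couchbase internals, of how regular-expression filtering on key names decides which mutations a stream forwards; the key prefix used here is purely illustrative.

    # Conceptual sketch: key-name filtering on an outbound replication stream.
    import re

    # Hypothetical filter: only replicate documents whose keys start with "eu::".
    FILTER = re.compile(r"^eu::")

    def should_replicate(key: str) -> bool:
        return bool(FILTER.match(key))

    mutations = ["eu::customer::1001", "us::customer::2002", "eu::order::77"]
    outbound = [k for k in mutations if should_replicate(k)]
    print(outbound)  # ['eu::customer::1001', 'eu::order::77']
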
XDCR is eventually consistent and features automatic conflict resolution.
To determine the final state of a document across two clusters, XDCR picks the version that has received the most updates (the highest revision number), if there is one. Otherwise, it breaks the tie using the CAS value, document flags, and TTL expiration value. Each cluster applies the same deterministic logic, so document consistency is maintained across the clusters.
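
A simplified, conceptual sketch of this "most updates wins" resolution, not the actual Couchbase implementation, might look like the following; the field names are illustrative.

    # Conceptual sketch: compare revision counts first, then fall back to the
    # CAS value, flags, and expiration so every cluster picks the same winner.
    from dataclasses import dataclass

    @dataclass
    class DocVersion:
        rev: int      # number of updates this version has seen
        cas: int      # compare-and-swap value
        flags: int
        expiry: int

    def resolve(a: DocVersion, b: DocVersion) -> DocVersion:
        # Tuple ordering makes the decision deterministic, so both clusters
        # converge on the same version regardless of where it was written.
        key_a = (a.rev, a.cas, a.flags, a.expiry)
        key_b = (b.rev, b.cas, b.flags, b.expiry)
        return a if key_a >= key_b else b

    local  = DocVersion(rev=5, cas=111, flags=0, expiry=0)
    remote = DocVersion(rev=7, cas=108, flags=0, expiry=0)
    print(resolve(local, remote))  # remote wins: it has seen more updates
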
XDCR supports any topology.
XDCR can perform unidirectional or bidirectional replication and supports spoke, circular, or star topologies of one-to-one, one-to-many, or many-to-one clusters. For example, administrators can push from a central data center out to multiple data centers, aggregate data from multiple clusters into a central cluster, or make multiple data centers each function as masters. XDCR can also be used to synchronize a development or analytics cluster, and it can keep an Elasticsearch or Solr cluster updated for full-text indexing requirements.

For more information about managing XDCR and tuning XDCR performance, see Cross Datacenter Replication (XDCR) in the Administration guide.

Backup and restore

Backups protect against worst-case disaster situations and logical data corruption, such as an administrator accidentally deleting a bucket or a bug in the application. Backups can also be used to copy buckets for development, test, or staging purposes. Couchbase Server supports zero-downtime backups and restores. Backups can be full, differential, or cumulative, and a backup can be restored to any bucket or cluster topology.
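
For illustration, the sketch below drives the cbbackup and cbrestore command-line tools from a small Python script; the flags, paths, and credentials shown are placeholders and may differ between Couchbase Server versions.

    # Sketch: driving cbbackup / cbrestore from Python.
    # Cluster address, backup directory, bucket names, and credentials are
    # illustrative placeholders.
    import subprocess

    CLUSTER = "http://cb-node1.example.com:8091"
    BACKUP_DIR = "/backups/couchbase"
    AUTH = ["-u", "Administrator", "-p", "password"]

    # Full backup of the cluster (use -m diff or -m accu for differential or
    # cumulative backups after an initial full backup).
    subprocess.run(["cbbackup", CLUSTER, BACKUP_DIR, "-m", "full", *AUTH],
                   check=True)

    # Restore one bucket from the backup, optionally into a different bucket.
    subprocess.run(["cbrestore", BACKUP_DIR, CLUSTER,
                    "-b", "orders", "-B", "orders-staging", *AUTH],
                   check=True)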

For information on how to perform backup and restore, see Backup and Restore in the Administration guide.

Resilient applications

Application design can also improve resilience in the face of temporary or longer-lasting interruptions in node availability. For example, applications can be programmed to read from replicas, with timeouts set as low as 5 milliseconds. Similarly, writes can be queued within the application and retried once the node has been failed over. When using XDCR, an application can write to any cluster in a master-master replication scheme, and the changes will be replicated to the other clusters.
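
A minimal sketch of these patterns using a Couchbase Python SDK 3.x-style API is shown below; class and method names differ between SDK versions, and the host, bucket name, and credentials are placeholders.

    # Sketch: fall back to a replica read on a short timeout, and queue failed
    # writes for retry after failover. API names vary by SDK version.
    from datetime import timedelta
    from couchbase.cluster import Cluster
    from couchbase.options import ClusterOptions, GetOptions
    from couchbase.auth import PasswordAuthenticator
    from couchbase.exceptions import CouchbaseException

    cluster = Cluster("couchbase://cb-node1.example.com",
                      ClusterOptions(PasswordAuthenticator("app_user", "password")))
    coll = cluster.bucket("orders").default_collection()

    def read_with_replica_fallback(key):
        try:
            # Aggressive timeout so the replica fallback kicks in quickly.
            return coll.get(key, GetOptions(timeout=timedelta(milliseconds=5))).content_as[dict]
        except CouchbaseException:
            # Active copy unreachable: serve a (possibly stale) replica copy.
            return coll.get_any_replica(key).content_as[dict]

    retry_queue = []

    def write_or_queue(key, doc):
        try:
            coll.upsert(key, doc)
        except CouchbaseException:
            # Node is down: queue the write and retry once failover completes.
            retry_queue.append((key, doc))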