Bloom Filters

Bloom Filters

A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. This means a query returns either "possibly in set" or "definitely not in set". It is a bit array with a predefined size that is calculated based on the expected number of items and the probability of false positives or the probability of finding a key that doesn't exist. Bloom filter significantly improves the performance of full ejection scenarios and XDCR.

In the full ejection mode, the key and metadata are evicted along with the value. Therefore, if a key is non resident, there is no way of knowing if a key exists or not, without accessing the disk. In such a scenario, if a client issues a lot of GETs on keys that may not even exist in server, Bloom filters help eliminate many unnecessary disk accesses. Similarly for XDCR, when we set up remote replication to a brand new cluster, we would be able to avoid many unnecessary GetMeta-disk-fetches with the help of the bloom filter.

With Bloom filters, the probability of false positives decreases as the size of the array increases and increases as the number of inserted elements increases. Based on the algorithm that takes into account the number of keys and the probability of false positives, you can estimate the size of the Bloom filter and the number of bits to store each key.

For value eviction only the deleted keys will be stored, while in case of full eviction both the deleted keys and non-resident items will be stored.

The algorithm calculates the almost exact probability of false positives, including the number of hash functions (k), size of the bit array (m), and the number of inserted elements (n):

    
      k = m/n (ln 2)
You can expect an increase in memory usage or memory overhead while using the Bloom filter:
Table 1. Memory overhead for Bloom filter use
False positive probability 0.01 0.05
Estimated number of keys 10,000.000 (=> =10,000 keys per vBucket) 10,000.000 (=> =10,000 keys per vBucket)
Number of bits per key in the filter 7 bits 4 bits
Size of the Bloom filter to fit the estimated keys with desired false positive probability 95851 bits (=> =12 KB per vBucket) (=> =12 MB for 1024 vBuckets) 62353 bits (=> =8 KB per vBucket) (=> =8 MB for 1024 vBuckets)

In a case of full eviction, you will not know whether an item exists in the memory until you perform a background fetch. Therefore, use of the Bloom filter helps to avoid unnecessary background fetches and improves latency.

For more information about working set management and eviction, see Database Engine Architecture