Cardinality and selectivity play a crucial role in index tuning and optimization as they can provide measurable insights into your data set and effectiveness of the index, pointing you to where specific optimizations can be made.
Cardinality refers to the individual uniqueness of values in a specific index key. Each index key (document property) emitted into the index will have varying degrees of cardinality. Cardinality can be broken down into roughly 3 different types:
Selectivity is the measure of variation in unique values in a given data set and it is represented as a number between 0 - 1
or 0 - 100%
. The formula to calculate selectivity can be represented as follows:
selectivity = cardinality/(number of records) * 100
or more simply stated:
Number of Distinct Values / Total number of Records = Selectivity
Cardinality and Selectivity can be applied to any "data set" such as an index, query or bucket. In general for database indexes, the higher cardinality -> better selectivity -> faster scans -> increased performance. Consider the table below:
Name | Breed | Gender | Origin Country | |
---|---|---|---|---|
1 | Oakley | German Shepherd | M | Germany |
2 | Zeus | Doberman Pinscher | M | Germany |
3 | Darby | Doberman Pinscher | F | Germany |
4 | Rocky | Bulldog | M | United Kingdom |
5 | Lucy | Labrador Retriever | F | Canada |
6 | Buddy | Golden Retriever | M | United Kingdom |
7 | Molly | Pug | F | China |
8 | Sadie | Labrador Retriever | F | Canada |
9 | Max | Boxer | M | Germany |
10 | Simba | Great Dane | M | Germany |
Cardinality | 10 | 6 | 2 | 4 |
Selectivity | 100% | 60% | 20% | 40% |
Using the travel-sample
bucket, we'll calculate two selectivity values for some of the sample indexes:
WHERE
predicate and contain the leading field. This is often referred to as "index segmentation".For optimum performance, you will want a relatively low percentage of Projection Selectivity as this means the index is smaller, and a higher value for Index selectivity as this means there is a lot of uniqueness within the index.
Initially, we need to get the total # of documents in the bucket, as we will reuse this value in all of our calculations:
SELECT RAW COUNT(1)
FROM `travel-sample`
[31591]
def_type
indexCREATE INDEX `def_type` ON `travel-sample`(`type`)
Determine the total number of records in the index, this query will push the COUNT()
down to the indexer, and we trigger the use of the index by referencing the first field in the index. If needed you could optionally specify a USE INDEX()
statement to ensure the index is used:
SELECT COUNT(1)
FROM `travel-sample`
WHERE type IS NOT MISSING
[31591]
Next, we need to determine the total number of possible unique values in the index:
SELECT RAW COUNT(DISTINCT type)
FROM `travel-sample`
WHERE type IS NOT MISSING
[5]
Description | Formula | Selectivity |
---|---|---|
Projection Selectivity | (31591 / 31591) * 100 | 100% |
Index Selectivity | (5 / 31591) * 100 | 0.015% |
def_faa
indexCREATE INDEX `def_faa` ON `travel-sample`(`faa`)
Determine the total number of records in the index:
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE faa IS NOT MISSING
[1968]
Next, we need to determine the total number of possible unique values in the index:
SELECT RAW COUNT(DISTINCT faa)
FROM `travel-sample`
WHERE faa IS NOT MISSING
[1708]
Description | Formula | Selectivity |
---|---|---|
Projection Selectivity | (1968 / 31591) * 100 | 6.23% |
Index Selectivity | (1708 / 1968) * 100 | 86.79% |
def_country
indexCREATE INDEX `def_country` ON `travel-sample`(`country`, `type`)
Determine the total number of records in the index:
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE country IS NOT MISSING
[7567]
Next, we need to determine the total number of possible unique values in the index. For this example, however, there are two index keys country
and type
. The selectivity depends on how the index will be used and when optimizing it is important to understand how the cardinality of one key can affect the other.
SELECT *
FROM `travel-sample`
WHERE country = 'United States'
SELECT RAW COUNT(DISTINCT country)
FROM `travel-sample`
WHERE country IS NOT MISSING
[3]
SELECT *
FROM `travel-sample`
WHERE country = 'United States' AND type = 'landmark'
When both keys are used, the selectivity can be described in two ways, the first is the total uniqueness of both keys when combined together:
SELECT RAW COUNT(DISTINCT country || type)
FROM `travel-sample`
WHERE country IS NOT MISSING
[12]
The second is # of unique keys for the second index key which matches the previous index key:
SELECT RAW COUNT(1)
FROM `travel-sample`
WHERE country = 'United States'
[3948]
SELECT RAW COUNT(DISTINCT type)
FROM `travel-sample`
WHERE country = 'United States'
[4]
Description | Formula | Selectivity |
---|---|---|
Projection Selectivity | (7567 / 31591) * 100 | 23.95% |
Index Selectivity (3.a) | (3 / 7567) * 100 | 0.039% |
Index Selectivity (3.b) | (12 / 7567) * 100 | 0.16% |
Index Selectivity (3.c) | (4 / 3948) * 100 | 0.10% |
Both cardinality and selectivity can affect the performance of IndexScans, and you should always consider their implications as it relates to your access patterns and query predicates. Having a solid understanding of cardinality and selectivity as it relates to your data set can provide solid guidance in the tuning and determining the order of index keys within the index.