Indexing JSON Documents and Query Performance
Creating the right index - with the right keys, in the right order, and using the right expressions, is critical to query performance in any database system. This is true for Couchbase as well. This topic provides an overview of the indexing options for JSON in Couchbase, which in turn would help query for data efficiently and improve query performance.
- Standard global secondary index
- Memory-optimized global secondary index
The standard global secondary index stores uses the ForestDB storage engine to store the B-Tree index and keeps the optimal working set of data in the buffer. This means, the total size of the index can be much bigger than the amount of memory available in each index node.
A memory-optimized index uses a lock-free skiplist to maintain the index and keeps all the index data in memory. A memory-optimized index has better latency for index scans and processes the mutations of the data much faster.
Both standard and memory-optimized indexes implement multi-version concurrency control (MVCC) to provide consistent index scan results and high throughput.
During cluster installation, you can choose the type of index you'd like to use in the cluster.
The examples in this topic use the travel-sample dataset which is shipped with Couchbase Server. Ensure that you've installed the travel-sample bucket. For instructions on how to install the sample bucket, see Sample Buckets.
SELECT meta().id, travel
FROM `travel-sample` travel
WHERE type = 'airline'
LIMIT 1;
[
{
"id": "airline_10",
"travel": {
"callsign": "MILE-AIR",
"country": "United States",
"iata": "Q5",
"icao": "MLA",
"id": 10,
"name": "40-Mile Air",
"type": "airline"
}
}
]
Primary Index
CREATE PRIMARY INDEX ON `travel-sample`;
The Couchbase data layer enforces the uniqueness constraint on the document key. The primary index, like every other index, is maintained asynchronously. Use the primary index for full bucket scans (primary scans) when the query does not have any filters (predicates) or when no other index or access path can be used.
SELECT * FROM system:indexes WHERE name = '#primary';
[
"indexes": {
"datastore_id": "http://127.0.0.1:8091",
"id": "f6e3c75d6f396e7d",
"index_key": [],
"is_primary": true,
"keyspace_id": "travel-sample",
"name": "#primary",
"namespace_id": "default",
"state": "online",
"using": "gsi"
}
]
The metadata provides additional information about the index like where does the index reside (datastore_id), its state (state), and the indexing method used (using).
Named Primary Index
CREATE PRIMARY INDEX `def_primary` ON `travel-sample`;
The rest of the features of the primary index are the same, except that this index is named. The advantage of naming a primary index is that you can have multiple primary indexes in the system. Duplicate indexes help with high availability as well as query load distribution between the indexes. This is true for both primary indexes and secondary indexes.
SELECT meta().id AS documentkey, `travel-sample` airline
FROM `travel-sample`
WHERE type = 'airline'
LIMIT 1;
[
{
"airline": {
"callsign": "MILE-AIR",
"country": "United States",
"iata": "Q5",
"icao": "MLA",
"id": 10,
"name": "40-Mile Air",
"type": "airline"
},
"documentkey": "airline_10"
}
]
Secondary Index
The secondary index is an index on any key-value or document-key. This index can use any key within the document and the key can be of any type: scalar, object, or array. The query has to use the same type of object for the query engine to use the index.
CREATE INDEX travel_name ON `travel-sample`(name);
# Name is a simple scalar value such as:
# { "name": "Air France" }
CREATE INDEX travel_geo on `travel-sample`(geo);
# get is an object embedded within the document such as:
# "geo": {
# "alt": 12,
# "lat": 50.962097,
# "lon": 1.954764
# }
CREATE INDEX travel_geo on `travel-sample`(geo.alt);
CREATE INDEX travel_geo on `travel-sample`(geo.lat);
# Schedule is an array of objects with flight details. This command indexes the
# complete array and is useful only if you're looking for the entire array.
CREATE INDEX travel_schedule ON `travel-sample`(schedule);
Example:
"schedule": [
{
"day": 0,
"flight": "AF198",
"utc": "10:13:00"
},
{
"day": 0,
"flight": "AF547",
"utc": "19:14:00"
},
{
"day": 0,
"flight": "AF943",
"utc": "01:31:00"
},
{
"day": 1,
"flight": "AF356",
"utc": "12:40:00"
},
{
"day": 1,
"flight": "AF480",
"utc": "08:58:00"
},
{
"day": 1,
"flight": "AF250",
"utc": "12:59:00"
}
]
Composite Secondary Index
CREATE INDEX travel_info ON `travel-sample`(name,type,id,icoo,iata);
Each of the keys can be a simple scalar field, object, or an array. For the index filtering to be exploited, the filters have to use respective object type in the query filter.
The keys to the secondary indexes can include document keys (meta().id) explicitly if you need to filter on the document keys in the index.
Functional Index
CREATE INDEX travel_cxname ON `travel-sample`(LOWER(name));
EXPLAIN SELECT * FROM `travel-sample` WHERE LOWER(name) = "john";
{
"#operator": "IndexScan",
"index": "travel_cxname",
"index_id": "2f39d3b7aac6bbfe",
"keyspace": "travel-sample",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"\"john\""
],
"Inclusion": 3,
"Low": [
"\"john\""
]
}
}
],
CREATE INDEX travel_cx1 ON `travel-sample`(LOWER(name), length*width, round(salary));
Array Index
schedule:
[
{
"day" : 0,
"special_flights" :
[
{
"flight" : "AI111", "utc" : "1:11:11"
},
{
"flight" : "AI222", "utc" : "2:22:22"
}
]
},
{
"day": 1,
"flight": "AF552",
"utc": "14:41:00"
}
]
With a rich structure as seen in the array schedule, here's how you index a particular array or a field within the sub-object.
CREATE INDEX travel_sched ON `travel-sample`
(ALL DISTINCT ARRAY v.day FOR v IN schedule END);
- schedule - the array we’re dereferencing into.
- v - the variable implicitly declared to reference each element/object within the array: schedule
- v.day - the element within each object of the array schedule.
EXPLAIN SELECT * FROM `travel-sample`
WHERE ANY v IN SCHEDULE SATISFIES v.day = 2 END;
{
"#operator": "DistinctScan",
"scan": {
"#operator": "IndexScan",
"index": "travel_sched",
"index_id": "db7018bff5f10f17",
"keyspace": "travel-sample",
"namespace": "default",
"spans": [
{
"Range": {
"High": [
"2"
],
"Inclusion": 3,
"Low": [
"2"
]
}
}
],
"using": "gsi"
}
Because the key is a generalized expression, it provides the flexibility to apply additional logic and processing on the data before indexing. For example, you can create functional indexing on elements of each array. As you're referencing individual fields of the object or element within the array, the index creation, size, and search are efficient.
CREATE INDEX travel_sched ON `travel-sample`
(ALL ARRAY v.day FOR v IN schedule END);
Partial Index
{
"airline": {
"callsign": "MILE-AIR",
"country": "United States",
"iata": "Q5",
"icao": "MLA",
"id": 10,
"name": "40-Mile Air",
"type": "airline"
},
"documentkey": "airline_10"
}
Also, since Couchbase data model is JSON and JSON schema is flexible, an index may not contain entries to documents with absent index keys.
CREATE INDEX travel_info ON `travel-sample`(name, id, icoo, iata)
WHERE type='airline';
This creates an index only on documents that have type='airline'. The queries must include the filter type='airline' in addition to other filters for this index to qualify.
- Partitioning a large index into multiple indexes using the mod function.
- Partitioning a large index into multiple indexes and placing each index into distinct indexer nodes.
- Partitioning the index based on a list of values. For example, you can have an index for each state.
- Simulating index range partitioning via a range filter in the WHERE clause. Note that N1QL queries use one partitioned index per query block. Use UNION ALL to have a query exploit multiple partitioned indexes in a single query.
Duplicate Index
CREATE INDEX i1 ON `travel-sample`(LOWER(name),id, icoo)
WHERE type = 'airline';
CREATE INDEX i2 ON `travel-sample`(LOWER(name),id, icoo)
WHERE type = 'airline';
CREATE INDEX i3 ON `travel-sample`(LOWER(name),id, icoo)
WHERE type = 'airline';
All three indexes have identical keys and an identical WHERE clause; the only difference is the name of these indexes. You can choose their physical location using the WITH clause of the CREATE INDEX statement.
During query optimization, the query engine chooses one of the index names as seen in the explain plan. During query execution, these indexes are used in a round-robin fashion to distribute the load. Thus providing scale-out, multi-dimensional scaling, performance, and high availability.
Covering Index
Index selection for a query solely depends on the filters in the WHERE clause of your query. After the index selection is made, the query engine analyzes the query to see if it can be answered using only the data in the index. If it does, the query engine skips retrieving the whole document from the data nodes. This is a performance optimization to keep in mind when designing your indexes. For more information about covering indexes, see here.