If you know the Document Id/Key of the document, you should leverage the USE KEYS clause. This bypasses the Index Service (hence bypassing network, scans, index results and processing) - it's the closest thing you have on the N1QL side comparable to a KV fetch. The optimizer will use a KeyScan when you use the USE KEYS instead of an Index or Primary Scan. When the key is known and returning the entire document is required, always preference USE KEYS
over a META().id
index scan.
SELECT * FROM `travel-sample` USE KEYS ["landmark_37588"];
You can specify multiple values in USE KEYS
if you are querying for multiple documents.
SELECT * FROM `travel-sample` USE KEYS ["landmark_37588", "landmark_37603" ];
JOIN operations are also done using the document keys.
SELECT * FROM ORDERS o INNER JOIN CUSTOMER c ON KEYS o.id;
SELECT * FROM ORDERS o USE KEYS ["ord:382"] INNER JOIN CUSTOMER c ON KEYS o.id;
As this query is performing a KV GET() via N1QL, it would be even more performant to bypass N1QL altogether and issue a KV GET() directly via the SDK. bucket.get("landmark_37588")
Note that when selecting specific field(s), a covered index scan may be faster than performing the data service fetch when working with large documents or a long list of keys.
Remove any index keys/expressions that are listed in both the index and the indexes WHERE
statement as an equality predicate. These values would have zero cardinality and do not need to be indexed as they would result in slower IndexScans, they simply need to prevent documents who do not satisfy the condition as true from being added to the index. The query/index services are intelligent enough to understand a query and automatically cover values that are present as an equality predicate in the index.
Consider the following query:
SELECT userId, firstName, lastName
FROM ecommerce
WHERE docType = "user" AND username = "johnsmith21"
Now consider the following indexes, all of which will satisfy the query above:
CREATE INDEX idx_usernames ON ecommerce(docType, username)
This index emits both docType
and username
into the index. Not only is username
more unique, but you would expect that this index only contained just "user" documents. However, this index would contain an entry for every single document where the document had a docType
property, regardless of whether or not it has a username
property as only the leading key needs to qualify. Think of the leading key having a WHERE docType IS NOT MISSING
statement.
CREATE INDEX idx_usernames ON ecommerce(username, docType)
This index would successfully limit the indexes scope to just documents that contained a username
property, but again this is not guaranteed. we could filter the index by adding a WHERE docType = "user"
predicate:
CREATE INDEX idx_usernames ON ecommerce(username, docType)
WHERE docType = "user"
However, by having the docType
emitted into the index when it will only be a single value, it is just wasted bytes.
CREATE INDEX idx_usernames ON ecommerce(username)
WHERE docType = "user"
This is the best choice of all of the indexes above, as it will filter out any documents that do not satisfy WHERE docType = "user"
and only index the remaining documents that contain a username
property.
Often times referred to as "Partial Index", filtered indexes are an index on a subset of documents in the keyspace which are relevant to the query being executed. The result is a smaller index, which results in faster scans, yielding faster response times.
CREATE INDEX idx_cx3 ON `travel-sample` (state, city, name.lastname)
WHERE type = 'hotel'
CREATE INDEX idx_cx4 ON customer (state, city, name.lastname)
WHERE type = 'hotel' and country = 'United States' AND ratings > 2
The order of index keys, as well as the cardinality (uniqueness) of the values for a specific key/expression, can have a dramatic affect on query performance. Index keys should be first ordered based on the queries predicate types in the following order:
For keys who share the same predicate type, cardinality comes into play. As a general rule of thumb, order keys from left to right based on highest cardinality (most unique) to lowest cardinality (least unique) when they share the same predicate type. Note that the query predicates do not need to be listed in the order in which they are listed in the index, the query planner determines this automatically.
Consider the following query and index:
SELECT cid, address
FROM customer
WHERE type = 'premium'
AND state = 'CA'
AND zipcode IN [29482, 29284, 29482, 28472]
AND salary < 50000
AND age > 45
CREATE INDEX idx_orders ON customer(state, zipcode, salary, age, address, cid)
WHERE type = 'premium'
Even though zipcode
is more unique than state
, it is being queried using an IN
statement which is the equivalent of an OR
. It is more efficient to reduce the index entries first by state
then by zipcode
, however if our primary access pattern was zipcode = 29482
, then we would want to list it first.
Each index stores data pre-sorted by the index keys, matching the keys in the ORDER BY and leading N keys will avoid sorting. When exploiting the index order, the index keys should be added after all predicates and before any additional values that may be used to cover the query.
Take a scenario where you want to retrieve the order history for a given user, consider the following index and query:
CREATE INDEX `idx_order_history` ON `ecommerce` (
userId, orderDate, orderTotal, orderId
)
WHERE docType = "order"
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = "order" AND userId = 123
ORDER BY orderDate DESC
If you examine the EXPLAIN
plan for the query near the bottom you will see:
{
"#operator": "Order",
"sort_terms": [
{
"desc": true,
"expr": "cover ((`customer`.`orderDate`))"
}
]
}
The Order operator takes all of the documents from the previous operator (IndexScan) and must loop over all of the records and sort them in-memory based on the ORDER BY
statement. Data is pre-sorted ASC
by default, with the previous index the ASC
is implied but would look like:
CREATE INDEX `idx_order_history2` ON `customer` (
userId ASC, orderDate ASC, orderTotal ASC, orderId ASC
)
WHERE docType = "order"
Knowing that we want our result ordered by the orderDate DESC
, we can inform the indexer to store the data in the order in which we will use it:
CREATE INDEX `idx_order_history_sorted` ON `ecommerce` (
userId, orderDate DESC, orderTotal, orderId
)
WHERE docType = "order"
Now if we issue an EXPLAIN
on the query, you will see the "Order" operator is missing as it is not needed since the result is already in the appropriate order.
Covering Indexes are indexes which contain all of the query predicates (WHERE
), all of the returning values (SELECT ...
) and any other processed attributes. When the index contains all of these values, it "covers" the query and the Query service does not need to go-to the Data service to obtain values for those fields. Covering indexes make queries efficient since it bypasses the "FETCH" from the Data service saving a significant amount of data transfer and processing. Both the Final Project and Filtering can be "covered" if an index is created appropriately.
Consider the following index:
CREATE INDEX idx_cx3 ON customer(state, city, name.lastname)
WHERE status = 'premium'
The following diagram illustrates the query execution workflow for a query that is not "covered":
SELECT *
FROM customer
WHERE state = 'CA' AND status = 'premium'
When performing a SELECT *
, a data service fetch will always be required, and the query cannot be covered.
The following diagram illustrates the query execution workflow where the query is covered:
SELECT status, state, city
FROM customer
WHERE state = 'CA' AND status = 'premium'
As you can see in the second diagram, a well-designed query that uses a covering index avoids the additional steps to fetch the data from the data service. This results in a considerable performance improvement.
We can verify that a query is "covered" by reviewing the explain plan:
{
...
"~children": [
{
"#operator": "IndexScan3",
"covers": [
"cover ((`customer`.`state`))",
"cover ((`customer`.`city`))",
"cover (((`customer`.`name`).`lastname`))",
"cover ((meta(`customer`).`id`))"
],
"filter_covers": {
"cover ((`customer`.`status`))": "premium"
},
...
}
]
}
It is important to note that a "FETCH" operation is not necessarily a bad thing, and every query does not always need to be "covered". Covering indexes offer the performance benefit of having all of the data in a single place and avoid a scatter-gather (i.e. fetches), but a covering index could potentially be fairly large and should be properly sized. A query which performs an IndexScan and returns a few records then performs a fetch, might be perfectly acceptable and meet SLAs. However, a query that returns tens or hundreds of thousands of rows and has to perform a fetch, could have a dramatic effect on system performance.
Unlike relational databases, Couchbase and N1QL do not require a primary index as long as the query has a viable secondary index to access the data.
A primary index scan is analogous to a full table scan in relational database systems. However, where a table scan stops at the "table", a primary index scan will perform the equivalent of a full database scan. N1QL will retrieve every document ID in the entire keyspace from the primary index, then fetch each document in the bucket and finally performing any predicate-join-project processing.
When a primary index is present, even though it may never be "intended" to be used, it acts as a fallback for any query who does not have a satisfying index at execution time. Think failure scenarios, if the queries qualifying index(es) go offline for whatever reason (i.e. node failure) and a primary index exists, it will be used, which could cause unintended and unexpected side-effects.
A keyspace (bucket) in Couchbase is a logical namespace for one or more types of documents. Standard practice in Couchbase is to ensure that each document has a common docType
, type
, _class
, etc. property that identifies the model/purpose of the document. This property is not only a good practice it allows for efficient indexing (i.e. Partial Index).
While filtering on the docType
is strongly encouraged as it creates a partial index on a subset of documents, creating an index on the docType
as a leading or stand-alone index key is inefficient and has extremely low cardinality.
CREATE INDEX `idx_docType` ON `bucket` (`docType`)
Indexes like this can inadvertently cause slow-performing queries and unexpected explain plans. For example, it increases the likelihood of an IntersectScan. More importantly though, if you've established the use of a docType
property on each of your documents, more than likely developers writing N1QL statements will add that as a predicate to each of their queries, i.e.:
SELECT *
FROM bucket
WHERE docType = "user" AND username = "johnsmith32"
Just as a primary index can be inadvertently used as a fallback when expected indexes are not found and cause potentially a high number of "FETCH" operations, a docType
index can do exactly the same thing and this is why it should be avoided.
When an index is created, the entire index only exists on a single node (replicas excluded). If the index grows in size and can no longer fit into memory in the case of MOI indexes, or the resident ratio drops low in the case of Standard GSI, you have to either add more resources to the node or you can partition the index.
The process of index partitioning involves distributing a single large index across multiple index nodes in the cluster.
Prior to Couchbase Server 5.5 this was achieved by creating multiple smaller range partitioned or partial indexes. The following query is our example:
SELECT userId, firstName, lastName
FROM ecommerce
WHERE docType = 'user' AND LOWER(username) = 'johnsmith32'
The following is the initial index that is used to cover the query:
CREATE INDEX idx_users ON ecommerce(LOWER(username), userId, firstName, lastName)
WHERE docType = 'user'
Over time the index grows in size and might not be performing as expected, can no longer fit on a single node, etc. and needs to be partitioned across multiple nodes:
CREATE INDEX idx_users_AtoM ON ecommerce(LOWER(username), userId, firstName, lastName)
WHERE docType = 'user' AND LOWER(username) >= 'a' AND LOWER(username) < 'n'
WITH { "nodes": ["index_host1"]}
CREATE INDEX idx_users_NtoZ ON ecommerce(LOWER(username), userId, firstName, lastName)
WHERE docType = 'user' AND LOWER(username) >= 'n' AND LOWER(username) < '['
WITH { "nodes": ["index_host2"]}
The same query above works without any changes. Indexes can be partitioned in an infinite number of ways. As of Couchbase Server 5.5+ you can define a single index, the partitioning strategy/keys as well as the number of partitions (16 by default) to create and we'll manage the distribution across the cluster automatically for you. Index Partitioning can be implemented by specifying the [PARTITION BY] clause.
CREATE INDEX idx_users ON ecommerce(LOWER(username), userId, firstName, lastName)
WHERE docType = 'user'
PARTITION BY HASH(LOWER(username))
WITH { "num_partitions": 10 }
Index partitioning has the benefit of "Partition Elimination", where the query planner understands that WHERE ... LOWER(username) = 'johnsmith32'
, is an equality and it knows which of the 10 partitions that indexed value lives on, and will scan just that single partition. Be sure to evaluate the key/expressions that you're using for partitioning as it can have an impact on the scans.
Many developers or frameworks will use SELECT * FROM ...
as much simpler and less work than typing out individual property SELECT orderId, orderDate, orderTotal FROM ...
. However, in general, this is not a wise thing to do regardless of the database being used, and in some situations, it can have serious performance implications.
Using SELECT *
returns all of the properties for each document. Oftentimes a query does not actually need all of the properties, only a select few. This causes unnecessary I/O, resulting in a larger payload size that must be transferred across the wire back to the application, where a smaller payload is more performant.
The most important reason to not use SELECT *
is you are limiting what the query optimizer can do to pick more appropriate indexes (i.e. a covering index). When SELECT *
is used, no matter what, 100% of the time a "FETCH" operation will be performed and it is impossible to further optimize the query without changing the underlying code / N1QL statement.
Consider the following scenario, where you want to retrieve all of the airlines for a given country and use the id
, name
and callsign
properties. Initially, our index might look like:
CREATE INDEX `idx_airline_country` ON `travel-sample` (country)
WHERE type = 'airline'
Now compare the following queries:
SELECT *
FROM `travel-sample`
WHERE type = 'airline' AND country = 'United States'
SELECT id, name, callsign
FROM `travel-sample`
WHERE type = 'airline' AND country = 'United States'
Both queries ultimately achieve the result and based on our initial index both queries would perform a "FETCH" operation, because of *
in query one, and because query two contains properties that are not in the index. Hypothetically let's say that this query is performed 10 times per second, the index may be perfectly fine and meet SLAs.
You have appropriate monitoring and profiling in place. Over time our application becomes more and more popular, and our query is now being performed 1,000 times per second. Originally, both queries returned 127 records, and a "FETCH" was being performed where 127 docs * 10qps = 1,270 get ops/sec
, but is now 127 docs * 1000qps = 127,000 get ops/sec
. This is causing system performance issues and you realize that improvements need to be made and the query needs a covering index, the profile of the second query tells you exactly what fields need to be covered and you create the following index in production:
CREATE INDEX `idx_airline_country_cvr` ON `travel-sample` (country, id, name, callsign)
WHERE type = 'airline'
If you were using the second query, the number of gets/second would go from 127,000
to 0
and you were able to achieve this without a single code change or deployment to the application. However, if you were using the first query this index would have not offered any performance benefit, the application code would have to be updated and then deployed to see the performance gains.
When specified the USE INDEX clause hints to the query optimizer that the index(es) listed should be preferred. The query optimizer is a rule-based optimizer, and will generally pick the most appropriate index to satisfy a query. Additionally, with every release of Couchbase, there are improvements made to the optimizer, making it more efficient.
If a USE INDEX(...)
is specified it couples code and operations. Meaning the index that is specified, cannot be dropped without a code change, as the code is expecting the index to be there and if it is not this would result in an error. Moreover, you cannot optimize the index that is referenced without dropping it first which would result in a period of downtime while the index is being built and the creation of a more optimized index wouldn't yield any benefit either as it would not be preferred.
The use of the USE INDEX()
statement can be beneficial with the elimination of IntersectScans but should be done with caution for the reasons listed above.
The
USE INDEX(...)
clause can accept a comma-delimited list of indexes.
Optimizing pagination queries is usually the most critical part of tuning. Exploiting index ordering is even more beneficial when paginating. Both OFFSET and LIMIT are attempted to be pushed down to the indexer whenever possible, this depends on a few different factors:
WHERE
) can be pushed down to a single index (i.e. all index keys exist)ORDER BY
clause is used, the index keys must be in the same orderJOIN
clauseIf all of these are true then the LIMIT
and OFFSET
are pushed to the indexer, otherwise they are applied by the query service after all IndexScans are performed. Take the following index for example:
CREATE INDEX idx_products ON ecommerce (
productCategory, productName, productPrice, productId
)
WHERE docType = 'product'
This query retrieves all of the products sorted by the productName
property and exploits the index order so LIMIT
and OFFSET
are pushed down to the indexer.
SELECT productId, productName, productPrice
FROM ecommerce
WHERE docType = 'product' AND productCategory = "Electronics"
ORDER BY productName ASC
LIMIT 100
OFFSET 300
This can be verified by examining the EXPLAIN
plan of the query, in the IndexScan3
#operator will show both of the "limit" and "offset" properties, indicating that they were pushed down to the indexer.
{
"#operator": "IndexScan3",
...
"index": "idx_products",
"limit": "100",
"offset": "300",
...
}
Now if we adjust the query to sort by productPrice DESC
.
SELECT productId, productName, productPrice
FROM ecommerce
WHERE docType = 'product' AND productCategory = "Electronics"
ORDER BY productPrice DESC
LIMIT 100
OFFSET 300
The query is still covered, as before but now the "Limit" and "Offset" are pushed to the bottom of the Sequence and are the last thing to happen before the final projection:
[
...
{
"#operator": "Order",
"limit": "100",
"offset": "300",
"sort_terms": [{
"desc": true,
"expr": "cover ((`customer`.`productPrice`))"
}]
}, {
"#operator": "Offset",
"expr": "300"
}, {
"#operator": "Limit",
"expr": "100"
}
]
LIMIT
and OFFSET
are generally the go to for database pagination and Couchbase Server has many optimizations that make these operations really fast. There is one drawback to limit/offset pagination in any database and that is the greater the offset, the longer the initial index scan has to traverse the index before the limit can be implied. Oftentimes this is negligible, however, if you want to squeeze every last bit of performance out of pagination there is another approach called KeySet Pagination.
It is the responsibility of the application to sanitize and inspect dynamic/input data before sending it to the database. If an application uses this input to dynamically construct a query, it is opening the database to SQL injection attacks.
def airports_in_city(city):
query_string = "SELECT airportname FROM `travel-sample` WHERE city="
query_string += '"' + city + '"'
return cb.n1ql_query(query_string)
This is insecure as any value or N1QL statement could be passed as city
. N1QL allows the use of placeholders to declare dynamic query parameters. Query parameters (named or positional) allow your application to securely use dynamic query arguments for your application.
Implement named or positional parameters for all dynamic query arguments.
def airports_in_city(city):
query_string = "SELECT airportname FROM `travel-sample` WHERE city=$1"
query = N1QLQuery(query_string, city)
return cb.n1ql_query(query)
Not only is this more secure, but it also simplifies query profiling and tuning, while the same statement is issued with different parameters, it can be profiled as the same.
It can be an easy habit to get into of optimizing every query and have a 1:1 ratio for the query to index. This is not necessary and can be avoided by expecting common leading keys of various indexes and combining multiple indexes into a single index that can service multiple queries. Take the following queries an example:
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = 'orders'
AND billing.country = 'US'
CREATE INDEX idx_orders_country ON ecommerce (billing.country)
WHERE docType = 'orders'
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = 'orders'
AND billing.country = 'US'
AND billing.state = 'CA'
CREATE INDEX idx_orders_state_country ON ecommerce (
billing.state, billing.country
)
WHERE docType = 'orders'
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = 'orders'
AND billing.country = 'US'
AND billing.state = 'CA'
AND orderTotal >= 1000
CREATE INDEX idx_orders_country ON ecommerce (
billing.state, billing.country, orderTotal
)
WHERE docType = 'orders'
All of these indexes can be combined into a single index. It should be pointed out in this example, billing.country
is the first index key, as the query that uses it expects billing.country
to be the leading key of the index, this may or may not have an effect on SLAs and should be tested as billing.state
has a higher cardinality than billing.country
but may be negligible.
CREATE INDEX idx_orders_country ON ecommerce (
billing.country, billing.state, orderTotal
)
WHERE docType = 'order'
When a N1QL statement is sent to the server, the Query service will inspect and parse the string, determining which indexes to query, ultimately defining a Query Plan to optimally satisfy the statement. The computation for the plan adds some additional processing time and overhead for the query.
Often-used queries can be prepared so that the computed plan is generated only once. Subsequent queries using the same query string will use the pre-generated plan instead, saving on the overhead and processing of the plan each time. Parameterized queries are considered the same query for caching and planning purposes, even if the supplied parameters are different.
There are two approaches to implementing prepared statements, choose one that best fits your environment.
This method of prepared statements sets the adhoc = false
option for a given query. The SDK will internally prepare the statement and store the plan in an internal cache specific to that SDK instance. After the statement has been initially prepared the first time, subsequent calls to the same statement will pass the prepared plan to the Query service, eliminating the inspection, parsing, and planning steps and start executing immediately.
query = N1QLQuery("SELECT airportname FROM `travel-sample` WHERE country=$1", "USA")
q.adhoc = False
This method of prepared statements is similar to the previous option but instead of the SDK managing the planning of the queries, it would be managed by the application or through an external process.
First, the query has to be prepared by using a PREPARE Statement
PREPARE unique_name_for_query FROM
SELECT airportname FROM `travel-sample` WHERE country=$1
Once the query has been prepared, the query plan is cached in the Query service and can be executed by using an EXECUTE Statement.
query = N1QLQuery("EXECUTE unique_name_for_query", "USA")
The tradeoff is there is a single cached plan that can be used by many clients, however, the lifecycle of the plan must be maintained. For example, if the plan doesn't exist an error will be thrown, that would need to be trapped and then re-prepare the query and execute again.
An IntersectScan is when two or more indexes are used to satisfy a query. This results in two or more separate IndexScan operations, each returning back qualifying results (i.e. meta().id
) and then intersecting the scans together only returning results that are present in both. Consider the following query and indexes:
SELECT *
FROM `travel-sample`
WHERE type = "landmark" AND activity = "drink" AND country = "France"
CREATE INDEX `idx_landmark_activity` ON `travel-sample` (activity)
WHERE type = "landmark"
CREATE INDEX `idx_landmark_country` ON `travel-sample` (country)
WHERE type = "landmark"
If you execute this query it will run in ~100ms, examining the plan text shows that an IntersectScan is performed, and an IndexScan is run on each of the above indexes. The scan on idx_landmark_country
returns 388 results and the scan on idx_landmark_activity
returns 287 results, both of these are passed to the IntersectScan operator for a total of 675 records and then filters that down to 388 results.
{
"~children": [{
"#operator": "IntersectScan",
"#stats": {
"#itemsIn": 675,
"#itemsOut": 388,
...
},
"scans": [{
"#operator": "IndexScan3",
"#stats": {
"#itemsOut": 388,
...
},
"index": "idx_landmark_country",
},
{
"#operator": "IndexScan3",
"#stats": {
"#itemsOut": 287,
},
"index": "idx_landmark_activity",
}
]
}]
}
Alternatively, if a composite index is used instead, this will result in a single IndexScan operation and be more performant. The same query using the index below will execute in ~17ms.
CREATE INDEX `idx_landmark_activity_country` ON `travel-sample` (activity, country)
WHERE type = "landmark"
In general, a single wide index (composite index) which meets the criteria will be more performant than intersections on multiple singular indexes. Only consider intersection when the predicate usage is non-deterministic.
Oftentimes we need to find partial matches within a given index key and use a LIKE
statement. This is useful and convenient syntax, however, this performs a range scan that depending on the use of %
can be a range of the entire index and result in a lot of extra processing. For this example, we'll use the following index and base query:
CREATE INDEX idx_landmark_names ON `travel-sample` (name)
WHERE type = "landmark"
SELECT name
FROM `travel-sample`
WHERE type = "landmark"
AND name LIKE '%Theater%'
Execution Time: ~190ms
Now let's examine the various spans associated with the LIKE
statement:
Statement | Low | High | Inclusion |
---|---|---|---|
'%Theater%' | "" | "[]" | 1 |
'Theater%' | "Theater" | "Theates" | 1 |
'%Theater' | "" | "[]" | 1 |
Clearly, the second option 'Theater%'
offers the more performant range scan as it is a targeted subset of the index key instead of the entire index. Optimizing your LIKE
queries to only match on the righthand side offers some performance benefit, but this is not always a possibility.
A powerful feature of N1QL is that it can index individual array elements, not just individual scalar properties. While name
is a simple string, we can use any of the available string functions to convert the string into an array (SPLIT, TOKENS, SUFFIXES, etc.) For this example, we'll use the SUFFIXES()
function.
SELECT RAW SUFFIXES("Clay Theater")
[
[
"Clay Theater",
"lay Theater",
"ay Theater",
"y Theater",
" Theater",
"Theater",
"heater",
"eater",
"ater",
"ter",
"er",
"r"
]
]
As this returns all possible suffixes of a given string as an array, we can effectively index this array and now remove the left-hand %
from our query providing a more efficient range scan.
CREATE INDEX idx_landmark_names_suffixes ON `travel-sample` (
DISTINCT ARRAY v
FOR v IN SUFFIXES(LOWER(name))
END
)
WHERE type = 'landmark'
SELECT name
FROM `travel-sample`
WHERE type = "landmark"
AND ANY v IN SUFFIXES(LOWER(name))
SATISFIES v LIKE 'theater%'
END
Execution Time: ~17ms
This query is now 11 times faster than the original. You should always consider the size of an array index, as in this case the larger the string the larger the array for each item and the larger the index size. Reference
Many times we need to write a query that can satisfy "this OR that" and return results from either the left-hand or right-hand side of the OR
. Lets take a common example where a user needs to login to an application with their "username" OR "email".
SELECT userId, pwd, firstName, lastName
FROM ecommerce
WHERE docType = 'user'
AND (
username = 'johns'
OR
email = 'johns'
)
And we'll start with the following indexes:
CREATE INDEX `idx_username` ON `ecommerce` (username)
WHERE docType = 'user'
CREATE INDEX `idx_email` ON `ecommerce` (email)
WHERE docType = 'user'
Viewing the explain plan of this query shows that a UnionScan
is performed using two separate IndexScan
operations on the idx_username
and idx_email
. A simple approach, in this case, would be to have the application construct two separate queries and inspect the input prior to issuing the query as an email pattern is trivial to validate, which would eliminate the UnionScan
and result in a single IndexScan
. This might not always be possible, so next, you might drop the previous indexes and attempt to create a single index to cover both username
and email
address:
CREATE INDEX `idx_username_email` ON `ecommerce` (username, email)
WHERE docType = 'user'
But the query would fail, because a UnionScan
is still attempted, it is just two scans of the same index. It would need to be rewritten as follows for the same index to use:
SELECT userId, pwd, firstName, lastName
FROM ecommerce
WHERE docType = 'user'
AND (
username = 'johns'
OR
username IS NOT MISSING AND email = 'johns'
)
This is not an optimum approach either. While we only maintain a single index it results in a UnionScan
and the range on the first index key is scanned completely.
Array indexes are very powerful and provide optimized execution of queries when array elements are used, something that is not possible with traditional databases. In this case, our data is not an array, but we can index it as one by creating a functional array.
CREATE INDEX idx_username_email_arr ON `ecommerce` (
DISTINCT ARRAY v
FOR v IN [LOWER(username), LOWER(email)]
END
)
WHERE docType = 'user'
SELECT userId, pwd, firstName, lastName
FROM ecommerce
WHERE docType = 'user'
AND ANY v IN [LOWER(username), LOWER(email)]
SATISFIES v = 'johns'
END
The explain plan verifies that only a single DistinctScan
is performed against our index. This query would only expect a single result but is performing a "FETCH" as the fields userId, pwd, firstName, lastName
are not in the index. If we wanted to squeeze every last bit of performance we could cover the query with the index:
CREATE INDEX idx_username_email_arr_cvr ON `ecommerce` (
DISTINCT ARRAY v
FOR v IN [LOWER(username), LOWER(email)]
END,
userId, username, pwd, firstName, lastName
)
WHERE docType = 'user'
Equality predicates are preferred and more performant than range scans, as a range is bounded by a low and high value, and the performance of the scan depends on how wide the range scan is. As an example, almost every application works with dates in some form or fashion, and are typically stored in either ISO-8601 (i.e. "2019-01-15T10:42:23Z") or Epoch time (i.e. "1547548943000") formats. Consider the following query and index to find all of the orders on a specific day:
CREATE INDEX idx_orderDate ON `ecommerce` (orderDate)
WHERE docType = "order"
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = "order"
AND orderDate >= "2019-01-15T00:00:00"
AND orderDate < "2019-01-16"
This results in a range scan:
{
"range": [
{
"high": "\"2019-01-16\"",
"inclusion": 1,
"low": "\"2019-01-15T00:00:00\""
}
]
}
However, if we create a functional index we can use a single equality predicate i.e. WHERE orderDate = "2019-01-15"
CREATE INDEX idx_orderDate_date ON `ecommerce` (SPLIT(orderDate, "T")[0])
WHERE docType = "order"
SELECT orderId, orderDate, orderTotal
FROM ecommerce
WHERE docType = "order"
AND SPLIT(orderDate, "T")[0] = "2019-01-15"
Indexes can (and should) be replicated across cluster-nodes. This ensures:
Prior to Couchbase Server 5.0, the only way to have "replica" indexes was to have the same index definition, but with a different name and manually place the indexes on different hosts.
CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
WHERE type="product"
WITH { "nodes": ["host1"] }
CREATE INDEX productName_index2 ON bucket_name(productName, ProductID)
WHERE type="product"
WITH { "nodes": ["host2"] }
As Couchbase Server 5.0+ index replicas are specified by using the WITH
clause, simply specify the num_replica
value.
CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
WHERE type="product"
WITH { "num_replica": 2 };
The only requirement is that the number of nodes in the cluster running the Index service is greater than or equal to {num_replica} + 1
. Replicas can also be created by specifying the destination nodes of the index:
CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
WHERE type="product"
WITH { "nodes": ["node1:8091", "node2:8091", "node3:8091"] }
Additionally, both num_replica
and nodes
can be specified as long as num_replica
is equal to the length of the nodes
array + 1.
CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
WHERE type="product"
WITH { "num_replica": 2, "nodes": ["node1:8091", "node2:8091", "node3:8091"] }
Whenever a CREATE INDEX
statement is issued, the default number of index replicas to create is 0
. This value can be changed, such that anytime a CREATE INDEX
is performed there is no need to specify the WITH clause and replicas will be created automatically.
curl \
-u Administrator:password \
-d "{\"indexer.settings.num_replica\": 2 }"\
http://localhost:9102/settings
When a CREATE INDEX
statement is issued, each document in the keyspace must be projected against the index and by default, this is a synchronous operation. Meaning it will block until the index is built 100%, at which point in time the index will be updated asynchronously by any future mutations. This can be cumbersome and time-consuming, especially when managing many indexes.
In Couchbase, there can only be one index build process going on at a time. However, that does not mean that there can only be one index being built at a time. N1QL allows you to define the index but defer the actual building of the index to a later point in time, this is done using the WITH { "defer_build": true }
.
CREATE INDEX `def_sourceairport` ON `travel-sample`(`sourceairport`)
WITH { "defer_build":true }
CREATE INDEX `def_city_state` ON `travel-sample`(`city`, `state`)
WITH { "defer_build":true }
The indexes are in a "Created" state and not eligible to be used until they are built. To build multiple indexes at the same time a BUILD INDEX is used.
BUILD INDEX ON `travel-sample` (def_sourceairport, def_city_state)
The BUILD INDEX
is asynchronous by default and has the primary benefit of allowing each index build to share the same DCP stream, and each document that is projected for the index build only has to be retrieved once instead of once per index.
The Data, Index and Query service all work together to perform N1QL queries and manage indexes. While the bulk of index processing resides with the Index Service, there are two components of indexing that actually reside within the Data Service. These are the "Projector" and "Router" processes, which are responsible for projecting every data mutation against each index on the bucket and communicating those mutations to the "Supervisor" process on the Index Service.
This is roughly the # of documents that would qualify for the index divided by the total # of documents in the bucket.
For example, a bucket with 1,000,000
documents, and 100
of those documents were "config" documents, if we created an index with the predicate of WHERE docType = "config"
the Projection Selectivity is 0.10%. Conversely, if you created an index on just docType
for example and every document in the bucket has docType
the Projection Selectivity is 100%.
Both of these are on the extreme ends of the spectrum, the first will only satisfy projection 0.10% of the time and there would be many wasted CPU cycles on unnecessary projection and you may want to consider an alternative access pattern. Additionally, having a Projection Selectivity of 100% is also not performant as every mutation meets projection and results in an update to the index. There is not a set number, as it will be specific to each data set, but understand the implications of unnecessary projection.
CASE
ExpressionsOften, it is necessary to calculate different aggregates on various sets of documents. Usually, you achieve this goal with multiple scans on the data, but it is easy to calculate all the aggregates with a single scan. Eliminating n-1 scans can greatly improve performance.
You can combine multiple scans into one scan by moving the WHERE
condition of each scan into a CASE expression, which filters the data for the aggregation. For each aggregation, there could be another field that retrieves the data.
The following example asks for the hotels which have free internet or free breakfast or free parking. You can obtain this result by executing three separate queries:
SELECT COUNT(*) FROM `travel-sample` WHERE type="hotel" and free_internet=true
SELECT COUNT(*) FROM `travel-sample` WHERE type="hotel" and free_breakfast=true
SELECT COUNT(*) FROM `travel-sample` WHERE type="hotel" and free_parking=true
Total time = 119.23 + 121.63 + 118.71 = 359.57ms
However, it is more efficient to run the entire query in a single statement. Each number is calculated as one field. The count uses a filter with the CASE
expression to count only the rows where the condition is valid. For example:
SELECT COUNT(CASE WHEN free_internet=true THEN 1 ELSE null END) cnt_free_internet,
COUNT(CASE WHEN free_breakfast=true THEN 1 ELSE null END) cnt_free_breakfast,
COUNT(CASE WHEN free_parking=true THEN 1 ELSE null END) cnt_free_parking
FROM `travel-sample`
WHERE type="hotel"
The query takes 250 ms, shaving off an entire 100ms off the total time taken, not to mention the network calls, processing etc. that it would take while querying it from the application side.
This is a very simple example, larger datasets will show larger variances. There could be ranges involved, the aggregation functions could be different, etc.
Different access paths can determine the optimum structure of your documents. When you are leveraging N1QL, you will want to ensure your documents are "index friendly". Ensure all documents have a consistent docType
(or equivalent) attribute. This allows for efficient filtering during indexing, querying or both.
{
"docType": "user",
...
}
If you're performing some processing on an array in the document, instead of performing that processing in the query consider storing a pre-computed value in another attribute (e.g. total, average, score, etc.)
{
...
"scores": [
78,
97,
23,
...,
43
],
"scoreTotal": 2392,
"scoreAvg": 77.6,
"scoreMin": 23,
"scoreMax": 99
}
If an attribute name is dynamic or unknown, it's OK for KV access but not practical from an indexing standpoint. Below the first example is simply not practical to index or search if that object had every single language listed in there, you would need one index per language.
Bad for Index/Querying:
{
"Greetings": {
"English": "Good Morning",
"Spanish": "Buenos días",
"German": "Guten Morgen",
"French": "Bonjour"
}
}
Good for Index/Querying:
{
"Greetings": [
{
"Language": "English",
"Greeting": "Good Morning"
},
{
"Language": "Spanish",
"Greeting": "Buenos días"
},
{
"Language": "German",
"Greeting": "Guten Morgen"
},
{
"Language": "French",
"Greeting": "Bonjour"
}
]
}
Couchbase buckets are a logical namespace made up of one or more types of documents/models. Understanding these models and their underlying structures can provide additional opportunities for performance and tuning. N1QL supports the INFER statement, that is statistical in nature. It analyzes a bucket through sampling and returns results in JSON Schema format.
INFER `travel-sample` WITH {
"sample_size": 1000,
"num_sample_values": 5,
"similarity_metric": 0.6
}
For each identified attribute, the statement returns the following details:
Attribute | Description |
---|---|
#docs | Specifies the number of documents in the sample that contain this attribute. |
%docs | Specifies the percentage of documents in the sample that contain this attribute. |
minitems | If the data type is an array, specifies the minimum number of elements (array size). |
maxitems | If the data type is an array, specifies the maximum number of elements (array size). |
samples | Displays a list of sample values for the attribute found in the sample population. |
type | Specifies the identified data type of the attribute. |
CAS or Compare-and-Swap values are used as a form of optimistic locking for document mutations. The CAS value can be returned from a N1QL query by calling the META()
function and referencing the META().cas
value.
If you are leveraging CAS operations and attempting to cover your queries, this value will need to be added to the index.
N1QL indexes are updated asynchronously after a mutation has occurred on the data service. While key-value operations against the data service are strongly consistent, and while index updates are extremely fast (usually <1ms), because of the asynchronous nature they are "eventually consistent".
N1QL allows you to specify the Scan Consistency level to use for a query. There are three possible values:
mutationTokens
to be enabled before it can be used. This causes a few extra bytes of information to be sent back to the SDK whenever a mutation occurs. The mutation tokens are then passed to the N1QL query and the query will wait for at least those tokens to be indexed prior to proceeding.// enable mutation tokens
CouchbaseEnvironment env = DefaultCouchbaseEnvironment
.builder()
.mutationTokensEnabled(true)
.build();
// mutate a document and save the mutation
JsonDocument written = bucket.upsert(JsonDocument.create("mydoc", JsonObject.empty()));
// written.mutationToken() ==
// "mt{vbID=55, vbUUID=166779084590420, seqno=488, bucket=travel-sample}"
// pass the mutation to the query
bucket.query(
N1qlQuery.simple("select count(*) as cnt from `travel-sample`",
N1qlParams.build().consistentWith(written))
);
There is a price to pay for using a query scan consistency such as Request+ or Statement+. See if you can achieve the desired result by key-value Access, especially if it's a single document or a handful of documents which can be obtained via a multi-get.
The IN operator specifies the search depth to only include the current-level of the array on which it is operating against. Whereas the WITHIN operator specifies the search depth to include the current level of the array it's operating on and all of the children and descendant arrays indefinitely.
The use of WITHIN
can have unexpected performance consequences if used incorrectly, by recursively iterating through all arrays of a given property.
Active N1QL requests are stored in the system:active_requests
catalog, if the result of the query is greater than 1s
it will be stored in the system:completed_requests
catalog, otherwise it is discarded. If you need to cancel any active requests, you can issue a DELETE
statement against the system:active_requests
keyspace effectively canceling the query.
DELETE
FROM system:active_request
WHERE requestId = "..."
The system:completed_requests
is extremely useful with identifying slow performing and resource intensive queries. As you continue to iterate through each of the queries and optimize them, you no longer want to see that same query in system:completed_requests
. You can delete records from the system:completed_requests
catalog, just as you would with any other keyspace.
DELETE
FROM system:completed_requests
WHERE requestId = "..."
DELETE
FROM system:completed_requests
WHERE statement = "SELECT DISTINCT type\nFROM `travel-sample`"
Designing, building and iterating on indexes against a bucket with a large number of documents can be time-consuming. When initially designing queries and indexes, if possible, execute them against an empty bucket until you're satisfied with the explain plan.
Proactive monitoring is critical to any application, not only should you be actively monitoring indexes and their usage to identify potential growth needs, you should also be monitoring and identifying indexes which are not used and can be dropped.
The clientContextID
is a user-defined identifier query option that can be sent to the query service from the SDK. This is not used by Couchbase for anything, however, it is stored in the system:active_requests
and system:completed_requests
. This can be a true application requestId/genesisId generated by the application, or some type of application identifier that can be later used for debugging or tracking down specific queries related to a given application or request.