Array Indexing

Array Indexing

Couchbase 4.5 adds the capability to create global indexes on array elements and optimizes the execution of queries involving array elements. This is a huge leap from the previous versions where secondary indexes could only be created and subsequently queried on whole arrays. You can now create an index of array elements ranging from plain scalar values to complex arrays or JSON objects nested deeper in the array.

N1QL Syntax

CREATE INDEX [ index_name ] 
        ON named_keyspace_ref ( index_key1, index_key2 | index_key3, ... )
        [ WHERE filter_expr ]
        [ USING GSI ]
        [ WITH { "nodes": [ "node_name" ], 
                "defer_build": true | false
              }
        ];
index_name
Specify a unique name to identify the index.
named_keyspace_ref
named_keyspace_ref ::= [ namespace_name : ] keyspace_name

Specify the name of the keyspace or bucket to create an index on

index_key
index_key ::= expr | array_expr
Refers to the an attribute name or a scalar function or an ARRAY expression on the attribute. This constitutes an index-key for the index.
expr
A N1QL expression over any fields in the document. This cannot use constant expressions, aggregate functions, or sub-queries.
array_expr
array_expr ::= full_array_expr | simple_array_expr
full_array_expr ::=
    (ALL | DISTINCT) ARRAY var_expr 
        FOR var1 ( IN | WITHIN ) expr1
        [ , var2 ( IN | WITHIN ) expr2 ]
        [ ( WHEN condition ) ] END
simple_array_expr ::=
    (ALL | DISTINCT) expr

The ARRAY operator lets you map and filter the elements or attributes of a collection, object, or objects. It evaluates to an array of the operand expression that satisfies the WHEN clause, if specified.

var_expr ::= function(var1, var2, ...)

  • The expression is a function of the variables var1, var2, etc. used in the FOR clause, and the expressions expr1, expr2, etc. evaluate to an array of objects, elements of which are represented by the variables var1, var2, etc. respectively.
  • The var_expr itself can be an array_expr, allowing nested array expression. This enables creating array indexes on nested array fields. See examples below.
simple_array_expr

When using simple_array_expr, only an array field name or an expression expr that can evaluate to an array can be specified. In this case, all elements of the array are indexed.

Note: Currently, array indexing is limited to using only one index key with array_expr. Furthermore, the array_expr must be the first index-key for the index to be leverages for a SELECT statement with an UNNEST clause on the array. To create an array index involving multiple array elements or multiple arrays:
  • The var_expr can be constructed as a compound object constituted with different elements of the same array or multiple arrays.
  • Subsequent SELECT or DML must use similar compound objects in the WHERE clause to use the array index. See examples below.
Note: Couchbase 4.6.2 release introduces the following enhancements to array indexing:
  • Allow arbitrary variable names in array index selection. That is, a SELECT query or DML that needs to use the array index can use different variable names in the query from those used in the array index definition. In earlier releases, the variable names must exactly match. See 4.5 documentation for details.
  • Support simple_array_expr, which provides simpler syntax for array indexing when all array elements are indexed as is, without requiring to use the ARRAY operator in the index definition.
Note: The variables used in query predicates (WHERE clauses) of subsequent SELECT/UPDATE/UPSERT/DELETE statements must be the same as those used in the above array_expr. See Format of Query Predicate in SELECT, UPDATE, or DELETE for details.
filter_expr
Specifies WHERE clause predicates to qualify the subset of documents to include in the index.
USING GSI
USING clause specifies the index type to use.
WITH options
Use the WITH clause to specify additional options and is fully supported with nodes and defer_build expressions.

See the CREATE INDEX statement for more details on the syntax.

Format of Query Predicate in SELECT, UPDATE, or DELETE

The query predicate, which must appear in the WHERE clause of a SELECT, UPDATE, or DELETE statement, must have the exact matching format as the variable in the array index key, including the name of the variable x.

Consider the following expressions used in a CREATE INDEX statement:
(a) DISTINCT ARRAY f(x) FOR x IN expr1 END;

(b) DISTINCT ARRAY f(x) FOR x WITHIN expr1 END;
And the expression used in the SELECT where-clause as:
(c) ANY x IN expr2 SATISFIES g(x) END;

(d) ANY x WITHIN expr2 SATISFIES g(x) END
The following dependencies must be satisfied in Couchbase Server 4.5 for N1QL to consider the array index:
  • The index keys used in CREATE INDEX must be used in the where-clause.
  • expr2 in (c) and (d) must be equivalent to expr1 in (a) and (b). This is a formal notion of equivalence. For example, if they are the same expressions, or equivalent arithmetic expressions such as (x+y) and (y+x).
  • The variable x in (c) and (d) must be exactly the same variable name as x in (a) and (b).
  • g(x) in (c) and (d) must be sargable for f(x) in (a) and (b). In other words, if there was a scalar index with key f(x), that index would be applicable to the predicate g(x). For example, the index key UPPER(x) is sargable for the predicate UPPER(x) LIKE "John%".
  • IN vs. WITHIN: Index key (a) can be used for query predicate (c). Index key (b) can be used for both query predicates (c) and (d).
Note: Index key (b) is strictly more expensive than index key (a), for both index maintenance and query processing. Index key (b) and query predicate (d) are very powerful. They can efficiently index and query recursive trees of arbitrary depth.

Examples

The following samples use the travel-sample bucket that is shipped with the product.

Example 1: Indexing all DISTINCT elements in an array

C1: Create an index on all schedules:
CREATE INDEX idx_sched 
ON `travel-sample` ( DISTINCT ARRAY v.flight FOR v IN schedule END );
Q1: The following query finds the list of scheduled 'UA' flights:
SELECT * FROM `travel-sample`
WHERE ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END;

Example 2: Partial index (with WHERE clause) of individual attributes from selected elements (using WHEN clause) of an array:

C2: Create an index on flight IDs scheduled in the first 4 days of the week:
CREATE INDEX idx_flight_day 
ON `travel-sample` ( ALL ARRAY v.flight FOR v IN schedule WHEN v.day < 4 END )
WHERE type = "route" ;
Q2: The following query finds the list of scheduled 'UA' flights on day 1:
SELECT * FROM `travel-sample`
WHERE type = "route"
AND ANY v IN schedule SATISFIES (v.flight LIKE 'UA%') AND (v.day=1) END;
Note: The index C2 qualifies for the query Q2 because:
  • Q2 predicate type = "route" matches that of the partial index WHERE clause.
  • The ANY operator uses the index key v.flight on which the index C2 is defined.
  • The ANY-SATISFIES condition v.day=1 in Q2 is sargable to that in the index definition WHEN clause v.day<4.

Example 3: Compound array index with individual elements of an array and other non-array fields

C3: Create an index on scheduled flight IDs and number of stops:
CREATE INDEX idx_flight_stops 
ON `travel-sample` 
    ( stops, DISTINCT ARRAY v.flight FOR v IN schedule END )
WHERE type = "route" ;
Q3: The following query finds the list of scheduled 'UA' flights that have one or more stops:
SELECT * FROM `travel-sample`
WHERE type = "route"
AND stops >=0
AND ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END;  

Example 4: Indexing the individual elements of nest arrays

Use the DISTINCT ARRAY clause in a nested fashion to index specific attributes of a document when the array contains other arrays or documents that contain arrays. For example,
cbq> UPDATE `travel-sample` 
     SET schedule[0] = {"day" : 7, "special_flights" : 
                    [ {"flight" : "AI444", "utc" : "4:44:44"}, 
                      {"flight" : "AI333", "utc" : "3:33:33"} 
                    ] } 
     WHERE type = "route" 
     AND destinationairport = "CDG" AND sourceairport = "TLV";
C4: The following creates a partial index on a nested array special_flights:
CREATE INDEX idx_nested ON `travel-sample`
    (DISTINCT ARRAY
        (DISTINCT ARRAY y.flight FOR y IN x.special_flights END)
    FOR x IN schedule END) 
WHERE type = "route";
Note: In this case, the inner ARRAY construct (in bold) is used as the var_expr for the outer ARRAY construct in the N1QL Syntax above.

Q4: The following query uses nested ANY operator to use the index:

SELECT count(*) FROM `travel-sample`
WHERE type = "route"
AND ANY x in schedule SATISFIES
    (ANY y in x.special_flights SATISFIES y.flight IS NOT NULL END)
END;

Q4A: The following query uses UNNEST operators to use the index:

SELECT count(*) FROM `travel-sample`
UNNEST schedule AS x
UNNEST x.special_flights AS y
WHERE type = "route"
AND y.flight IS NOT NULL;
Example 5: Array Index with multiple elements of an array

C5: Create an index on flight and day fields in schedule:

CREATE INDEX idx_flight_day ON `travel-sample`
    ( DISTINCT ARRAY [v.flight, v.day] FOR v IN schedule END)
WHERE type = "route" ;

Q5: The following query finds the list of scheduled 'US681' flights on day 2:

SELECT meta().id FROM `travel-sample`
WHERE type = "route"
AND ANY v in schedule SATISFIES [v.flight, v.day] = ["US681", 2] END;
Example 6: Indexing all elements in an array using simplified syntax

C6: Create an index on all schedules using simplified array index syntax:

CREATE INDEX idx_sched_simple
ON `travel-sample` (ALL schedule)
WHERE type = "route";

Q6: The following query finds details of all route documents matching a specific schedule. Note that elements of schedule array are objects, and hence the right side value of the predicate condition should be similarly structured object.

SELECT * FROM `travel-sample`
WHERE type = "route"
AND ANY v IN schedule 
SATISFIES v = {"day":2, "flight": "US681", "utc": "19:20:00"} END;

Q6A: This is a variant of Q6 using the UNNEST in the SELECT statement. The following query finds details of all route documents matching a specific schedule.

SELECT * FROM `travel-sample` t
UNNEST schedule sch
WHERE t.type = "route"
AND sch = {"day":2, "flight": "US681", "utc": "19:20:00"};

Covering Array Index

Covering indexes is an efficient method of using an Index for a particular query, whereby the index itself can completely cover the query in terms of providing all data required for the query. Basically, it avoids the fetch phase of the query processing and related overhead in fetching the required documents from data-service nodes. See more details here.

Array indexing requires special attention to create covered array indexes. In general, the array field itself should be included as one of the index keys in the CREATE INDEX definition. For example, the index C1 does not cover the query Q1 because the Q1 projection list includes * which needs to fetch the document from the Data Service. The following Q7 is covered by index C7:

C7: Creating a Covered Array Index.

CREATE INDEX idx_sched_covered ON `travel-sample`
   ( DISTINCT ARRAY v.flight FOR v IN schedule END, schedule)
WHERE type = "route";
Q7: Covered Array Index using the ANY clause.
EXPLAIN SELECT meta().id FROM `travel-sample`
USE INDEX (idx_sched_covered) WHERE type = "route"
AND ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END;

Result:
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                 "cover ((DISTINCT (ARRAY (`v`.`flight`) FOR `v`
                    IN (`travel-sample`.`schedule`) END)))",
                 "cover ((`travel-sample`.`schedule`))",
                 "cover ((meta(`travel-sample`).`id`))"
              ],
              "filter_covers": {
                "cover ((`travel-sample`.`type`))": "route", "cover (any `v` IN (`travel-sample`.`schedule`) SATISFIES ((\"UA" <= (`v`.`flight`)) AND ((`v`.`flight`) < \"UB\")) END)": true, "cover (ANY `v` IN (`travel-sample`.`schedule`) SATISFIES ((`v`.`flights`) LIKE \"UA%\" END)": true
              },
              "index": "idx_sched_covered",
       ...
Note: The query Q7 needs index C7 to cover it because the query predicate refers to the array schedule in the ANY operator.
Note: The index keys of an index must be used in the WHERE clause of a DML to use the index for that query. In the SELECT or DML WHERE clause, Covered Array Indexes can be used by the following operators:
  • ANY: As shown in query Q7.
  • ANY AND EVERY: As shown in query Q7A (a variant of Example Q7).

Q7A: Covered Array Index using the ANY AND EVERY clause.

EXPLAIN SELECT meta().id FROM `travel-sample`
USE INDEX (idx_sched_covered)
WHERE type = "route"
AND ANY AND EVERY v IN schedule SATISFIES v.flight LIKE 'UA%' END;
 
Result:
   {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                 "cover ((DISTINCT (ARRAY (`v`.`flight`) FOR `v`
                    IN (`travel-sample`.`schedule`) END)))",
                 "cover ((`travel-sample`.`schedule`))",
                 "cover ((meta(`travel-sample`).`id`))"
              ],
              "filter_covers": {
        "cover ((`travel-sample`.`type`))": "route",
              },
              "index": "idx_sched_covered",
       ...

Q7B: Covered Array Index using the UNNEST clause and aliasing.

EXPLAIN SELECT meta().id FROM `travel-sample` t
USE INDEX (idx_sched_covered)
UNNEST schedule v WHERE travel-sample.type = "route" AND v.flight LIKE 'UA%';

Result:
    { 
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                 "cover ((DISTINCT (ARRAY (`v`.`flight`) FOR `v`
                    IN (`t`.`schedule`) END)))",
                 "cover ((`t`.`schedule`))",
                 "cover ((meta(`t`).`id`))"
              ],
              "filter_covers": {
                "cover ((`t`.`type`))": "route",
              },
              "index": "idx_sched_covered",
       ...
Note: The Q7 Examples have the following limitation: the collection operator EVERY cannot use array indexes or covered array indexes because the EVERY operator needs to apply the SATISFIES predicate to all elements in the array, including the case where an array has zero elements. As items cannot be indexed, it is not possible to index MISSING items, so the EVERY operator is evaluated in the N1QL engine and cannot leverage the array index scan. For example, the following query Q7C uses the non-array index def_type ignoring the USE INDEX hint to use the array indexes (note that query C7 defines a DISTINCT array index while C7C defines an ALL array index, and both are ignored).

C7C: Non-array index with an ALL array index.

CREATE INDEX idx_sched_covered_all ON `travel-sample`
   ( ALL ARRAY v.flight FOR v IN schedule END, schedule)
WHERE type = "route";

Q7C: Non-array index with an ALL array index.

EXPLAIN SELECT meta().id FROM `travel-sample`
USE INDEX (idx_sched_covered_all, idx_sched_covered)
WHERE type = "route" 
AND EVERY v IN schedule SATISFIES v.flight LIKE 'UA%' END;

Result:
{
  "plan": {
     "#operator": "Sequence",
     "~children": [
       {
         "#operator": "IndexScan2",
         "index": "def_type",
         ...

Implicit Covered Array Index

N1QL supports simplified Implicit Covered Array Index syntax in certain cases where the mandatory array index-key requirement is relaxed to create a covering array-index. This special optimization applies to those queries and DML which have WHERE clause predicates that can be exactly and completely pushed to the indexer during the array index scan. For example:

  • ANY operator with an =, <, >, and LIKE predicate in the SATISFIES clause. Not that the GSI indexes are tree structures that support exact match and range matches. And the ANY predicate returns true as long as it finds at least one matching item in the index. Hence, an item found in the index can cover the query. Furthermore, this is covered by both ALL and DISTINCT array indexes.
    C8: Creating an Implicit Covered Array Index with DISTINCT.
    CREATE INDEX idx_sched_covered_simple ON `travel-sample`
      ( DISTINCT ARRAY v.flight FOR v IN schedule END)
    WHERE type = "route";
    Q8: Implicit Covered Array Index using the ANY clause.
    EXPLAIN SELECT meta().id FROM `travel-sample`
    USE INDEX (idx_sched_covered_simple)
    WHERE type = "route"
    AND ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END;
    
    Result:
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "DistinctScan",
            "scan": {
              "#operator": "IndexScan2",
              "covers": [
                "cover ((DISTINCT (ARRAY (`v`.`flight`) FOR `v` 
                       IN (`travel-sample`.`schedule`) END)))",
                "cover ((meta(`travel-sample`).`id`))"
              ],
              "filter_covers": {
                "cover ((`travel-sample`.`type`))": "route",
                "cover (any `v` in (`travel-sample`.`schedule`)
                       SATISFIES ((\"UA\" <= (`v`.`flight`)) 
                       AND ((`v`.`flight`) < \"UB\")) END)": true,
                "cover (any `v` in (`travel-sample`.`schedule`)
                       SATISFIES ((`v`.`flight`) LIKE \"UA%\") END)": true
              },
              ...
  • UNNEST operator with =, <, >, or LIKE predicate in the WHERE clause. This applies to only ALL array indexes because, for such index, all array elements are indexed in the array index, and the UNNEST operation needs all the elements to reconstruct the array. Note that the array cannot be reconstructed if on DISTINCT elements of the array are indexed.

    For example, the following query Q8A can be covered with the ALL index idx_sched_covered_simple_all in C8A, but Q8B is not covered when using the DISTINCT index idx_sched_covered_simple defined in C8.

    C8A: UNNEST covered with the ALL index.
    CREATE INDEX idx_sched_covered_simple_all ON `travel-sample`
      ( ALL ARRAY v.flight FOR v IN schedule END)
    WHERE type = "route";

    Q8A: UNNEST not covered when using the DISTINCT index.

    EXPLAIN SELECT meta(t).id FROM `travel-sample` t
    USE INDEX (idx_sched_covered_simple_all)
    UNNEST schedule v
    WHERE t.type = "route"
    AND v.flight LIKE 'UA%';
    
    Result:
    {
      "plan": {
        "#operator": "Sequence",
        "~children": [
          {
            "#operator": "IndexScan2",
            "covers": [
              "cover ((`v`.`flight`))",
              "cover ((meta(`t`).`id`))"
            ],
            "filter_covers": {
              "cover (((`t`.`schedule`) < {}))": true,
              "cover (([] <= (`t`.`schedule`)))": true,
              "cover ((`t`.`type`))": "route",
              "cover (is_array((`t`.`schedule`)))": true
            },
            "index": "idx_sched_covered_simple_all",
            "index_id": "623509c163434cd5",
            "keyspace": "travel-sample",
            "namespace": "default",
            "spans": [
              {
                "exact": true,
                "range": [
                  {
                    "high": "\"UB\"",
                    "inclusion": 1,
                    "low": "\"UA\""
                  }
                ]
              }
            ],
            "using": "gsi"
          }
    ...

Limitations

Let's use the following sample document with Doc_Id "foo" to explain the limitations:
"foo": {
          "a":1,
          "b":[1,2],
          "c":{"ca":[1,2,3], "cb":4},  
          "d":[{"da":5,"db":6},
          {"da":7,"db":8}],
          "e":[{"ea":9,"eb":[10,11,12]},
          {"ea":13,"eb":[14,NULL,16]}],
          "f":[[17,18],
          [19,20,21]]
          }   
  • Covering indexes with indexed arrays do not cover queries where the array needs to be reconstructed in full form, with duplicates and position of each element placed correctly in the projection.
    Supported Example


    SELECT a 
             FROM default 
             WHERE ANY i IN b SATISFIES i < 5 END;


    SELECT ARRAY_DISTINCT(b) 
            FROM default 
            WHERE a = 5;


    SELECT b,a 
           FROM default;
  • Indexed arrays do not maintain duplicate elements of an array or the position of the elements within an array in the GSI array index. This means that GSI array indexes do not cover expressions that reference the array attribute itself. For example, the following statements are not supported:
    SELECT b FROM default;
    SELECT b[*] FROM default;
    SELECT b[1] FROM default;
  • Array indexes only support ANY, ANY AND EVERY, ARRAY, DISTINCT, IN, WITHIN, and UNNEST operators. Other operators such as ALL and EVERY are not supported.
    Note: EVERY operator evaluates to true for arrays with zero elements, whereas ANY AND EVERY evaluates to true when the array has at least one matching element.
  • The total size of the array index keys cannot exceed 10K for a single document. The array index key size is calculated using the total size of all array elements being indexed in a single document. If the total array index key size exceeds 10K in a single document, the items are skipped. The following error is logged to indicate that an item is skipped when building the index: "Encoded array key is too long" in the indexer.log file. The indexer.log file is included in cbcollect_info output. For example, the array key size for the following index is calculated by adding all the elements in this list : "{[1,1], [1,2]}". You can contact Couchbase Support for details on how to change the limit on array index key size.
    CREATE INDEX i1 on default(a,ARRAY x FOR x IN b END) USING GSI;

Summary

The following table summarizes N1QL-supported collection operators in the DML WHERE clause for different kinds of array index features:
Table 1. N1QL-supported collection operators
Operator in the SELECT/DML WHERE clause Array index with same variable names in Index definition and DML Array index with arbitrary Variable names in Index definition and DML Covered Array Index (with explicit array index-key) Implicit Covered Array Index (without explicit array index-key)
ANY ✓ (both ALL & DISTINCT) ✓ (both ALL & DISTINCT) ✓ (both ALL & DISTINCT) ✓ (both ALL & DISTINCT)
UNNEST ✓ (only ALL, with array as leading index-key) ✓ (only ALL, with array as leading index-key) ✓ (only ALL, with array as leading index-key)
ANY AND EVERY ✓ (both ALL & DISTINCT) ✓ (both ALL & DISTINCT) ✓ (both ALL & DISTINCT)
EVERY