INFER

INFER

Couchbase Server 4.5 introduces INFER, a N1QL statement that infers the metadata of documents in a keyspace, for example the structure of documents, data types of various attributes, sample values, and so on. Since a keyspace or bucket can contain documents with varying structures, the INFER statement is statistical in nature rather than deterministic. You can specify the sample size that must be used to analyze and identify the structure of documents in a keyspace or bucket. The INFER statement is only supported in the Enterprise Edition.

Note: The describe statement introduced in the Couchbase Server 4.1 release has been renamed to INFER.
The statement returns the output in the JSON Schema draft v4 format as specified by json-schema.org. It supports the following data types: array, boolean, integer, null, number, and object. For each identified attribute, the statement returns the following details:
  • #docs: Specifies the number of documents in the sample that contain this attribute.
  • %docs: Specifies the percentage of documents in the sample that contain this attribute.
  • minitems: If the data type is an array, specifies the minimum number of elements (array size).
  • maxitems: If the data type is an array, specifies the maximum number of elements (array size).
  • samples: Displays a list of sample values for the attribute found in the sample population.
  • type: Specifies the identified data type of the attribute.

The Query Workbench in the Couchbase Web console (available under the Query tab) uses the INFER statement to display the structure of documents in the Bucket Analysis area when you expand the keyspace or bucket name.

Syntax
INFER <keyspace_ref> [ WITH { "parameter" : value, … } ]
  • keyspace_ref: Specifies a simple or fully-qualified keyspace name.
  • parameter: Specifies one or more of the following comma separated parameters to guide the INFER statement.
    parameter ::= [ "sample_size" : <value> ] 
                  [ , "num_sample_values" : <value> ] 
                  [ , "similarity_metric" : <value> ]
                  [ , "dictionary_threshold" : <value> ]
  • sample_size: Specifies the number of documents to randomly sample in the keyspace or bucket when inferring the schema. The default sample size is 1000 documents. If a bucket contains fewer documents than the specified sample_size, then all the documents in the bucket will be used.
  • num_sample_values: Specifies the number of sample values for each attributes to be returned. The sample values provide examples of the data format. The default value is 5.
  • similarity_metric: The schema inferencing process groups similar schemas into document flavors. Similarity-metric is the degree of similarity that two schemas must have to be considered part of the same flavor. You can specify a real number between 0 and 1 indicating the percentage match (of attributes) required to establish similarity between two documents. The default value is 0.6, which means two documents are considered similar if 60% of the attributes at the top-level attributes are the same.
  • dictionary_threshold: Sometimes JSON documents follow the dictionary pattern, where a field has sub-fields that are key-value pairs, instead of general field-name and value pairs. For example, one of the Couchbase training modules has music tracks with a sub-document called "ratings" where each rating has the user ID paired with the object:
       "ratings": {
              "brambliertypo75631": {
                "created": 1439939260000,
                "rating": 1
              },
              "croakerraisiny16166": {
                "created": 1440066307000,
                "rating": 3
              },
              "libidinizeddepleting17126": {
                "created": 1439991036000,
                "rating": 1
              },
              "lightnots66650": {
                "created": 1440204913000,
                "rating": 1
              },
          },

    While this pattern may not be ideal for a number of reasons, if your data follows this pattern it might seem that the data has a huge number of ‘fields’, since a data value is being used as a field name. When the schema inferencing process sees more than dictionary_threshold fields with different names, but the same sub-document schema, it collapses them into a single schema field marked as a dictionary.

Example
INFER `beer-sample` WITH {"sample_size":10000,"num_sample_values":1,"similarity_metric":0.0};
[
    [ 
        {
            "#docs": 823,
            "$schema": "http://json-schema.org/schema#",
            "Flavor": "type = \"beer\"",
            "properties": {
                "abv": {
                    "#docs": 823,
                    "%docs": 100,
                    "samples": [
                        0,
                        9,
                        9.5,
                        8,
                        7.7
                    ],
                    "type": "number"
                },
                "brewery_id": {
                    "#docs": 823,
                    "%docs": 100,
                    "samples": [
                        "san_diego_brewing",
                        "drake_s_brewing",
                        "brouwerij_de_achelse_kluis",
                        "niagara_falls_brewing",
                        "brasserie_des_cimes"
                    ],
                    "type": "string"
                  },
                  "category": {
                      "#docs": 612,
                      "%docs": 74.36,
                      "samples": [
                          "North American Ale",
                          "British Ale",
                          "German Lager",
                          "Belgian and French Ale",
                          "Irish Ale"
                      ],
                      "type": "string"
                  },
                  "description": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          "Robust, Dark and Smooth, ho...",
                          "\"Pride of Milford\" is a very s...",
                          "Mogul is a complex blend of 5 ...",
                          "Just like our Porter but multi...",
                          ""
                      ],
                      "type": "string"
                  },
                  "ibu": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          0,
                          55,
                          35,
                          20
                      ],
                      "type": "number"
                  },
                  "name": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          "Old 395 Barleywine",
                          "Jolly Roger",
                          "Trappist Extra",
                          "Maple Wheat",
                          "Yeti"
                      ],
                      "type": "string"
                  },
                  "srm": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          0,
                          6,
                          45,
                          30
                      ],
                      "type": "number"
                  },
                  "style": {
                      "#docs": 612,
                      "%docs": 74.36,
                      "samples": [
                          "American-Style Pale Ale",
                          "Classic English-Style Pale Ale",
                          "American-Style India Pale Ale",
                          "German-Style Pilsener",
                          "Other Belgian-Style Ales"
                      ],
                      "type": "string"
                  },
                  "type": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          "beer"
                      ],
                      "type": "string"
                  },
                  "upc": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          0,
                          2147483647
                      ],
                      "type": "number"
                  },
                  "updated": {
                      "#docs": 823,
                      "%docs": 100,
                      "samples": [
                          "2010-07-22 20:00:20",
                          "2010-12-13 19:33:36",
                          "2011-05-17 03:27:08",
                          "2011-04-17 12:25:31",
                          "2011-04-17 12:33:50"
                      ],
                      "type": "string"
                  }
              }
          },
          {
              "#docs": 177,
              "$schema": "http://json-schema.org/schema#",
              "Flavor": "type = \"brewery\"",
              "properties": {
                  ...
              }
          }
      ]
]