cbimport

cbimport is a utility to import data into a Couchbase cluster.

Note: cbimport is a Developer Preview feature. It is considered experimental and its functionality may change.

Synopsis

cbimport [--version] [--help] <command> [<args>]

Description

cbimport is used to import data from various sources into Couchbase. For more information on how specific commands work, run cbimport <command> --help.

Options

Table 1. cbimport options
Option Description

--version Prints the version of the cbimport utility.
--help Prints the synopsis and a list of commands. If a cbimport command is named, this option brings up the manual page for that command.
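
For example, to print the version of the utility, or to bring up the manual page for the json command:
$ cbimport --version
$ cbimport json --help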

Commands

cbimport csv: Imports data into Couchbase from a CSV file.

cbimport json: Imports data into Couchbase from a JSON file.

Discussion

The cbimport command is used to import data from various sources into a Couchbase cluster. Each supported format is a sub-command of the cbimport utility.

cbimport csv

cbimport csv imports CSV data into a Couchbase cluster.

Synopsis

cbimport csv [--cluster <url>] [--bucket <bucket_name>] [--dataset <path>]
          [--username <username>] [--password <password>] [--generate-key <key_expr>]
          [--limit-rows <num>] [--skip-rows <num>] [--field-separator <char>]
          [--cacert <path>] [--no-ssl-verify] [--threads <num>]
          [--errors-log <path>] [--log-file <path>]

Description

Imports CSV and other forms of separated-value data into Couchbase. By default, data files should start with a line containing comma-separated column names, followed by one or more lines of comma-separated values. However, if you are importing data that uses a different field separator, for example tabs, you can use the --field-separator flag to specify that tabs are used instead of commas.

The cbimport command also supports custom key-generation for each document in the imported file. Key generation is done with a combination of pre-existing fields in a document and custom generator functions supplied by cbimport. For details about key generators, see Key Generators.

Options

The following tables list the required and optional parameters for the cbimport csv command.

Table 2. Required parameters for cbimport csv
Option Description
-c,--cluster <url> The host name of a node in the cluster to import data into. See Host Formats for details about host name specification formats.
-u,--username <username> The user name for cluster authentication. The user must have the appropriate privileges to write to the bucket into which the data will be loaded.
-p,--password <password> The password for cluster authentication. The user must have the appropriate privileges to write to the bucket into which the data will be loaded. Specifying this option without a value will allow the user to type a non-echoed password to stdin.
-b,--bucket <bucket_name> The name of the bucket to import data into.
-d,--dataset <uri> The URI of the dataset to be loaded. cbimport supports loading data from a local file or from a URL. When importing data from a local file, the path must be prefixed with file://. When importing from a URL, the URL must be prefixed with either http:// or https://.
Table 3. Optional parameters for cbimport csv
Option Description
-g,--generate-key <key_expr> Specifies a key expression used for generating a key for each document imported. See Key Generators for more information on specifying key generators.
--field-separator <char> Specifies the field separator to use when reading the dataset. By default the separator is a comma. To read tab-separated files, specify a tab in this field. Note that in the bash shell a tab is written as $'\t'.
--limit-rows <num> Specifies that the utility should stop loading data after reading a certain number of rows from the dataset. This option is useful when you have a large dataset and only want to load part of it.
--skip-rows <num> Specifies that the utility should skip some rows before starting to import data. If this flag is used together with the --limit-rows flag, then cbimport imports the number of rows specified by --limit-rows after skipping the rows specified by --skip-rows; see the example after this table.
--no-ssl-verify Skips the SSL verification phase. Specifying this flag will allow a connection using SSL encryption, but will not verify the identity of the server you connect to. You are vulnerable to a man-in-the-middle attack if you use this flag. Either this flag or the --cacert flag must be specified when using an SSL encrypted connection.
--infer-types By default, all values in a CSV file are interpreted as strings. If this flag is set, cbimport looks at each value and decides whether it is a string, integer, or boolean value, and puts the inferred type into the generated document.
--omit-empty Some values in a CSV row will not contain any data. By default these values are put into the generated JSON document as an empty string. Use this flag to omit fields that contain empty values.
--cacert <path> Specifies a CA certificate that will be used to verify the identity of the server being connected to. Either this flag or the --no-ssl-verify flag must be specified when using an SSL encrypted connection.
-t,--threads <num> Specifies the number of concurrent clients to use when importing data. Fewer clients means the import will take longer but use fewer cluster resources; more clients means a faster import at the cost of more cluster resource usage. This parameter defaults to 1 if not specified, and it is recommended that it not be set higher than the number of CPUs on the machine where the import is taking place.
-e,--errors-log <path> Specifies a log file where JSON documents that could not be loaded are written to. A document might not be loaded if a key could not be generated for the document or if the document is not valid JSON. The errors file is written in the "json lines" format (one document per line).
-l,--log-file <path> Specifies a log file for writing debugging information about cbimport execution.
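
For example, --skip-rows and --limit-rows can be combined to load a large dataset one slice at a time. The command below (the file path and row counts are illustrative) skips the first 1000 data rows and imports the next 1000:
$ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
            -b default -d file:///data/people.csv -g key::%fname% \
            --skip-rows 1000 --limit-rows 1000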

Host Formats

When specifying a host for the cbimport command the following formats are expected:
  • couchbase://<addr>
  • <addr>:<port>
  • http://<addr>:<port>

We recommend using the couchbase://<addr> format for standard installations. The other two formats allow you to specify a port number, which is needed for non-default installations where the admin port has been set up on a port other than 8091.
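
For example, assuming a node at the illustrative address 10.0.0.1:
  • -c couchbase://10.0.0.1 (standard installation, default ports)
  • -c 10.0.0.1:9000 (admin port moved to 9000)
  • -c http://10.0.0.1:9000 (the same, with an explicit scheme)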

Key Generators

Key generators are used to generate a unique key for each document loaded. Keys can be generated by using a combination of column values (indicated by wrapping the column name in "%"), custom generators (currently #MONO_INCR#, which returns a monotonically increasing integer, and #UUID#, which returns a UUID), and arbitrary characters for formatting.

Here is an example of a key generation expression. Given the CSV dataset:

fname,age 
alice,40 
barry,36
And the key generator expression:
--generate-key key::%fname%::#MONO_INCR#
The following keys are generated:
key::alice::1 
key::barry::2

In the example above, we generate a key using the value of the "fname" field in each row together with a custom generator. To substitute the value of the "fname" field, we put the field name between two percent signs. This is an example of field substitution, and it makes it possible to build keys out of data that is already in the dataset.

This example also contains the generator function MONO_INCR, which increments by 1 each time the key generator is called. Since this is the first time this key generator was executed, it returns 1. If we executed the key generator again, it would return 2, and so on.

Any text that isn't wrapped in "%" or "#" is static text and will appear in all generated keys. If a key needs to contain a literal "%" or "#" in static text, it must be escaped by doubling the character ("%%" or "##").
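
For example, the hypothetical expression
--generate-key order##%fname%::#MONO_INCR#
generates the key order#alice::1 for the first row of the dataset above: the doubled "##" produces a literal "#", while %fname% and #MONO_INCR# are substituted as usual.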

If a key cannot be generated because the column specified in the key generator is not present in the row then that row will be skipped. To see a list of rows that were not imported due to failed key generation, users can specify the --errors-log <path> parameter to dump a list of all rows that could not be imported.
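
For example, the following command (the error-log path is illustrative) writes any rows that failed key generation to /tmp/import-errors.log:
$ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
            -b default -d file:///data/people.csv -g key::%fname% \
            -e /tmp/import-errors.log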

Examples

The examples in this section illustrate importing data from the following files:

/data/people.csv

fname,age
alice,40
barry,36

/data/people.tsv

fname  age
alice  40
barry  36

To import data from /data/people.csv using a key containing the "fname" column and utilizing 4 threads, the following command can be run:
$ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
            -b default -d file:///data/people.csv -g key::%fname% -t 4

To import data from /data/people.tsv using a key containing the "fname" column and the UUID generator, the following command would be run:
$ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
            -b default -d file:///data/people.tsv --field-separator $'\t' \
            -g key::%fname%::#UUID# -t 4

If the dataset is not available on the local machine where the command is run, but is available via an HTTP URL, we can still import the data using cbimport. If we assume that the data is located at http://data.org/people.csv, then we can import the data with the following command:
$ cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
            -b default -d http://data.org/people.csv -g key::%fname%::#UUID# -t 4

Discussion

The cbimport csv command is used to quickly import data from files containing CSV, TSV, or other separated-value data. While importing CSV, the cbimport command only utilizes a single reader. As a result, importing a large dataset may benefit from being partitioned into multiple files, with a separate cbimport process run on each file, as sketched below.
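
The following is a minimal sketch of that approach, assuming a bash shell and GNU coreutils; the file name, chunk size, and key expression are illustrative. It splits a large CSV into chunks, re-attaches the header line to each chunk, and imports the chunks in parallel. Note that #UUID# is used rather than #MONO_INCR#, since the latter would generate colliding keys across separate cbimport processes.

# Split the data rows (excluding the header line) into chunks of 250,000 rows.
tail -n +2 /data/big.csv | split -l 250000 - /data/chunk_
# Re-attach the header to each chunk, then import all chunks in parallel.
for f in /data/chunk_*; do
  { head -n 1 /data/big.csv; cat "$f"; } > "$f.csv"
  cbimport csv -c couchbase://127.0.0.1 -u Administrator -p password \
    -b default -d "file://$f.csv" -g key::%fname%::#UUID# -t 4 &
done
wait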

cbimport json

cbimport json imports JSON data into a Couchbase cluster.

Synopsis

cbimport json [--cluster <url>] [--bucket <bucket_name>] [--dataset <path>]
              [--format <data_format>] [--username <username>] [--password <password>]
              [--generate-key <key_expr>] [--cacert <path>] [--no-ssl-verify]
              [--threads <num>] [--errors-log <path>] [--log-file <path>]

Description

Imports JSON data into Couchbase. The cbimport command supports files that have one JSON document on each line, files that contain a JSON list (that is, an array) where each element is a document, and the Couchbase sample format. The file format can be specified with the --format flag. See Dataset Formats for more details on the supported file formats.

The cbimport command also supports custom key-generation for each document in the imported file. Key generation is done with a combination of pre-existing fields in a document and custom generator functions supplied by cbimport. For details about key generators, see Key Generators.

Options

The following tables list the required and optional parameters for the cbimport json command.

Table 4. Required parameters for cbimport json
Option Description
-c,--cluster <url> The host name of a node in the cluster to import data into. See Host Formats for details about host name specification formats.
-u,--username <username> The user name for cluster authentication. The user must have the appropriate privileges to write to the bucket into which the data will be loaded.
-p,--password <password> The password for cluster authentication. The user must have the appropriate privileges to write to the bucket into which the data will be loaded. Specifying this option without a value will allow the user to type a non-echoed password to stdin.
-b,--bucket <bucket_name> The name of the bucket to import data into.
-d,--dataset <uri> The URI of the dataset to be loaded. cbimport supports loading data from a local file or from a URL. When importing data from a local file, the path must be prefixed with file://. When importing from a URL, the URL must be prefixed with either http:// or https://.
-f,--format <format> The format of the dataset specified (lines, list, sample). See Dataset Formats for more details on the formats supported by cbimport.
Table 5. Optional parameters for cbimport json
Option Description
-g,--generate-key <key_expr> Specifies a key expression used for generating a key for each document imported. This parameter is required for list and lines formats, but not for the sample format. See Key Generators for more information on specifying key generators.
--no-ssl-verify Skips the SSL verification phase. Specifying this flag will allow a connection using SSL encryption, but will not verify the identity of the server you connect to. You are vulnerable to a man-in-the-middle attack if you use this flag. Either this flag or the --cacert flag must be specified when using an SSL encrypted connection.
--cacert <path> Specifies a CA certificate that will be used to verify the identity of the server being connected to. Either this flag or the --no-ssl-verify flag must be specified when using an SSL encrypted connection.
-t,--threads <num> Specifies the number of concurrent clients to use when importing data. Fewer clients means the import will take longer but use fewer cluster resources; more clients means a faster import at the cost of more cluster resource usage. This parameter defaults to 1 if not specified, and it is recommended that it not be set higher than the number of CPUs on the machine where the import is taking place.
-e,--errors-log <path> Specifies a log file where JSON documents that could not be loaded are written to. A document might not be loaded if a key could not be generated for the document or if the document is not valid JSON. The errors file is written in the "lines" format (one document per line).
-l,--log-file <path> Specifies a log file for writing debugging information about cbimport execution.

Host Formats

When specifying a host for the cbimport command the following formats are expected:
  • couchbase://<addr>
  • <addr>:<port>
  • http://<addr>:<port>

We recommend using the couchbase://<addr> format for standard installations. The other two formats allow you to specify a port number, which is needed for non-default installations where the admin port has been set up on a port other than 8091.

Dataset Formats

The cbimport command supports the following formats:
  • Lines
    The lines format specifies a file that contains one JSON document on every line in the file. This format is specified by setting the --format option to "lines". Here's an example of a file in lines format:
    {"key": "mykey1", "value": "myvalue1"}
    {"key": "mykey2", "value": "myvalue2"}
    {"key": "mykey3", "value": "myvalue3"}
    {"key": "mykey4", "value": "myvalue4"}
  • List
    The list format specifies a file which contains a JSON list where each element in the list is a JSON document. The file may only contain a single list, but the list may be specified over multiple lines. This format is specified by setting the --format option to "list". Here's an example of a file in list format:
    [
      {
        "key": "mykey1",
        "value": "myvalue1"
      },
      {"key": "mykey2", "value": "myvalue2"},
      {"key": "mykey3", "value": "myvalue3"},
      {"key": "mykey4", "value": "myvalue4"}
    ]
  • Sample
    The sample format specifies a ZIP file or folder containing multiple documents. This format is intended to load Couchbase sample data sets. Unlike the lines and list formats, the sample format may also contain index, view, and full-text index definitions. Here's the folder structure for the sample format:
    + (root folder)
      + docs
        key1.json
        key2.json
        ...
      + design_docs
        indexes.json
        views.json

    All documents in the sample format are contained in the docs folder, with one file per document. Each file name in the docs folder is the key name for the JSON document contained in the file. If the file name contains a .json extension, the extension is excluded from the key name during the import. This name can be overridden if the --generate-key option is specified. The docs folder may also contain sub-folders of documents to be imported. Sub-folders can be used to organize large numbers of documents into a more readable, categorized form.

    The design_docs folder contains index definitions. The file name indexes.json is reserved for secondary indexes. All other file names are used for view indexes.
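
    For example, a sample data set packaged as a ZIP file could be imported with a command like the following (the path and bucket name are illustrative). No --generate-key expression is required, because the keys are taken from the document file names:
    $ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
        -b travel-sample -d file:///data/travel-sample.zip -f sample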

Key Generators

Key generators are used to generate a key for each document loaded. Keys can be generated by using a combination of characters, the values of a given field in a document, and custom generators. Field substitutions are done by wrapping the field name in "%", and custom generators are wrapped in "#".

Here is an example of a key generation expression. Given the document:

{
  "name": "alice",
  "age": 40
}
And the key generator expression:
--generate-key key::%name%::#MONO_INCR#
The following key is generated:
key::alice::1

In the example above, we generate a key using both the value of a field in the document and a custom generator. The "%name%" portion of the expression tells the key generator to substitute the value of the "name" field into the key.

This example also contains the generator function MONO_INCR, which increments by 1 each time the key generator is called. Since this is the first time this key generator was executed, it returns 1. If we executed the key generator again, it would return 2, and so on. The cbimport command currently provides a monotonic increment generator (MONO_INCR) and a UUID generator (UUID).

Any text that isn't wrapped in "%" or "#" is static text and will appear in all generated keys. If a key needs to contain a literal "%" or "#" in static text, it must be escaped by doubling the character ("%%" or "##").

If a key cannot be generated because the field specified in the key generator is not present in the document, then that document will be skipped. To see a list of documents that were not imported due to failed key generation, specify the --errors-log <path> parameter to dump a list of all documents that could not be imported to a file.

Examples

The examples in this section illustrate importing data from the following files:

/data/lines.json

{"name": "alice", "age": 37}
{"name": "bob", "age": 39}

/data/list.json

[
  {"name": "candice", "age": 42},
  {"name": "daniel", "age": 38}
]

To import data from /data/lines.json using a key containing the "name" field and utilizing 4 threads, the following command can be run:
$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
  -b default -d file:///data/lines.json -f lines -g key::%name% -t 4

To import data from /data/list.json using a key containing the "name" field and the UUID generator, the following command would be run:
$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
   -b default -d file:///data/list.json -f list -g key::%name%::#UUID# -t 4

If the dataset is not available on the local machine where the command is run, but is available via an HTTP URL, we can still import the data using cbimport. If we assume that the data is located at http://data.org/list.json and that the dataset is in the JSON list format, then we can import the data with the command below:
$ cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
   -b default -d http://data.org/list.json -f list -g key::%name%::#UUID# -t 4

Discussion

The cbimport json command is used to quickly import data from files containing JSON data. While importing JSON, the cbimport command only utilizes a single reader. As a result, importing a large dataset may benefit from being partitioned into multiple files, with a separate cbimport process run on each file, as sketched below.
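
Since a lines-format file contains exactly one document per line, it can be split directly, with no header handling required. The following is a minimal sketch, assuming a bash shell and GNU coreutils; the file name and chunk size are illustrative, and #UUID# keeps keys unique across the separate processes.

# Split the lines-format file into chunks of 500,000 documents each.
split -l 500000 /data/big.json /data/part_
# Import all chunks in parallel.
for f in /data/part_*; do
  cbimport json -c couchbase://127.0.0.1 -u Administrator -p password \
    -b default -d "file://$f" -f lines -g key::%name%::#UUID# &
done
wait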