Hadoop Connector 1.2

The Couchbase Hadoop Connector allows you to connect to Couchbase Server 2.5, 3.x, or 4.x and stream keys into the Hadoop Distributed File System (HDFS) or Hive for processing with Hadoop.

If you have previously used Apache Sqoop with other databases, using the Couchbase Hadoop Connector should be straightforward, because it uses a similar command-line argument structure. Some of the arguments might seem slightly different, however, because Couchbase has a very different structure than a typical RDBMS.

Getting Started

Download the Couchbase Hadoop Connector version 1.2.0 from http://packages.couchbase.com/clients/connectors/couchbase-hadoop-plugin-1.2.0.zip.

The Couchbase Hadoop Connector is supported on Cloudera 5; Cloudera has certified the Couchbase Hadoop Connector 1.2 release for Cloudera 5.

The Couchbase Hadoop Connector is supported on Hortonworks Data Platform (HDP) 2.2; Hortonworks has certified the Couchbase Hadoop Connector 1.2 release for HDP 2.2.

Installation

You can install the Couchbase Hadoop Connector either by running a script or by manually copying the files to specified directories within your Sqoop installation. The distribution package contains a set of files that need to be copied into your Sqoop installation, and a script that copies the files for you if you provide the path to the Sqoop installation.

The following table describes the files in the Couchbase Hadoop Connector distribution, and lists where each file is installed. In the installation location column, $sqoop_home represents the path to your Sqoop installation.

Table 1. Files in the Couchbase Hadoop Connector package
File name | Description | Installation location
couchbase-client-1.4.4.bundled.jar | A library dependency of the connector that handles the basic communications with the Couchbase cluster | $sqoop_home/lib
couchbase-config.xml | A property file used to register a ManagerFactory for the connector with Sqoop | $sqoop_home/conf
couchbase-hadoop-plugin-1.2.0.jar | The Couchbase Hadoop Connector | $sqoop_home/lib
couchbase-manager.xml | A property file that tells Sqoop where the ManagerFactory defined in couchbase-config.xml resides | $sqoop_home/conf/managers.d
install.sh | The Couchbase Hadoop Connector installation script | Not applicable
jettison-1.1.jar | A dependency of the Couchbase client | $sqoop_home/lib
netty-3.5.5.Final.jar | A dependency of the Couchbase client | $sqoop_home/lib
spymemcached-2.11.4.jar | A library dependency of the Couchbase client that provides networking and core protocol handling for data transfer | $sqoop_home/lib

Script-based Installation

Script-based installation is done with the install.sh script included in the connector download. The script takes one argument: the path to your Sqoop installation. The basic command format for invoking the script is:

shell> sh install.sh path_to_sqoop_home

In an HDP deployment, Sqoop is located at /usr/hdp/current/sqoop-client, which you use as the path to the Sqoop installation. For HDP, invoke the installation script as follows:

shell> sh install.sh /usr/hdp/current/sqoop-client

Manual Installation

To install the Couchbase Hadoop Connector manually, copy each JAR and XML file listed in the table Files in the Couchbase Hadoop Connector package into the directory specified in the installation location column.
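
As a minimal sketch of the manual copy, assuming the connector archive has been unpacked into the current directory and that $sqoop_home is set to your Sqoop installation path, the commands might look like the following:

shell> mkdir -p $sqoop_home/conf/managers.d   # create managers.d if it does not already exist
shell> cp couchbase-hadoop-plugin-1.2.0.jar couchbase-client-1.4.4.bundled.jar \
    spymemcached-2.11.4.jar jettison-1.1.jar netty-3.5.5.Final.jar \
    $sqoop_home/lib
shell> cp couchbase-config.xml $sqoop_home/conf
shell> cp couchbase-manager.xml $sqoop_home/conf/managers.d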

Uninstallation

Uninstalling the connector requires removing all of the files that were added to Sqoop during installation. To remove them, cd into your Sqoop home directory and execute the following command:

shell> rm lib/couchbase-hadoop-plugin-1.2.0.jar \
    lib/spymemcached-2.11.4.jar \
    lib/jettison-1.1.jar \
    lib/netty-3.5.5.Final.jar \
    lib/couchbase-client-1.4.4.bundled.jar \
    conf/couchbase-config.xml \
    conf/managers.d/couchbase-manager.xml

Using Sqoop

The Couchbase Hadoop Connector can be used with a variety of command-line tools provided by Sqoop. In this section, we discuss the usage of each tool.

Tables

Because Sqoop is built for a relational model, it requires that the user specify a table when importing and exporting data. The Couchbase Hadoop Connector uses the --table option to specify the type of data stream used when importing from and exporting to Couchbase.

For exports, the user must enter a value for the --table option, though what is entered will not be used by the connector.

For imports, the --table option accepts one of two values, described below; the connector exits with an error if it is given anything else.

  • DUMP: Causes all keys currently in Couchbase to be read into HDFS. Any data items received by the Couchbase cluster while this command is running are also passed along by the connector, meaning that new or changed items are part of the dump. However, items removed while the dump is running are not removed from the output.

  • BACKFILL_##: Streams all key mutations for a specified amount of time (in minutes). For example, BACKFILL_5 means stream key mutations in the Couchbase server for 5 minutes and then stop the stream.

Connect string

A connect string option is required to connect to Couchbase. This can be specified with --connect as an argument to the sqoop command. The following are example connect strings:

http://10.2.1.55:8091/pools
http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools

When creating your connect strings, replace the IP addresses above with the host names or IP addresses of one or more nodes in your Couchbase cluster. If you have multiple servers, list them in a comma-separated list.
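
For example, a sketch of an import that lists two cluster nodes in the connect string might look like the following; the IP addresses are placeholders for your own nodes:

shell> sqoop import \
    --connect http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools \
    --table DUMP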

Connecting to different buckets

By default, the Couchbase Hadoop Connector connects to the default bucket. If you want to connect to a bucket other than the default bucket, you can specify the bucket name with the --username option. If the bucket has a password, use the --password option followed by the password.

Note that there are several ways the password can be supplied to Sqoop. The -P argument prompts for the password on the console, and the --password-file argument allows the password to be read from a file within HDFS.
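
As a sketch of the --password-file approach, you might stage the password in HDFS along the following lines; the file name and HDFS path here are illustrative only:

shell> echo -n 'mypassword' > couchbase.password   # -n avoids storing a trailing newline as part of the password
shell> hdfs dfs -put couchbase.password /user/sqoop/couchbase.password
shell> hdfs dfs -chmod 400 /user/sqoop/couchbase.password

The HDFS path can then be passed to --password-file, as shown in the import examples below.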

Importing

Importing data from Couchbase into HDFS requires the use of the sqoop import command followed by the parameters --connect and --table.

The following example dumps all items from Couchbase into HDFS. Since the Couchbase Java Client has support for a number of different data types, all values are normalized to strings when being written to a Hadoop text file.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'

The following example streams all item mutations from Couchbase into HDFS, for a period of 10 minutes.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table BACKFILL_10 \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'

In both of the above examples, the delimiters for fields, records, and escaping have been explicitly specified. Note that Sqoop's defaults are the comma (,) as the field delimiter and the newline character (\n) as the record delimiter; these defaults should not be used when the field data itself contains commas or newline characters, because the records would become ambiguous or unparsable.
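
If you also want to set the record delimiter explicitly, Sqoop's --lines-terminated-by option can be added alongside the field options; the following sketch simply spells out the newline default, using the same placeholder cluster address as the examples above:

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --lines-terminated-by '\n' \
    --escaped-by \\ --enclosed-by '\"'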

Sqoop provides many more options to the import command than are covered in this document. Run sqoop help import for a list of all options, and see the Sqoop documentation for more details about them.

You have a number of options for how to supply the password when accessing a bucket. The following examples are equivalent for a bucket named mybucket, which uses the password mypassword:

shell> sqoop import --username mybucket -P --verbose \
    --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'
shell> sqoop import --username mybucket --password mypassword --verbose \
    --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'
shell> sqoop import --username mybucket --password-file passwordfile \
    --verbose --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'

When the import job executes, it also generates a .java source code file that can facilitate reading and writing the imported records from other Hadoop MapReduce jobs. If, for instance, the job was a DUMP, Sqoop generates a DUMP.java source code file.
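
If you want to control where that generated source is written or what the class is called, Sqoop's code-generation options can be added to the import; the directory and class names below are illustrative only:

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"' \
    --outdir generated-src --class-name CouchbaseDump

With --class-name, the generated file is named after the class (CouchbaseDump.java in this sketch) rather than after the table value.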

Exporting

Exporting data to your cluster requires the use of the sqoop export command, followed by the parameters --connect, --export-dir, and --table.

The following example exports all records from the files in the HDFS directory specified by --export-dir into Couchbase.

shell> sqoop export --connect http://10.2.1.55:8091/pools \
    --table couchbaseExportJob --export-dir data_for_export \
    --fields-terminated-by '\t' --escaped-by \\ --enclosed-by '\"'

When the export job executes, it also generates a .java source code file that shows how the data was read. If, for instance, the job had the argument --table couchbaseExportJob, Sqoop generates a couchbaseExportJob.java source code file.

List table

Sqoop has a tool called list-tables. Couchbase does not have a notion of tables; instead, the connector uses DUMP and BACKFILL_## as values for the --table option.

Since the list-tables command serves no real purpose with the Couchbase Hadoop Connector, using it is not recommended.

Import all tables

Sqoop has a tool called import-all-tables. Couchbase does not have a notion of tables.

Since the import-all-tables command serves no real purpose with the Couchbase Hadoop Connector, using it is not recommended.

Limitations

While the Couchbase Hadoop Connector provides many features for moving data between Couchbase and Hadoop, there is some Sqoop functionality that the connector does not implement:

  • Querying: You cannot run queries against Couchbase through the connector. All Sqoop tools that attempt to do so fail with a NotSupportedException.

  • list-databases tool: Even though Couchbase is a multitenant system that allows for multiple buckets (which are analogous to databases), there is no way to list these buckets from Sqoop. The list of buckets is available through the Couchbase cluster web console, or directly from the Couchbase REST API, as sketched after this list.

  • eval-sql tool: Couchbase does not use SQL, so this tool is not appropriate.

  • The Couchbase Hadoop Connector does not automatically handle some classes of failure in a Couchbase cluster, and it does not automatically handle changes to Couchbase cluster topology while the Sqoop task is running.
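
As a sketch of a workaround for the missing list-databases support, the bucket list can be read directly from the Couchbase REST API; the host, port, and credentials below are placeholders for your own cluster:

shell> curl -u Administrator:password http://10.2.1.55:8091/pools/default/buckets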