Couchbase is a document database and functions best when document contents are JSON. Nevertheless, Couchbase may also be used to store non-JSON data for various use cases. This page discusses how to use Couchbase with non-JSON documents, including strings and binary data.
Non-JSON formats may be more efficient in terms of memory and processing power (for example, when storing only flat strings, JSON adds a syntactical overhead of two bytes per string). Non-JSON documents may also be desirable when migrating a legacy application that uses a custom binary format.
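The two-byte overhead mentioned above comes from the quotes JSON requires around a string value. A quick check in plain Python (no Couchbase required) illustrates this:

```python
import json

# A flat ASCII string stored as JSON gains exactly two bytes:
# the surrounding double quotes.
doc = "hello"
encoded = json.dumps(doc)

print(len(doc), len(encoded))  # 5 vs 7
assert encoded == '"hello"'
assert len(encoded) == len(doc) + 2
```

Note that strings containing characters JSON must escape (quotes, backslashes, control characters) grow by more than two bytes.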
Note that only JSON documents can be accessed using Query (N1QL). Limited support for non-JSON documents is available through MapReduce views. Additionally, non-JSON documents are not accessible through the Web UI (their contents are shown as the Base64 equivalent).
Using non-JSON documents
It's important to note that a JSON document can also be a simple integer (42), string ("hello"), array ([1,2,3]), boolean (true, false), or the JSON null value. Nevertheless, if your application requires a non-JSON format, the SDK may still support it natively.
If there is no native support for your format, you can write a transcoder which handles the encoding and decoding of your documents to and from the server.
Every item (record) in Couchbase has metadata stored along with it on the server. One of the metadata fields is a 32-bit "flags" value.
Couchbase SDKs accept native object types (integers, strings, arrays, dictionaries) as valid inputs for a Document and internally convert them to JSON before sending them to the server to be stored. When the SDK serializes the Document, it notes the type of serialization performed (JSON) and sends a corresponding type code along with the serialized document to the server to be stored. This type code is stored in the flags field within the item’s metadata.
Later, when retrieving the document from the server, the SDK checks the type code which informs it about the type of serialization used to encode the document, and thus how the SDK should de-serialize the document.
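The round trip described above can be sketched in plain Python. This is a simplified illustration, not the SDK's actual behavior or API: `encode` and `decode` are hypothetical names, the dispatch on strings is for demonstration only (real SDKs serialize strings to JSON unless configured otherwise), and only two formats are handled.

```python
import json

# Common typecodes, stored in the upper 8 bits of the 32-bit flags field
FMT_JSON = 0x02 << 24
FMT_UTF8 = 0x04 << 24

def encode(value):
    """Serialize a value and return (bytes, flags), as an SDK would
    before sending the document to the server."""
    if isinstance(value, str):
        return value.encode('utf-8'), FMT_UTF8
    return json.dumps(value).encode('utf-8'), FMT_JSON

def decode(data, flags):
    """Pick the de-serializer based on the flags retrieved from the
    item's metadata."""
    if flags == FMT_UTF8:
        return data.decode('utf-8')
    if flags == FMT_JSON:
        return json.loads(data.decode('utf-8'))
    raise ValueError("unrecognized flags: 0x%08x" % flags)

data, flags = encode({"answer": 42})
assert decode(data, flags) == {"answer": 42}
```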
By default, SDKs accept only document types that can be serialized to JSON, and serialize them only to JSON. However, applications can configure the SDK to use non-JSON serialization and accept other types of input as documents. Note that using non-JSON serialization prevents the document from being accessible via Query and MapReduce.
The table below shows the built-in formats available in most SDKs. The Name column gives the format name; the Native Type column gives the native language type used as input and output for the document; the Description column lists the properties of the format; and the Flags value column gives the actual code used for the format. The format code is discussed in more depth below.
| Name | Native Type | Description | Flags value (see below) |
| --- | --- | --- | --- |
| JSON | Dictionary, Array, Number, String, Integer | Default serialization. Serializes the document to JSON. The document can be used with Query. | `0x02 << 24` |
| UTF-8 | Unicode or String | Indicates the document is a UTF-8 string. This may be more space-efficient than JSON for documents that are a simple string, as JSON requires strings to be enclosed in quotes; this format saves two bytes per string value. | `0x04 << 24` |
| RAW | ByteArray, buffer, etc. | Indicates this value is a raw sequence of bytes. It is the simplest encoding form and indicates that the application will process and interpret its contents as it sees fit. | `0x03 << 24` |
| PRIVATE | SDK/language dependent | Indicates that a language-specific serialization format is being used. The serialization format depends on the language (for example, Pickle for Python, Marshal for Ruby, Java Serialization for Java). Using this format will make your documents inaccessible from other-language SDKs. | `0x01 << 24` |
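For reference, the shifted flags values in the table expand to the following 32-bit constants (plain Python arithmetic; the constant names here are illustrative, not SDK identifiers):

```python
# Common typecodes from the table, shifted into the upper 8 bits
# of the 32-bit flags field.
FMT_PRIVATE = 0x01 << 24   # 0x01000000
FMT_JSON    = 0x02 << 24   # 0x02000000
FMT_RAW     = 0x03 << 24   # 0x03000000
FMT_UTF8    = 0x04 << 24   # 0x04000000

for name, code in [("PRIVATE", FMT_PRIVATE), ("JSON", FMT_JSON),
                   ("RAW", FMT_RAW), ("UTF8", FMT_UTF8)]:
    print("%-7s 0x%08x" % (name, code))
```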
If your application must store data that cannot be handled by any of the built-in SDK formats (for example, if the application wishes to store data as UTF-16), there are generally two options:
Using the RAW format
```
>>> from couchbase import FMT_BYTES
>>> cb.upsert('utf16_doc', 'Hello, UTF-16 World!'.encode('utf16'), format=FMT_BYTES)
OperationResult<RC=0x0, Key=u'utf16_doc', CAS=0x35cc17c7f213>
>>> raw_buf = cb.get('utf16_doc').value
>>> raw_buf
'\xff\xfeH\x00e\x00l\x00l\x00o\x00,\x00 \x00U\x00T\x00F\x00-\x001\x006\x00 \x00W\x00o\x00r\x00l\x00d\x00!\x00'
>>> utf16_doc = raw_buf.decode('utf16')
>>> utf16_doc
u'Hello, UTF-16 World!'
```
Using a custom transcoder
A transcoder is a pair of functions responsible for serializing (encoding) a document before it is sent to the server and de-serializing (decoding) it to a suitable application type when it is retrieved.
Using a transcoder is preferred over RAW serialization when possible: it provides a cleaner interface, and it works in a mixed-type environment (where there are multiple custom types) without the application having to encode the document type within the document itself.
The encoding function in the transcoder accepts a native Document type as input (as created by your application), encodes it as a byte buffer, and returns the byte buffer along with a type code.
The decoding function accepts a buffer (as fetched from the server) and a type code (the flags from the metadata) and returns the intended type to be used for the Document within the application.
```python
from couchbase.transcoder import Transcoder

UTF16_TYPECODE = 0x0A000000

class Utf16Transcoder(Transcoder):
    def encode_value(self, value, format):
        if format == UTF16_TYPECODE:
            return value.encode('utf16'), UTF16_TYPECODE
        else:
            # Call default implementation
            return super(Utf16Transcoder, self).encode_value(value, format)

    def decode_value(self, value, flags):
        if flags == UTF16_TYPECODE:
            return value.decode('utf16')
        else:
            # Call default implementation
            return super(Utf16Transcoder, self).decode_value(value, flags)

cb.transcoder = Utf16Transcoder()
cb.upsert('utf16_doc', 'Hello, UTF-16 World!', format=UTF16_TYPECODE)
cb.get('utf16_doc')
```
Format flags (type codes) and SDK interoperability
Modern Couchbase SDKs have standardized type codes for the various built-in document formats. This has not always been the case, however: older, legacy SDKs used different flag values for type codes (for example, the code for a string value could be 100 or 4, depending on the SDK used).
To remain backwards-compatible with legacy SDKs while retaining interoperability with current SDKs, the standard type codes are composed as described below. Note that type codes are stored in the flags field of the item's metadata on the server, which is a 32-bit field.
Current SDKs set the flags value using these two factors:
- The modern or common typecode: This is the modern SDK code for a given type, and is standard across all SDKs.
- The legacy or compat typecode: This is the code which was used by older versions of a given SDK. It is valid only for that language’s SDK. It is important to note that all legacy typecodes (regardless of language) are under 24 bits in width. Legacy SDKs will also often have a mask value (typically no wider than 16 bits).
For example, the legacy typecode for the JSON format in Python is 0x00, and the common typecode is 0x02. The resultant typecode is:
(0x02 << 24) | (0x00) == 0x02000000
Another example: the legacy typecode for the RAW format in Python is 0x02, and the common typecode is 0x03. The resultant typecode is:
(0x03 << 24) | (0x02) == 0x03000002
When defining a new type code for use with a custom transcoder, keep the above layout in mind so that your code does not clash with any existing ones.
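The composition rule above can be captured in a small helper. This is a sketch only: `make_flags` is a hypothetical name, not an SDK function.

```python
def make_flags(common_code, legacy_code=0x00):
    """Compose a 32-bit flags value from a common (modern) typecode
    and an optional per-language legacy typecode.

    The common code occupies the upper 8 bits; legacy typecodes are
    always under 24 bits wide, so the two never overlap.
    """
    if legacy_code >= (1 << 24):
        raise ValueError("legacy typecodes must be under 24 bits wide")
    return (common_code << 24) | legacy_code

# The two examples from the text:
assert make_flags(0x02, 0x00) == 0x02000000
assert make_flags(0x03, 0x02) == 0x03000002
```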