Text Analysis

Text analysis is the process of transforming input text into a sequence of terms: canonical forms of words, typically stripped of inflections and extraneous characters. Analysis is performed on document text when you create the index, and again on the text of your query when you perform a search.

When people talk about full text search being "language aware", they are usually referring to the results of text analysis.

Example

Suppose you have a document containing the text "watery light beer", and you search for "watered down".

There is no exact match in this case, but you can make the query match the document by choosing the right analyzer: here, one that can stem English words.

With stemming applied, the index entry for the document contains the terms "water", "light", and "beer", and the query searches for the terms "water" and "down".

The term "water" now appears on both sides, so the document matches and you get the expected search results.

Stemming is just one way in which text can be transformed. The key point is that choosing the right analyzer is essential to getting high-quality search results.
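
To see the whole round trip in code, here is a minimal sketch using the open-source bleve library, the Go text-indexing engine that underlies Full Text Search. The module path, the "en" analyzer name, and the field names are assumptions for illustration; the exact terms a stemmer produces can vary by stemmer version.

    package main

    import (
        "fmt"

        "github.com/blevesearch/bleve/v2"
        // Ensure the "en" analyzer is registered (it may already be
        // registered by default, depending on the bleve version).
        _ "github.com/blevesearch/bleve/v2/analysis/lang/en"
    )

    func main() {
        // Map the "description" field to the built-in English analyzer so
        // that document text and query text are stemmed the same way.
        fieldMapping := bleve.NewTextFieldMapping()
        fieldMapping.Analyzer = "en"

        docMapping := bleve.NewDocumentMapping()
        docMapping.AddFieldMappingsAt("description", fieldMapping)

        indexMapping := bleve.NewIndexMapping()
        indexMapping.DefaultMapping = docMapping

        index, err := bleve.New("example.bleve", indexMapping)
        if err != nil {
            panic(err)
        }
        defer index.Close()

        // "watery" is reduced to its stem at index time.
        if err := index.Index("doc1", map[string]interface{}{
            "description": "watery light beer",
        }); err != nil {
            panic(err)
        }

        // "watered" is reduced at query time, so the terms can line up.
        query := bleve.NewMatchQuery("watered down")
        query.SetField("description")
        result, err := index.Search(bleve.NewSearchRequest(query))
        if err != nil {
            panic(err)
        }
        fmt.Println(result.Total) // 1 hit if both forms stem to the same term
    }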

Analyzers

Analyzers transform input text into a stream of tokens for indexing. Full Text Search includes a number of analyzers, and you can also build your own. Analyzers are built up from modular components; a sketch that chains all three together follows the list.
  • Character Filters strip undesirable characters from the input. For example, the "html" character filter ignores HTML tags and indexes only the content of HTML documents.
  • Tokenizers split input strings into a token stream. Typically, you are trying to create a token for each word.
  • Token Filters are chained together to perform additional post-processing on the token stream.
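
Here is a sketch of how such a pipeline might be assembled with bleve's mapping API. The component registry names used here ("html", "unicode", "to_lower", "stemmer_en_snowball") are assumptions based on bleve's built-in components and may differ across versions.

    package main

    import (
        "fmt"

        "github.com/blevesearch/bleve/v2"
    )

    func main() {
        indexMapping := bleve.NewIndexMapping()
        // Compose a custom analyzer from the three component types,
        // applied in order: character filters, tokenizer, token filters.
        err := indexMapping.AddCustomAnalyzer("html_en", map[string]interface{}{
            "type":          "custom",
            "char_filters":  []string{"html"},  // 1. strip HTML markup
            "tokenizer":     "unicode",         // 2. split text into word tokens
            "token_filters": []string{"to_lower", "stemmer_en_snowball"}, // 3. post-process tokens
        })
        if err != nil {
            panic(err)
        }
        fmt.Println("custom analyzer registered")
    }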

Tokenizers

Tokenizers cut sequences of characters into meaningful units (tokens) while ignoring useless characters, like spaces or commas.

Stemming is the process of reducing words to an underlying base form. The goal of stemming is to remove grammatical inflections while preserving the words’ meanings; the resulting base forms are then written into the index. In Full Text Search, stemming is performed by token filters.

In the Couchbase Server Web Console, you can find the available stemmers under Index > Full Text > Edit or Create New Index > Analyzers. If you click "New Analyzer", the available stemmers appear in the "Token Filters" drop-down list.

Single Token

The Single Token Tokenizer returns the entire input as a single token. This is useful for values like URLs (http://www.example.com/blog/stuff+and+nonsense/) or email addresses (will.spammington@couchbase.com) that would otherwise be broken up at punctuation or special-character boundaries. It can also be used when a multi-word phrase should be indexed as a single term, for example, a proper place name that the whitespace tokenizer would otherwise break into multiple terms.

Consider the excerpt below from the brewery documents in the beer-sample database. You can use the single token tokenizer for the city, country, phone, or website fields, because they are continuous strings that you may not want to split at dots, dashes, or spaces. It is probably less useful for the name field, which users are likely to enter in many different ways.
    {
        "type": "brewery",
        "name": "21st Amendment Brewery Cafe",
        "city": "San Francisco",
        "country": "United States",
        "phone": "1-415-369-0900",
        "website": "http://www.21st-amendment.com/",
        ...
    }
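
One way to apply this, sketched with bleve's mapping API: the built-in "keyword" analyzer wraps the single token tokenizer, so mapping a field to it keeps the whole value as one term. The "keyword" analyzer name and the mapping structure are assumptions based on bleve and are shown for illustration.

    package main

    import "github.com/blevesearch/bleve/v2"

    func main() {
        // Keep the whole website value as one term by using the "keyword"
        // analyzer, which wraps the single token tokenizer.
        websiteField := bleve.NewTextFieldMapping()
        websiteField.Analyzer = "keyword"

        breweryMapping := bleve.NewDocumentMapping()
        breweryMapping.AddFieldMappingsAt("website", websiteField)

        indexMapping := bleve.NewIndexMapping()
        indexMapping.TypeField = "type" // beer-sample documents carry their type here
        indexMapping.AddDocumentMapping("brewery", breweryMapping)

        if _, err := bleve.New("breweries.bleve", indexMapping); err != nil {
            panic(err)
        }
    }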

Regular Expression

The Regular Expression Tokenizer tokenizes input using a configurable regular expression; the expression should match the token text itself. See the Whitespace Tokenizer for an example of a specific expression.
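
A custom instance might be registered through bleve's mapping API as sketched below; the "type" and "regexp" config keys are assumptions based on bleve's regexp tokenizer and may differ by version.

    package main

    import "github.com/blevesearch/bleve/v2"

    func main() {
        indexMapping := bleve.NewIndexMapping()
        // Register a tokenizer that keeps hyphenated words as single tokens.
        err := indexMapping.AddCustomTokenizer("hyphen_aware", map[string]interface{}{
            "type":   "regexp",
            "regexp": `[\w-]+`, // a token is a run of word characters or hyphens
        })
        if err != nil {
            panic(err)
        }
    }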

Whitespace

The Whitespace Tokenizer is an instance of the Regular Expression Tokenizer that uses the regular expression \w+, which matches runs of letters, digits, and underscores. For many Western languages, this works well as a simple tokenizer.
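
You can see what \w+ produces with Go's standard regexp package alone (this is plain Go, not the tokenizer itself):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // \w+ matches runs of letters, digits, and underscores, so tokens
        // break at whitespace and at punctuation.
        re := regexp.MustCompile(`\w+`)
        fmt.Println(re.FindAllString("Hello, watery light-beer!", -1))
        // Output: [Hello watery light beer]
    }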

Unicode

The Unicode Tokenizer uses the segment library to perform Unicode Text Segmentation on word boundaries. It is recommended over the ICU tokenizer for all languages that do not require the dictionary-based tokenization supported by ICU.

Customizing the Field Mappings

Dynamic Index Mapping

When the indexer encounters a field whose type you haven’t explicitly specified, it guesses the type by looking at the JSON value.
Type of JSON value               Indexed as...
Boolean                          Boolean
Number                           Number
String containing a date         Date
String (not containing a date)   String
The indexer attempts to parse String values as dates, and indexes them as dates if parsing succeeds. Note that other String values such as "7" or "true" are never indexed as a Number or Boolean. Internally, the Number type is modeled as a 64-bit floating-point value.
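
The rule can be pictured with a plain Go sketch. This is an illustration of the behavior described above, not the actual FTS implementation, and the date layout the indexer accepts may differ from the RFC 3339 layout assumed here.

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // guessType mirrors the inference rule described above: values
    // unmarshaled from JSON arrive as bool, float64, or string, and only
    // strings that parse as dates are promoted to the date type.
    func guessType(v interface{}) string {
        switch s := v.(type) {
        case bool:
            return "boolean"
        case float64: // JSON numbers decode to 64-bit floats
            return "number"
        case string:
            if _, err := time.Parse(time.RFC3339, s); err == nil {
                return "date"
            }
            return "string" // "7" and "true" stay strings
        default:
            return "unsupported"
        }
    }

    func main() {
        var doc map[string]interface{}
        data := `{"abv": 7.2, "organic": true, "updated": "2010-07-22T20:00:20Z", "code": "7"}`
        if err := json.Unmarshal([]byte(data), &doc); err != nil {
            panic(err)
        }
        for field, value := range doc {
            fmt.Println(field, "->", guessType(value))
        }
    }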