Understanding Analyzers

Understanding Analyzers

Analyzers increase search-awareness by transforming input text into token-streams, which permit the management of richer and more finely controlled forms of text-matching. An analyzer consists of modules, each of which performs a particular, sequenced role in the transformation.

Principles of Text-Analysis

Analyzers pre-process input-text submitted for Full Text Search; typically, by removing characters that might prohibit certain match-options. Analysis is performed on document-contents when indexes are created; and is also performed on the input-text submitted for a search. The benefit of analysis is often referred to as language awareness.

For example, if the input-text for a search is enjoyed staying here, and the document-content contains the phrase many enjoyable activities, the dictionary-based words do not permit a match. However, by using an analyzer that (by means of its inner Token Filter component) stems words, the input-text yields the tokens enjoy, stay, and here; while the document-content yields the tokens enjoy and activ. By means of the common token enjoy, this permits a match between enjoyed and enjoyable.

Since different analyzers pre-process text in different ways, effective Full Text Search depends on the right choice of analyzer, for the type of matches that are desired.

Couchbase Full Text Search provides a number of pre-constructed analyzers that can be used with Full Text Indexes. Additionally, analyzers can be custom-created, by means of the Couchbase Web Console. The remainder of this page explains the architecture of analyzers, and describes the modular components that Couchbase Full Text Search makes available for custom-creation. It also lists the pre-constructed analyzers that are available, and describes the modules that they contain.

For examples of both selecting and custom-creating analyzers by means of the Couchbase Web Console, see Creating Indexes.

Analyzer Architecture

Analyzers are built from modular components:

  • Character Filters remove undesirable characters from input: for example, the html character filter removes HTML tags, and indexes HTML text-content alone.
  • Tokenizers split input-strings into individual tokens, which together are made into a token stream. The nature of the decision-making whereby splits are made differs across tokenizers.
  • Token Filters are chained together, with each performing additional post-processing on each token in the stream provided by the tokenizer. This may include reducing tokens to the stems of the dictionary-based words from which they were derived, removing any remaining punctuation from tokens, and removing certain tokens deemed unnecessary.

Each component-type is described in more detail below. Note that these components can be used to custom-create an analyzer by means of the Couchbase Web Console. This is explained and exemplified in Creating Indexes.

Character Filters

Character Filters remove undesirable characters. The following filters are available:

  • html: Removes html elements such as <p>, and decodes expressions such as &amp; to their appropriate text equivalent.

  • zero_width_spaces: Substitutes a regular space-character for each zero-width non-joiner space.

Tokenizers

Tokenizers split input-strings into individual tokens: characters likely to prohibit certain kinds of matching (for example, spaces or commas) are omitted. The tokens so created are then made into a token stream for the query.

The following tokenizers are available from the Couchbase Web Console:

  • Letter: Creates tokens by breaking input-text into subsets that consist of letters only: characters such as punctuation-marks and numbers are omitted. Creation of a token ends whenever a non-letter character is encountered. For example, the text Reqmnt: 7-element phrase would return the following tokens: Reqmnt, element, and phrase.

  • Single: Creates a single token from the entirety of the input-text. For example, the text in each place would return the following token: in each place. Note that this may be useful for handling URLs or email-addresses, which can thus be prevented from being broken at punctuation or special-character boundaries. It may also be used to prevent multi-word phrases (for example, placenames such as Milton Keynes or San Francisco) from being broken up due to whitespace; so that they become indexed as a single term.

  • Unicode: Creates tokens by performing Unicode Text Segmentation on word-boundaries, using the segment library. For examples, see Unicode Word Boundaries.

  • Web: Creates tokens by identifying and removing html tags. For example, the text <h1>Introduction<\h1> would return the token Introduction.

  • Whitespace: Creates tokens by breaking input-text into subsets according to where whitespace occurs. For example, the text in each place would return the following tokens: in, each, and place.

Token Filters

Token Filters accept a token-stream provided by a tokenizer, and make modifications to the tokens in the stream.

A frequently used form of token filtering is stemming; this reduces words to a base form that typically consists of the initial stem of the word (for example, play, which is the stem of player, playing, playable, and more). With the stem used as the token, a wider variety of matches can be made (for example, the input-text player can be matched with the document-content playable).

The following kinds of token-filtering are supported by Couchbase Full Text Search:

  • dict_compoound: Allows user-specification of a dictionary whose words can be combined into compound forms, and individually indexed.

  • apostrophe: Removes all characters after an apostrophe, and the apostrophe itself. For example, they've becomes they.

  • elision: Identifies and removes characters that prefix a term and are separated from it by an apostrophe. For example, in French, l'avion becomes avion.

  • edge_ngram: From each token, computes n-grams that are rooted either at the front or the back.

  • keyword_marker: Identifies keywords and marks them as such. These are then ignored by any downstream stemmer.

  • normalize_unicode: Converts tokens into Unicode Normalization Form.

  • ngram: From each token, computes n-grams. There are two parameters, which are the minimum and maximum n-gram length.

  • shingle: Computes multi-token shingles from the token stream. For example, the token stream the quick brown fox, when configured with a shingle minimum and a shingle maximum length of 2, produces the tokens the quick, quick brown, and brown fox.

  • stemmer: Uses libstemmer to reduce tokens to word-stems.

  • stop_tokens: Removes from the stream tokens considered unnecessary for a Full Text Search: for example, and, is, and the.

  • to_lower: Converts all characters to lower case. For example, HTML becomes html.

  • truncate: Truncates each token to a maximum-permissible token-length.

  • possessive: Removes English possessives. For example, software's becomes software.

Note that token filters are frequently configured according to the special characteristics of individual languages. Couchbase Full Text Search provides multiple language-specific versions of the elision, normalize, stemmer, and stop token filters. Specially supported languages include ca (Catalan), fr (French), ga (Gaelic), it (Italian), ar (Arabic), ckb (Sorani Kurdish), fa (Persian), hi (Hindi), in (Indonesian), en (English), cs (Czech), el (Greek), eu (Basque), hy (Armenian), and pt (Portuguese). Additionally, token filters are provided for normalizing the width of and forming bigrams from tokens based on cjk (Chinese, Japanese, and Korean).

Pre-Constructed Analyzers

A number of pre-constructed analyzers are available, and can be selected from the Couchbase Web Console. For examples of selection, see Creating Indexes. The basic analyzers are as follows. See the sections above for details on the referenced analyzer-components.

  • keyword: Creates a single token representing the entire input, which is otherwise unchanged. This forces exact matches, and preserves characters such as spaces.

  • simple: Analysis by means of the Unicode tokenizer and the to_lower token filter.

  • standard: Analysis by means of the Unicode tokenizer, the to_lower token filter, and the stop token filter.

  • web: Analysis by means of the Web tokenizer and the to_lower token filter.

Additionally, a range of analyzers is provided for the specific support of certain languages. Each analyzer is named after the supported language: fr (French), it (Italian), ar (Arabic), ckb (Sorani Kurdish), ckj (Chinese, Japanese, and Korean), fa (Persian), hi (Hindi), and pt (Portuguese).