In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and Hugging Face embedding models for AI-powered semantic understanding. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.
This tutorial uses Couchbase's Search Vector Index for vector similarity search. For more information on vector indexes, see the Couchbase Vector Index Documentation.
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using Hyperscale or Composite Vector Indexes, please take a look at this tutorial.
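Under the hood, "understanding meaning" comes down to comparing embedding vectors. The toy sketch below (illustrative only, not part of the tutorial code) ranks two hypothetical 3-dimensional "embeddings" against a query using cosine similarity, a measure commonly used to score vector matches; real embedding models emit vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes; 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up numbers for illustration).
query = [0.9, 0.1, 0.3]
doc_related = [0.8, 0.2, 0.4]     # points in nearly the same direction
doc_unrelated = [0.1, 0.9, -0.5]  # points in a very different direction

print(cosine_similarity(query, doc_related))    # close to 1.0
print(cosine_similarity(query, doc_unrelated))  # much lower
```

A semantic search engine does essentially this at scale: documents whose embeddings score highest against the query embedding are returned first.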
This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. You can access the original notebook here.
You can either download the notebook file and run it on Google Colab or run it on your system by setting up the Python environment.
To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.
To learn more, please follow the instructions in the Capella documentation.
When running Couchbase using Capella, the following prerequisites need to be met.
!pip --quiet install couchbase==4.4.0 transformers==4.56.1 sentence_transformers==5.1.0 langchain-community==0.3.29 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets

from pathlib import Path
from datetime import timedelta
from transformers import pipeline, AutoModel, AutoTokenizer
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import (ClusterOptions, ClusterTimeoutOptions,
QueryOptions)
import couchbase.search as search
from couchbase.options import SearchOptions
from couchbase.vector_search import VectorQuery, VectorSearch
import uuid
import os
from dotenv import load_dotenv
import getpass
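The configuration values read below with os.getenv can optionally be kept in a .env file next to the notebook instead of being typed interactively. A sample file with placeholder values (the variable names match the configuration code in this tutorial) might look like:

```
CB_CLUSTER_URL=couchbases://your-cluster.cloud.couchbase.com
CB_USERNAME=your-username
CB_PASSWORD=your-password
CB_BUCKET=huggingface
CB_SCOPE=_default
CB_COLLECTION=huggingface
```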
In order to run this tutorial, you will need access to a Couchbase Cluster with Search Service enabled either through Couchbase Capella or by running it locally, and have credentials to access a collection on that cluster:
# Load environment variables
load_dotenv("./.env")
# Configuration
couchbase_cluster_url = os.getenv('CB_CLUSTER_URL') or input("Couchbase Cluster URL:")
couchbase_username = os.getenv('CB_USERNAME') or input("Couchbase Username:")
couchbase_password = os.getenv('CB_PASSWORD') or getpass.getpass("Couchbase password:")
couchbase_bucket = os.getenv('CB_BUCKET') or input("Couchbase Bucket:")
couchbase_scope = os.getenv('CB_SCOPE') or input("Couchbase Scope:")
couchbase_collection = os.getenv('CB_COLLECTION') or input("Couchbase Collection:")

In this section, we first need to create a PasswordAuthenticator object that holds our Couchbase credentials:
auth = PasswordAuthenticator(
    couchbase_username,
    couchbase_password
)

Then, we use this object to connect to the Couchbase Cluster and select the bucket, scope, and collection specified above:
print("Connecting to cluster")
cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))
bucket = cluster.bucket(couchbase_bucket)
scope = bucket.scope(couchbase_scope)
collection = scope.collection(couchbase_collection)
print("Connected to the cluster")

Connecting to cluster
Connected to the cluster

In order to store Hugging Face-generated embeddings in a Couchbase Cluster, a Search Vector Index needs to be created first. A sample index definition that works with this tutorial is included in the file huggingface_index.json, located in the same folder as this tutorial.
The definition can be used to create a Search Vector Index using the Couchbase Server web console. For more information on vector indexes, please read Create a Vector Search Index with the Server Web Console.
Please note that the index is configured for documents from bucket huggingface, scope _default and collection huggingface. You will need to edit the source and document type name in the index definition file if your collection, scope, or bucket names are different.
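The edit can also be done programmatically before importing the definition. The sketch below is a hypothetical helper: the skeleton dict only illustrates the typical layout of a Search index definition, where sourceName holds the bucket name and the type mapping key is "<scope>.<collection>" — verify these fields against the actual huggingface_index.json before relying on it.

```python
import json

# Illustrative skeleton of a Search index definition; the real
# huggingface_index.json shipped with the tutorial contains more fields.
index_def = {
    "name": "vector_test",
    "type": "fulltext-index",
    "sourceName": "huggingface",  # bucket the index reads from
    "params": {
        "mapping": {
            "types": {
                # "<scope>.<collection>" type mapping
                "_default.huggingface": {"enabled": True, "dynamic": True}
            }
        }
    },
}

def retarget(definition, bucket, scope, collection):
    # Return a copy of the definition pointed at a different
    # bucket, scope, and collection.
    definition = json.loads(json.dumps(definition))  # deep copy
    definition["sourceName"] = bucket
    types = definition["params"]["mapping"]["types"]
    old_key = next(iter(types))
    types[scope + "." + collection] = types.pop(old_key)
    return definition

patched = retarget(index_def, "my_bucket", "my_scope", "my_collection")
print(patched["sourceName"])                        # my_bucket
print(list(patched["params"]["mapping"]["types"]))  # ['my_scope.my_collection']
```

The patched dict can then be saved back to JSON and imported through the web console as before.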
Here, our code verifies that the index exists and throws an exception if it cannot be found:
search_index_name = couchbase_bucket + "._default.vector_test"
search_index = cluster.search_indexes().get_index(search_index_name)
print("Found index: " + search_index_name)

Found index: huggingface._default.vector_test

Next, we initialize the Hugging Face embedding model:

embedding_model = HuggingFaceEmbeddings()
print("Initialized successfully")

Initialized successfully

After initializing the embedding model, it can be used to generate vector embeddings for user input or a predefined set of phrases. Here, we generate embeddings for the strings contained in the array:
texts = [
"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.",
"It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.",
input("Enter custom embedding text:")
]
embeddings = []
for i in range(0, len(texts)):
    embeddings.append(embedding_model.embed_query(texts[i]))

Generated embeddings are then stored as vector fields inside documents that can also carry additional information about the vector, including the original text. The documents are then upserted into the Couchbase cluster:
for i in range(0, len(texts)):
    doc = {
        "id": str(uuid.uuid4()),
        "text": texts[i],
        "vector": embeddings[i],
    }
    collection.upsert(doc["id"], doc)

After the documents are upserted, their vector fields are added to the previously created Search Vector Index. New embeddings can then be used to perform a similarity search over the stored documents:
def search_similar(text):
    print("Vector similarity search for phrase: \"" + text + "\"")
    search_embedding = embedding_model.embed_query(text)
    search_req = search.SearchRequest.create(search.MatchNoneQuery()).with_vector_search(
        VectorSearch.from_vector_query(
            VectorQuery(
                "vector", search_embedding, num_candidates=1
            )
        )
    )
    result = scope.search(
        "vector_test",
        search_req,
        SearchOptions(
            limit=13,
            fields=["vector", "id", "text"]
        )
    )
    for row in result.rows():
        print("Found answer: " + row.id + "; score: " + str(row.score))
        doc = collection.get(row.id)
        print("Answer text: " + doc.value["text"])
search_similar("name a multipurpose database with distributed capability")
print("------")
search_similar(input("Enter custom search phrase:"))

Vector similarity search for phrase: "name a multipurpose database with distributed capability"
Found answer: 3993ec2e-c184-4d7f-8fc3-55961afe264c; score: 0.9256534967756203
Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.
------
Vector similarity search for phrase: "What is the data in the sample text?"
Found answer: a7748fac-b41f-4846-bebc-d89bdcd645e3; score: 1.0016003788325407
Answer text: this is a sample text with the data "Qwerty"