In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and Hugging Face embedding models for AI-powered semantic understanding. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval.
This tutorial uses Couchbase's Search Vector Index for vector similarity search. For more information on vector indexes, see the Couchbase Vector Index Documentation.
This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using Hyperscale or Composite Vector Indexes, please take a look at this tutorial.
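Under the hood, "understanding meaning" comes down to comparing embedding vectors. The toy sketch below (illustrative only, not part of the tutorial code) ranks two hypothetical 3-dimensional "embeddings" against a query using cosine similarity, a measure commonly used to score vector matches; real embedding models emit vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes; 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up numbers for illustration).
query = [0.9, 0.1, 0.3]
doc_related = [0.8, 0.2, 0.4]     # points in nearly the same direction
doc_unrelated = [0.1, 0.9, -0.5]  # points in a very different direction

print(cosine_similarity(query, doc_related))    # close to 1.0
print(cosine_similarity(query, doc_unrelated))  # much lower
```

A semantic search engine does essentially this at scale: documents whose embeddings score highest against the query embedding are returned first.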
This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. You can access the original notebook here.
You can either download the notebook file and run it on Google Colab or run it on your system by setting up the Python environment.
To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. This account provides you with an environment where you can explore and learn about Capella with no time constraint.
To learn more, please follow the instructions in the Capella documentation.
When running Couchbase using Capella, the following prerequisites need to be met.
!pip --quiet install couchbase==4.4.0 transformers==4.56.1 sentence_transformers==5.1.0 langchain-community==0.3.29 langchain_huggingface==0.3.1 python-dotenv==1.1.1 ipywidgets

from pathlib import Path
from datetime import timedelta
from transformers import pipeline, AutoModel, AutoTokenizer
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import (ClusterOptions, ClusterTimeoutOptions,
QueryOptions)
import couchbase.search as search
from couchbase.options import SearchOptions
from couchbase.vector_search import VectorQuery, VectorSearch
import uuid
import os
from dotenv import load_dotenv
import getpass
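The configuration values read below with os.getenv can optionally be kept in a .env file next to the notebook instead of being typed interactively. A sample file with placeholder values (the variable names match the configuration code in this tutorial) might look like:

```
CB_CLUSTER_URL=couchbases://your-cluster.cloud.couchbase.com
CB_USERNAME=your-username
CB_PASSWORD=your-password
CB_BUCKET=huggingface
CB_SCOPE=_default
CB_COLLECTION=huggingface
```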
In order to run this tutorial, you will need access to a Couchbase Cluster with Search Service enabled either through Couchbase Capella or by running it locally, and have credentials to access a collection on that cluster:
# Load environment variables
load_dotenv("./.env")
# Configuration
couchbase_cluster_url = os.getenv('CB_CLUSTER_URL') or input("Couchbase Cluster URL:")
couchbase_username = os.getenv('CB_USERNAME') or input("Couchbase Username:")
couchbase_password = os.getenv('CB_PASSWORD') or getpass.getpass("Couchbase password:")
couchbase_bucket = os.getenv('CB_BUCKET') or input("Couchbase Bucket:")
couchbase_scope = os.getenv('CB_SCOPE') or input("Couchbase Scope:")
couchbase_collection = os.getenv('CB_COLLECTION') or input("Couchbase Collection:")

In this section, we first need to create a PasswordAuthenticator object that holds our Couchbase credentials:
auth = PasswordAuthenticator(
    couchbase_username,
    couchbase_password
)

Then, we use this object to connect to the Couchbase Cluster and select the bucket, scope, and collection specified above:
print("Connecting to cluster")
cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))
bucket = cluster.bucket(couchbase_bucket)
scope = bucket.scope(couchbase_scope)
collection = scope.collection(couchbase_collection)
print("Connected to the cluster")

Connecting to cluster
Connected to the cluster

In order to store Hugging Face-generated embeddings in a Couchbase Cluster, a Search Vector Index needs to be created first. A sample index definition that works with this tutorial is included in the file huggingface_index.json, located in the same folder as this tutorial.
The definition can be used to create a Search Vector Index using the Couchbase Server web console. For more information on vector indexes, please read Create a Vector Search Index with the Server Web Console.
Please note that the index is configured for documents from bucket huggingface, scope _default and collection huggingface. You will need to edit the source and document type name in the index definition file if your collection, scope, or bucket names are different.
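The edit can also be done programmatically before importing the definition. The sketch below is a hypothetical helper: the skeleton dict only illustrates the typical layout of a Search index definition, where sourceName holds the bucket name and the type mapping key is "<scope>.<collection>" — verify these fields against the actual huggingface_index.json before relying on it.

```python
import json

# Illustrative skeleton of a Search index definition; the real
# huggingface_index.json shipped with the tutorial contains more fields.
index_def = {
    "name": "vector_test",
    "type": "fulltext-index",
    "sourceName": "huggingface",  # bucket the index reads from
    "params": {
        "mapping": {
            "types": {
                # "<scope>.<collection>" type mapping
                "_default.huggingface": {"enabled": True, "dynamic": True}
            }
        }
    },
}

def retarget(definition, bucket, scope, collection):
    # Return a copy of the definition pointed at a different
    # bucket, scope, and collection.
    definition = json.loads(json.dumps(definition))  # deep copy
    definition["sourceName"] = bucket
    types = definition["params"]["mapping"]["types"]
    old_key = next(iter(types))
    types[scope + "." + collection] = types.pop(old_key)
    return definition

patched = retarget(index_def, "my_bucket", "my_scope", "my_collection")
print(patched["sourceName"])                        # my_bucket
print(list(patched["params"]["mapping"]["types"]))  # ['my_scope.my_collection']
```

The patched dict can then be saved back to JSON and imported through the web console as before.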
Here, our code verifies that the index exists and throws an exception if it cannot be found:
search_index_name = couchbase_bucket + "._default.vector_test"
search_index = cluster.search_indexes().get_index(search_index_name)
print("Found index: " + search_index_name)

Found index: huggingface._default.vector_test

Next, we initialize the Hugging Face embedding model:

embedding_model = HuggingFaceEmbeddings()
print("Initialized successfully")

Initialized successfully

After initializing the embedding model, it can be used to generate vector embeddings for user input or a predefined set of phrases. Here, we generate embeddings for the strings contained in the array:
texts = [
"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.",
"It’s used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.",
input("Enter custom embedding text:")
]
embeddings = []
for i in range(0, len(texts)):
    embeddings.append(embedding_model.embed_query(texts[i]))

Generated embeddings are then stored as vector fields inside documents that can also carry additional information about the vector, including the original text. The documents are then upserted into the Couchbase cluster:
for i in range(0, len(texts)):
    doc = {
        "id": str(uuid.uuid4()),
        "text": texts[i],
        "vector": embeddings[i],
    }
    collection.upsert(doc["id"], doc)

After the documents are upserted, their vector fields are added to the previously created Search Vector Index. New embeddings can then be used to perform a similarity search over the stored documents:
def search_similar(text):
    print("Vector similarity search for phrase: \"" + text + "\"")
    search_embedding = embedding_model.embed_query(text)
    search_req = search.SearchRequest.create(search.MatchNoneQuery()).with_vector_search(
        VectorSearch.from_vector_query(
            VectorQuery(
                "vector", search_embedding, num_candidates=1
            )
        )
    )
    result = scope.search(
        "vector_test",
        search_req,
        SearchOptions(
            limit=13,
            fields=["vector", "id", "text"]
        )
    )
    for row in result.rows():
        print("Found answer: " + row.id + "; score: " + str(row.score))
        doc = collection.get(row.id)
        print("Answer text: " + doc.value["text"])
search_similar("name a multipurpose database with distributed capability")
print("------")
search_similar(input("Enter custom search phrase:"))

Vector similarity search for phrase: "name a multipurpose database with distributed capability"
Found answer: 3993ec2e-c184-4d7f-8fc3-55961afe264c; score: 0.9256534967756203
Answer text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON’s versatility, with a foundation that is extremely fast and scalable.
------
Vector similarity search for phrase: "What is the data in the sample text?"
Found answer: a7748fac-b41f-4846-bebc-d89bdcd645e3; score: 1.0016003788325407
Answer text: this is a sample text with the data "Qwerty"