RAG with Capella Model Services, LangChain and Couchbase Hyperscale & Composite Vector Indexes

Learn how to build a semantic search engine using Capella Model Services and Couchbase Hyperscale or Composite Vector Indexes.
You will understand how to perform Retrieval-Augmented Generation (RAG) using LangChain and Capella Model Services with Query Service.

Introduction

In this guide, we walk you through building a Retrieval-Augmented Generation (RAG) application that uses Couchbase Capella as the database and the Mistral-7B-Instruct-v0.3 model, served by Capella Model Services, as the large language model (LLM). Embeddings are generated with the NVIDIA NeMo Retriever Llama3.2 model, also via Capella Model Services.

This notebook demonstrates how to build a RAG system using:

The BBC News dataset containing news articles
Couchbase Capella as the vector store with Hyperscale and Composite Vector Indexes for high-performance vector search
Capella Model Services for embeddings and text generation
LangChain framework for the RAG pipeline

We leverage Couchbase's Query Service to create and manage Hyperscale Vector Indexes, enabling efficient semantic search capabilities that can scale to billions of vectors. Hyperscale and Composite indexes provide superior performance for large-scale vector search operations compared to traditional approaches. This tutorial can also be recreated using the Search Service with Search Vector Index.

Key Features:

High-performance vector search using Hyperscale/Composite indexes
Performance benchmarks showing optimization benefits
Complete RAG workflow with caching optimization

Requirements: Couchbase Server 8.0+ or Capella with Query Service enabled.

Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial will equip you with the knowledge to create a fully functional RAG system using Capella Model Services and LangChain.

How to run this tutorial

This tutorial is available as a Jupyter Notebook (.ipynb file) that you can run interactively. You can access the original notebook here

You can either download the notebook file and run it on Google Colab or run it on your system by setting up the Python environment.

Before you start

Create and Deploy Your Operational cluster on Capella

To get started with Couchbase Capella, create an account and use it to deploy an operational cluster.

To learn more, please follow the instructions.

Couchbase Capella Configuration

When running Couchbase using Capella, the following prerequisites need to be met:

Have a multi-node Capella cluster running the Data, Query, and Index services (Query Service is required for Hyperscale/Composite indexes).
Create the database credentials to access the bucket (Read and Write) used in the application.
Allow access to the Cluster from the IP on which the application is running.

Deploy Models

In order to create the RAG application, we need an embedding model to ingest the documents for Vector Search and a large language model (LLM) for generating the responses based on the context.

Capella Model Services lets you deploy both the embedding model and the LLM in the same VPC as your database. Several embedding models and LLMs are available, along with value-added features for the models.

Create the models using the Capella Model Services interface. While creating the model, it is possible to cache the responses (both standard and semantic cache) and apply guardrails to the LLM responses.

For more details, please refer to the documentation. These models are compatible with the LangChain OpenAI integration.

After the models are deployed, create API keys for them and allow access for each key from the IP address on which the tutorial is run. For more details, please refer to the documentation on generating the API keys.

Installing Necessary Libraries

To build our RAG system, we need a set of libraries that handle everything from connecting to the database to calling AI models. Each library has a specific role: the Couchbase libraries manage database operations, LangChain provides the RAG pipeline and model integrations, and the OpenAI-compatible client in langchain-openai generates embeddings and calls the LLM in Capella Model Services. Installing them up front ensures the environment is ready for the steps that follow.

%pip install --quiet datasets==4.4.1 langchain-couchbase==1.0.0 langchain-openai==1.1.0

Note: you may need to restart the kernel to use updated packages.

Importing Necessary Libraries

The notebook starts by importing the libraries required for logging, time tracking, environment configuration, Couchbase connections, embedding generation, LLM calls, and dataset loading.

Note that we import CouchbaseQueryVectorStore along with DistanceStrategy and IndexType for creating Hyperscale/Composite Vector Indexes.

import logging
import os
import time

from datetime import timedelta

from dotenv import load_dotenv

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.exceptions import CouchbaseException
from couchbase.options import ClusterOptions

from datasets import load_dataset

from langchain_core.documents import Document
from langchain_core.globals import set_llm_cache
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_couchbase.cache import CouchbaseCache
from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore
from langchain_couchbase.vectorstores import DistanceStrategy, IndexType
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from tqdm import tqdm

Loading Environment Variables

This notebook loads configuration from a .env file in the same directory. Create a .env file with the following variables:

Required (no defaults):

CB_CONNECTION_STRING - Your Couchbase connection string
CB_USERNAME - Couchbase database username
CB_PASSWORD - Couchbase database password
CB_BUCKET_NAME - Name of your Couchbase bucket
CAPELLA_MODEL_SERVICES_ENDPOINT - Capella Model Services endpoint (include /v1 suffix)
LLM_API_KEY - API key for the LLM model
EMBEDDING_API_KEY - API key for the embedding model

Optional (with defaults):

SCOPE_NAME - Scope name (default: _default)
COLLECTION_NAME - Collection name (default: langchain_docs)
CACHE_COLLECTION - Cache collection name (default: cache)
LLM_MODEL_NAME - LLM model name (default: mistralai/mistral-7b-instruct-v0.3)
EMBEDDING_MODEL_NAME - Embedding model name (default: nvidia/llama-3.2-nv-embedqa-1b-v2)

Note: Append the /v1 suffix to the Capella Model Services endpoint if the UI does not show it.

If the models are running in the same region, either API key can be used interchangeably. See generating API keys.
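Since a missing /v1 suffix is an easy mistake to make, the endpoint can be normalized before it is used. The following is a minimal sketch, not part of the original notebook; the helper name ensure_v1_suffix and the example URL are hypothetical and chosen only for illustration.

```python
def ensure_v1_suffix(endpoint: str) -> str:
    """Return the endpoint with exactly one trailing '/v1' path segment."""
    # Drop surrounding whitespace and any trailing slashes first
    trimmed = endpoint.strip().rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return trimmed + "/v1"


# Hypothetical endpoint, used only to illustrate the helper
print(ensure_v1_suffix("https://example.invalid/"))
```

Under this assumption, the value read from CAPELLA_MODEL_SERVICES_ENDPOINT could be passed through the helper once, right after the environment variables are loaded.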

# Load environment variables from .env file
load_dotenv()

# Couchbase connection settings (no defaults for sensitive values)
CB_CONNECTION_STRING = os.getenv("CB_CONNECTION_STRING")
CB_USERNAME = os.getenv("CB_USERNAME")
CB_PASSWORD = os.getenv("CB_PASSWORD")
CB_BUCKET_NAME = os.getenv("CB_BUCKET_NAME")

# Collection settings (with sensible defaults)
SCOPE_NAME = os.getenv("SCOPE_NAME", "_default")
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "langchain_docs")
CACHE_COLLECTION = os.getenv("CACHE_COLLECTION", "cache")

# Capella Model Services settings
CAPELLA_MODEL_SERVICES_ENDPOINT = os.getenv("CAPELLA_MODEL_SERVICES_ENDPOINT")

# Model names (with defaults matching tutorial recommendations)
LLM_MODEL_NAME = os.getenv("LLM_MODEL_NAME", "mistralai/mistral-7b-instruct-v0.3")
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "nvidia/llama-3.2-nv-embedqa-1b-v2")

# API keys (no defaults for sensitive values)
LLM_API_KEY = os.getenv("LLM_API_KEY")
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY")

# Validate required environment variables
if not all([
    CB_CONNECTION_STRING,
    CB_USERNAME,
    CB_PASSWORD,
    CB_BUCKET_NAME,
    CAPELLA_MODEL_SERVICES_ENDPOINT,
    LLM_API_KEY,
    EMBEDDING_API_KEY,
]):
    raise ValueError(
        "Missing required environment variables. Please ensure your .env file contains:\n"
        "- CB_CONNECTION_STRING\n"
        "- CB_USERNAME\n"
        "- CB_PASSWORD\n"
        "- CB_BUCKET_NAME\n"
        "- CAPELLA_MODEL_SERVICES_ENDPOINT\n"
        "- LLM_API_KEY\n"
        "- EMBEDDING_API_KEY"
    )

print("Environment variables loaded successfully")

Environment variables loaded successfully
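The validation above reports every required variable whenever any one of them is missing. As a possible refinement, the check can name only the variables that are actually unset. This is a minimal sketch; find_missing is a hypothetical helper, and the demo dictionary stands in for the real environment.

```python
import os

REQUIRED_VARS = [
    "CB_CONNECTION_STRING",
    "CB_USERNAME",
    "CB_PASSWORD",
    "CB_BUCKET_NAME",
    "CAPELLA_MODEL_SERVICES_ENDPOINT",
    "LLM_API_KEY",
    "EMBEDDING_API_KEY",
]


def find_missing(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demo with a stand-in environment (values are placeholders, not real settings)
demo_env = {"CB_USERNAME": "user", "CB_PASSWORD": ""}
print(find_missing(["CB_USERNAME", "CB_PASSWORD", "CB_BUCKET_NAME"], demo_env))
```

In the notebook, calling find_missing(REQUIRED_VARS) after load_dotenv() and raising a ValueError that lists the returned names would point directly at the setting to fix.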

Connecting to the Couchbase Cluster

Couchbase will serve as our primary data store, handling all the storage and retrieval operations required for our RAG system. By establishing this connection, we enable our application to interact with the database, allowing us to perform operations such as storing embeddings, querying data, and managing collections.

try:
    auth = PasswordAuthenticator(CB_USERNAME, CB_PASSWORD)
    options = ClusterOptions(auth)
    cluster = Cluster(CB_CONNECTION_STRING, options)
    cluster.wait_until_ready(timedelta(seconds=5))
    print("Successfully connected to Couchbase")
except Exception as e:
    raise ConnectionError(f"Failed to connect to Couchbase: {str(e)}") from e

Successfully connected to Couchbase

Setting Up Collections in Couchbase

In Couchbase, data is organized in buckets, which can be further divided into scopes and collections. Think of a collection as a table in a traditional SQL database. Before we can store any data, we need to ensure that our collections exist. If they don't, we must create them. This step is important because it prepares the database to handle the specific types of data our application will process. By setting up collections, we define the structure of our data storage, which is essential for efficient data retrieval and management.

Moreover, collections let us isolate different types of data within the same bucket, which keeps related data together and makes it easier to manage and query, particularly for large datasets. Note that the calls below pass flush_collection=True, which deletes any existing documents in the collection. To keep existing documents, pass flush_collection=False instead.

def setup_collection(cluster, bucket_name, scope_name, collection_name, flush_collection=False):
    try:
        bucket = cluster.bucket(bucket_name)
        bucket_manager = bucket.collections()

        # Check if scope exists, create if it doesn't
        scopes = bucket_manager.get_all_scopes()
        scope_exists = any(scope.name == scope_name for scope in scopes)
        
        if not scope_exists:
            print(f"Scope '{scope_name}' does not exist. Creating it...")
            bucket_manager.create_scope(scope_name)
            print(f"Scope '{scope_name}' created successfully.")
            # Refresh scopes list after creation
            scopes = bucket_manager.get_all_scopes()
        else:
            print(f"Scope '{scope_name}' already exists. Skipping creation.")
        
        # Check if collection exists, create if it doesn't (reuse scopes variable)
        collection_exists = any(
            scope.name == scope_name
            and collection_name in [col.name for col in scope.collections]
            for scope in scopes
        )

        if not collection_exists:
            print(f"Collection '{collection_name}' does not exist. Creating it...")
            bucket_manager.create_collection(scope_name, collection_name)
            print(f"Collection '{collection_name}' created successfully.")
        else:
            print(f"Collection '{collection_name}' already exists. Skipping creation.")

        collection = bucket.scope(scope_name).collection(collection_name)
        time.sleep(2)  # Give the collection time to be ready for queries

        # Create primary index for the collection (required for DELETE operations)
        try:
            index_name = f"`{bucket_name}`.`{scope_name}`.`{collection_name}`"
            query = f"CREATE PRIMARY INDEX IF NOT EXISTS ON {index_name}"
            cluster.query(query).execute()
            print(f"Primary index created/verified for {collection_name}.")
        except Exception as e:
            print(f"Note: Could not create primary index: {str(e)}")

        if flush_collection:
            # Clear all documents in the collection
            try:
                query = f"DELETE FROM `{bucket_name}`.`{scope_name}`.`{collection_name}`"
                cluster.query(query).execute()
                print("All documents cleared from the collection.")
            except Exception as e:
                print(
                    f"Error while clearing documents: {str(e)}. The collection might be empty."
                )

    except Exception as e:
        raise Exception(f"Error setting up collection: {str(e)}") from e


# Setup main collection for vector store
setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, COLLECTION_NAME, flush_collection=True)

# Setup cache collection for LLM response caching
setup_collection(cluster, CB_BUCKET_NAME, SCOPE_NAME, CACHE_COLLECTION, flush_collection=True)

Scope 'shared' already exists. Skipping creation.
Collection 'langchain_query' already exists. Skipping creation.
Primary index created/verified for langchain_query.
All documents cleared from the collection.
Scope 'shared' already exists. Skipping creation.
Collection 'cache' already exists. Skipping creation.
Primary index created/verified for cache.
All documents cleared from the collection.
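The setup function above waits a fixed two seconds for a new collection to become ready. One alternative is to poll a readiness check until it succeeds or a timeout expires. The sketch below shows only the generic polling pattern; wait_until is a hypothetical helper, and the readiness predicate in the demo is a stand-in rather than a real Couchbase call.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call `predicate` repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in predicate that reports "ready" on the third check
attempts = {"n": 0}


def ready_on_third_try():
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_until(ready_on_third_try, timeout=2.0, interval=0.01))
```

In setup_collection, the predicate could plausibly repeat the get_all_scopes() membership check already used there, so the function continues as soon as the collection is visible.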

Load the BBC News Dataset

To build a RAG engine, we need data to search through. We use the BBC Realtime News dataset, which contains up-to-date BBC news articles grouped by month. Because these articles were published after the LLM was trained, the dataset shows how RAG can supply the model with information it has not seen.

The BBC News dataset's varied content allows us to simulate real-world scenarios where users ask complex questions, so we can test how well our RAG system understands and responds to different types of queries.

try:
    news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split="train")
    print(f"Loaded the BBC News dataset with {len(news_dataset)} rows")
except Exception as e:
    raise ValueError(f"Error loading BBC dataset: {str(e)}")

Loaded the BBC News dataset with 2687 rows

Preview the Data

print(news_dataset[:5])

{'title': ["Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News", 'Lockdown DIY linked to Walleys Quarry gases - BBC News', 'Newscast - What next for the assisted dying bill? - BBC Sounds', "F1: Bernie Ecclestone to sell car collection worth 'hundreds of millions' - BBC Sport", 'British man Tyler Kerry from Basildon dies on holiday in Turkey - BBC News'], 'published_date': ['2024-12-01', '2024-12-01', '2024-12-01', '2024-12-01', '2024-12-01'], 'authors': ['https://www.facebook.com/bbcnews', 'https://www.facebook.com/bbcnews', None, 'https://www.facebook.com/BBCSport/', 'https://www.facebook.com/bbcnews'], 'description': ["Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.", 'An academic says an increase in plasterboard sent to landfill could be behind a spike in smells.', 'And rebel forces in Syria have taken control of Aleppo', 'Former Formula 1 boss Bernie Ecclestone is to sell his collection of race cars driven by motorsport legends including Michael Schumacher, Niki Lauda and Nelson Piquet.', 'Tyler Kerry was "a young man full of personality, kindness and compassion", his uncle says.'], 'section': ['Asia', 'Stoke & Staffordshire', None, 'Sport', 'Essex'], 'content': ['Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery\n\nImran Khan\'s wife, Bushra Bibi, encouraged protesters into the heart of Pakistan\'s capital, Islamabad\n\nA charred lorry, empty tear gas shells and posters of former Pakistan Prime Minister Imran Khan - it was all that remained of a massive protest led by Khan’s wife, Bushra Bibi, that had sent the entire capital into lockdown. Just a day earlier, faith healer Bibi - wrapped in a white shawl, her face covered by a white veil - stood atop a shipping container on the edge of the city as thousands of her husband’s devoted followers waved flags and chanted slogans beneath her. 
It was the latest protest to flare since Khan, the 72-year-old cricketing icon-turned-politician, was jailed more than a year ago after falling foul of the country\'s influential military which helped catapult him to power. “My children and my brothers! You have to stand with me,” Bibi cried on Tuesday afternoon, her voice cutting through the deafening roar of the crowd. “But even if you don’t,” she continued, “I will still stand firm. “This is not just about my husband. It is about this country and its leader.” It was, noted some watchers of Pakistani politics, her political debut. But as the sun rose on Wednesday morning, there was no sign of Bibi, nor the thousands of protesters who had marched through the country to the heart of the capital, demanding the release of their jailed leader. While other PMs have fallen out with Pakistan\'s military in the past, Khan\'s refusal to stay quiet behind bars is presenting an extraordinary challenge - escalating the standoff and leaving the country deeply divided. Exactly what happened to the so-called “final march”, and Bibi, when the city went dark is still unclear. All eyewitnesses like Samia* can say for certain is that the lights went out suddenly, plunging D Chowk, the square where they had gathered, into blackness.\n\nWithin a day of arriving, the protesters had scattered - leaving behind Bibi\'s burnt-out vehicle\n\nAs loud screams and clouds of tear gas blanketed the square, Samia describes holding her husband on the pavement, bloodied from a gun shot to his shoulder. "Everyone was running for their lives," she later told BBC Urdu from a hospital in Islamabad, adding it was "like doomsday or a war". "His blood was on my hands and the screams were unending.” But how did the tide turn so suddenly and decisively? Just hours earlier, protesters finally reached D Chowk late afternoon on Tuesday. They had overcome days of tear gas shelling and a maze of barricaded roads to get to the city centre. 
Many of them were supporters and workers of the Pakistan Tehreek-e-Insaf (PTI), the party led by Khan. He had called for the march from his jail cell, where he has been for more than a year on charges he says are politically motivated. Now Bibi - his third wife, a woman who had been largely shrouded in mystery and out of public view since their unexpected wedding in 2018 - was leading the charge. “We won’t go back until we have Khan with us,” she declared as the march reached D Chowk, deep in the heart of Islamabad’s government district.\n\nThousands had marched for days to reach Islamabad, demanding former Prime Minister Imran Khan be released from jail\n\nInsiders say even the choice of destination - a place where her husband had once led a successful sit in - was Bibi’s, made in the face of other party leader’s opposition, and appeals from the government to choose another gathering point. Her being at the forefront may have come as a surprise. Bibi, only recently released from prison herself, is often described as private and apolitical. Little is known about her early life, apart from the fact she was a spiritual guide long before she met Khan. Her teachings, rooted in Sufi traditions, attracted many followers - including Khan himself. Was she making her move into politics - or was her sudden appearance in the thick of it a tactical move to keep Imran Khan’s party afloat while he remains behind bars? For critics, it was a move that clashed with Imran Khan’s oft-stated opposition to dynastic politics. There wasn’t long to mull the possibilities. After the lights went out, witnesses say that police started firing fresh rounds of tear gas at around 21:30 local time (16:30 GMT). The crackdown was in full swing just over an hour later. At some point, amid the chaos, Bushra Bibi left. Videos on social media appeared to show her switching cars and leaving the scene. The BBC couldn’t verify the footage. 
By the time the dust settled, her container had already been set on fire by unknown individuals. By 01:00 authorities said all the protesters had fled.\n\nSecurity was tight in the city, and as night fell, lights were switched off - leaving many in the dark as to what exactly happened next\n\nEyewitnesses have described scenes of chaos, with tear gas fired and police rounding up protesters. One, Amin Khan, said from behind an oxygen mask that he joined the march knowing that, "either I will bring back Imran Khan or I will be shot". The authorities have have denied firing at the protesters. They also said some of the protesters were carrying firearms. The BBC has seen hospital records recording patients with gunshot injuries. However, government spokesperson Attaullah Tarar told the BBC that hospitals had denied receiving or treating gunshot wound victims. He added that "all security personnel deployed on the ground have been forbidden" from having live ammunition during protests. But one doctor told BBC Urdu that he had never done so many surgeries for gunshot wounds in a single night. "Some of the injured came in such critical condition that we had to start surgery right away instead of waiting for anaesthesia," he said. While there has been no official toll released, the BBC has confirmed with local hospitals that at least five people have died. Police say at least 500 protesters were arrested that night and are being held in police stations. The PTI claims some people are missing. And one person in particular hasn’t been seen in days: Bushra Bibi.\n\nThe next morning, the protesters were gone - leaving behind just wrecked cars and smashed glass\n\nOthers defended her. “It wasn’t her fault,” insisted another. “She was forced to leave by the party leaders.” Political commentators have been more scathing. “Her exit damaged her political career before it even started,” said Mehmal Sarfraz, a journalist and analyst. But was that even what she wanted? 
Khan has previously dismissed any thought his wife might have her own political ambitions - “she only conveys my messages,” he said in a statement attributed to him on his X account.\n\nImran Khan and Bushra Bibi, pictured here arriving at court in May 2023, married in 2018\n\nSpeaking to BBC Urdu, analyst Imtiaz Gul calls her participation “an extraordinary step in extraordinary circumstances". Gul believes Bushra Bibi’s role today is only about “keeping the party and its workers active during Imran Khan’s absence”. It is a feeling echoed by some PTI members, who believe she is “stepping in only because Khan trusts her deeply”. Insiders, though, had often whispered that she was pulling the strings behind the scenes - advising her husband on political appointments and guiding high-stakes decisions during his tenure. A more direct intervention came for the first time earlier this month, when she urged a meeting of PTI leaders to back Khan’s call for a rally. Pakistan’s defence minister Khawaja Asif accused her of “opportunism”, claiming she sees “a future for herself as a political leader”. But Asma Faiz, an associate professor of political science at Lahore University of Management Sciences, suspects the PTI’s leadership may have simply underestimated Bibi. “It was assumed that there was an understanding that she is a non-political person, hence she will not be a threat,” she told the AFP news agency. “However, the events of the last few days have shown a different side of Bushra Bibi.” But it probably doesn’t matter what analysts and politicians think. Many PTI supporters still see her as their connection to Imran Khan. It was clear her presence was enough to electrify the base. “She is the one who truly wants to get him out,” says Asim Ali, a resident of Islamabad. “I trust her. 
Absolutely!”', 'Walleys Quarry was ordered not to accept any new waste as of Friday\n\nA chemist and former senior lecturer in environmental sustainability has said powerful odours from a controversial landfill site may be linked to people doing more DIY during the Covid-19 pandemic. Complaints about Walleys Quarry in Silverdale, Staffordshire – which was ordered to close as of Friday – increased significantly during and after coronavirus lockdowns. Issuing the closure notice, the Environment Agency described management of the site as poor, adding it had exhausted all other enforcement tactics at premises where gases had been noxious and periodically above emission level guidelines - which some campaigners linked to ill health locally. Dr Sharon George, who used to teach at Keele University, said she had been to the site with students and found it to be clean and well-managed, and suggested an increase in plasterboard heading to landfills in 2020 could be behind a spike in stenches.\n\n“One of the materials that is particularly bad for producing odours and awful emissions is plasterboard," she said. “That’s one of the theories behind why Walleys Quarry got worse at that time.” She said the landfill was in a low-lying area, and that some of the gases that came from the site were quite heavy. “They react with water in the atmosphere, so some of the gases you smell can be quite awful and not very good for our health. “It’s why, on some days when it’s colder and muggy and a bit misty, you can smell it more.” Dr George added: “With any landfill, you’re putting things into the ground – and when you put things into the ground, if they can they will start to rot. When they start to rot they’re going to give off gases.” She believed Walleys Quarry’s proximity to people’s homes was another major factor in the amount of complaints that arose from its operation. 
“If you’ve got a gas that people can smell, they’re going to report it much more than perhaps a pollutant that might go unnoticed.”\n\nRebecca Currie said she did not think the site would ever be closed\n\nLocal resident and campaigner Rebecca Currie said the closure notice served to Walleys Quarry was "absolutely amazing". Her son Matthew has had breathing difficulties after being born prematurely with chronic lung disease, and Ms Currie says the site has made his symptoms worse. “I never thought this day was going to happen,” she explained. “We fought and fought for years.” She told BBC Midlands Today: “Our community have suffered. We\'ve got kids who are really poorly, people have moved homes.”\n\nComplaints about Walleys Quarry to Newcastle-under-Lyme Borough Council exceeded 700 in November, the highest amount since 2021 according to council leader Simon Tagg. The Environment Agency (EA), which is responsible for regulating landfill sites, said it had concluded further operation at the site could result in "significant long-term pollution". A spokesperson for Walley\'s Quarry Ltd said the firm rejected the EA\'s accusations of poor management, and would be challenging the closure notice. Dr George said she believed the EA was likely to be erring on the side of caution and public safety, adding safety standards were strict. She said a lack of landfill space in the country overall was one of the broader issues that needed addressing. “As people, we just keep using stuff and then have nowhere to put it, and then when we end up putting it in places like Walleys Quarry that is next to houses, I think that’s where the problems are.”\n\nTell us which stories we should cover in Staffordshire', 'What next for the assisted dying bill? 
What next for the assisted dying bill?', 'Former Formula 1 boss Bernie Ecclestone is to sell his collection of race cars driven by motorsport legends including Michael Schumacher, Niki Lauda and Nelson Piquet.\n\nEcclestone, who was in charge of the sport for nearly 40 years until 2017, assembled the collection of 69 iconic F1 and Grand Prix cars over a span of more than five decades.\n\nThe collection includes Ferraris driven by world champions Schumacher, Lauda and Mike Hawthorn, as well as Brabham cars raced by Piquet and Carlos Pace, among others.\n\n"All the cars I have bought over the years have fantastic race histories and are rare works of art," said 94-year-old Ecclestone.\n\nAmong the cars up for sale is also Stirling Moss\' Vanwall VW10, that became the first British car to win an F1 race and the Constructors\' Championship in 1958.\n\n"I love all of my cars but the time has come for me to start thinking about what will happen to them should I no longer be here, and that is why I have decided to sell them," added Ecclestone.\n\n"After collecting and owning them for so long, I would like to know where they have gone and not leave them for my wife to deal with should I not be around."\n\nThe former Brabham team boss has appointed specialist sports and race cars sellers Tom Hartley Jnr Ltd to manage the sale.\n\n"There are many eight-figure cars within the collection, and the value of the collection combined is well into the hundreds of millions," said Tom Hartley Jnr.\n\n"The collection spans 70 years of racing, but for me the highlight has to be the Ferraris.\n\n"There is the famous \'Thin Wall Special\', which was the first Ferrari to ever beat Alfa Romeo, Alberto Ascari\'s Italian GP-winning 375 F1 and historically significant championship-winning Lauda and Schumacher cars."\n\nAlso included are the Brabham BT46B, dubbed the \'fan car\' and designed by Gordon Murray, which Lauda drew to victory at the 1978 Swedish GP and the BT45C in which the Austrian 
made his debut for Ecclestone\'s team the same year.\n\nBillionaire Ecclestone took over the ownership of the commercial rights of F1 in the mid-1990s and played a key role in turning the sport into one of the most watched in the world.', 'Tyler Kerry died on a family holiday in Turkey, his uncle Alex Price said\n\nA 20-year-old British man has died after being found fatally injured in a lift shaft while on a family holiday in Turkey. Tyler Kerry, from Basildon, Essex, was discovered on Friday morning at the hotel he was staying at near Lara Beach in Antalya. The holidaymaker was described by his family as "a young man full of personality, kindness and compassion with his whole life ahead of him". Holiday company Tui said it was supporting his relatives but could not comment further as a police investigation was under way.\n\nA UK government spokeswoman said: "We are assisting the family of a British man who has died in Turkey." More than £4,500 has been pledged to a fundraiser set up to cover Mr Kerry\'s funeral costs. He was holidaying in the seaside city with his grandparents, Collette and Ray Kerry, girlfriend Molly and other relatives.\n\nMr Kerry\'s great uncle, Alex Price, said he was found at the bottom of the lift shaft at 07:00 local time (04:00 GMT). It followed a search led by his brother, Mason, and cousin, Nathan, Mr Price said. Mr Kerry had been staying on the hotel\'s first floor.\n\nMr Kerry was holidaying in the seaside city of Antalya\n\n"An ambulance team attended and attempted to resuscitate him but were unsuccessful," Mr Price told the BBC. "We are unclear about how he came to be in the lift shaft or the events immediately preceding this." Mr Price said the family was issued with a death certificate after a post-mortem examination was completed. They hoped his body would be repatriated by Tuesday. Writing on a GoFundMe page, Mr Price added the family was "completely devastated". 
He thanked people for their "kindness and consideration" following his nephew\'s death.\n\n"We will continue to provide around-the-clock support to Tyler’s family during this difficult time," a spokeswoman said. "As there is now a police investigation we are unable to comment further."\n\nDo you have a story suggestion for Essex?'], 'link': ['http://www.bbc.co.uk/news/articles/cvg02lvj1e7o', 'http://www.bbc.co.uk/news/articles/c5yg1v16nkpo', 'http://www.bbc.co.uk/sounds/play/p0k81svq', 'http://www.bbc.co.uk/sport/formula1/articles/c1lglrj4gqro', 'http://www.bbc.co.uk/news/articles/c1knkx1z8zgo'], 'top_image': ['https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/9975/live/b22229e0-ad5a-11ef-83bc-1153ed943d1c.jpg', 'https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/0896/live/55209f80-adb2-11ef-8f6c-f1a86bb055ec.jpg', 'https://ichef.bbci.co.uk/images/ic/320x320/p0k81sxn.jpg', 'https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/d593/live/232527a0-af40-11ef-804b-43d0a9651a27.jpg', 'https://ichef.bbci.co.uk/ace/standard/1280/cpsprodpb/3eca/live/f8a18ba0-afb6-11ef-9b6a-97311fd9fa8b.jpg']}

Cleaning up the Data

We will use the content of the news articles for our RAG system.

The dataset contains a few duplicate records. We are removing them to avoid duplicate results in the retrieval stage of our RAG system.

news_articles = news_dataset["content"]
unique_articles = set()
for article in news_articles:
    if article:
        unique_articles.add(article)
unique_news_articles = list(unique_articles)
print(f"We have {len(unique_news_articles)} unique articles in our database.")

We have 1749 unique articles in our database.
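
A set removes duplicates but does not keep the articles in their original order, so the order of `unique_news_articles` can differ between runs. If a stable order matters (for example, to reproduce an ingestion run), an order-preserving variant is a small change. The sketch below is an illustration only; `dedupe_preserving_order` is a hypothetical helper name, not part of the tutorial's code.

```python
def dedupe_preserving_order(items):
    """Drop empty entries and duplicates, keeping the first occurrence of each item."""
    # dict keys are unique and remember insertion order (Python 3.7+),
    # so this keeps the first copy of every article in dataset order.
    return list(dict.fromkeys(item for item in items if item))


sample = ["b", "a", "", "b", None, "c", "a"]
print(dedupe_preserving_order(sample))  # ['b', 'a', 'c']
```

With the tutorial's data this would be called as `dedupe_preserving_order(news_dataset["content"])`.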

Creating Embeddings using Capella Model Service

Embeddings are at the heart of semantic search. They are numerical representations of text that capture the semantic meaning of the words and phrases. Unlike traditional keyword-based search, which looks for exact matches, embeddings allow our search engine to understand the context and nuances of language, enabling it to retrieve documents that are semantically similar to the query, even if they don't contain the exact keywords. By creating embeddings using Capella Model Services, we equip our RAG system with the ability to understand and process natural language in a way that is much closer to how humans understand language. This step transforms our raw text data into a format that the Capella vector store can use to find and rank relevant documents.

We use OpenAIEmbeddings from the LangChain OpenAI provider and point it at Capella Model Services by supplying the model name, the API key, and the endpoint URL. Two extra parameters are specific to Capella Model Services: tiktoken_enabled=False disables the client-side tokenization, and check_embedding_ctx_length=False disables LangChain's own handling of longer inputs. For this tutorial, we are using the nvidia/llama-3.2-nv-embedqa-1b-v2 embedding model. If you are using a different model, change the model name accordingly.

try:
    embeddings = OpenAIEmbeddings(
        openai_api_key=EMBEDDING_API_KEY,
        openai_api_base=CAPELLA_MODEL_SERVICES_ENDPOINT,
        check_embedding_ctx_length=False,
        tiktoken_enabled=False,
        model=EMBEDDING_MODEL_NAME,
    )
    print("Successfully created CapellaAIEmbeddings")
except Exception as e:
    raise ValueError(f"Error creating CapellaAIEmbeddings: {str(e)}")

Successfully created CapellaAIEmbeddings

Testing the Embeddings Model

We can test the embeddings model by generating an embedding for a string using the LangChain OpenAI package. The printed value is the length of the returned vector, that is, the embedding dimension.

print(len(embeddings.embed_query("this is a test sentence")))

Setting Up the Couchbase Query Vector Store

The vector store is set up to store the documents from the dataset. We use CouchbaseQueryVectorStore which enables Hyperscale and Composite Vector Indexes for high-performance vector storage and retrieval using Couchbase's Query Service.

Key differences from Search Vector Store:

Uses Query Service instead of Search Service
Supports Hyperscale indexes that can scale to billions of vectors
Index is created programmatically after data ingestion
No need for a separate JSON index definition file

try:
    vector_store = CouchbaseQueryVectorStore(
        cluster=cluster,
        bucket_name=CB_BUCKET_NAME,
        scope_name=SCOPE_NAME,
        collection_name=COLLECTION_NAME,
        embedding=embeddings,
        distance_metric=DistanceStrategy.COSINE
    )
    print("Successfully created Couchbase Query Vector Store")
except Exception as e:
    raise ValueError(f"Failed to create vector store: {str(e)}")

Successfully created Couchbase Query Vector Store

Saving Data to the Vector Store

With the vector store set up, the next step is to populate it with data. We save the BBC articles dataset to the vector store using batch ingestion for efficiency. Each document will have its embeddings generated for semantic search using LangChain.

Important: With Hyperscale/Composite indexes, data must be ingested BEFORE creating the index. The index creation process analyzes existing vectors to optimize search performance through clustering and quantization.

Some articles may exceed the embedding model's maximum token limit (8192 tokens). These documents are automatically skipped during ingestion. For production use, consider splitting longer documents into chunks.
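
As a rough illustration of the chunking idea, the sketch below splits a long text into overlapping character windows. It is a minimal example under stated assumptions: `chunk_text`, `max_chars` and `overlap` are hypothetical names and values, and character counts are only a proxy for the model's token limit. A production pipeline would more likely use a token-aware splitter.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk could then be passed to `vector_store.add_texts` in place of the full article, at the cost of storing several documents per article.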

# Filter articles that are within token limits (50000 chars as rough estimate)
batch_size = 20  # Smaller batch size for better reliability with remote clusters
filtered_articles = [a for a in unique_news_articles if a and len(a) <= 50000]

print(f"Filtered {len(unique_news_articles) - len(filtered_articles)} articles exceeding length limit")
print(f"Ingesting {len(filtered_articles)} articles in batches of {batch_size}...")

try:
    vector_store.add_texts(
        texts=filtered_articles,
        batch_size=batch_size
    )
    print(f"\nIngestion complete: {len(filtered_articles)} documents ingested successfully")
except Exception as e:
    error_str = str(e).lower()
    if "timeout" in error_str or "exceeds maximum" in error_str or "token" in error_str:
        # Fall back to individual ingestion for problematic batches
        print("Batch ingestion encountered issues, falling back to individual ingestion...")
        skipped_count = 0
        ingested_count = 0
        
        for article in tqdm(filtered_articles, desc="Ingesting articles"):
            try:
                vector_store.add_texts(texts=[article])
                ingested_count += 1
            except Exception as inner_e:
                inner_error = str(inner_e).lower()
                if "timeout" in inner_error or "exceeds maximum" in inner_error or "token" in inner_error:
                    skipped_count += 1
                    continue
                else:
                    print(f"Failed to save document: {str(inner_e)[:100]}...")
                    skipped_count += 1
                    continue
        
        print(f"\nIngestion complete: {ingested_count} documents ingested, {skipped_count} skipped")
    else:
        raise Exception(f"Error during batch ingestion: {str(e)}")

Filtered 1 articles exceeding length limit
Ingesting 1748 articles in batches of 20...

Ingestion complete: 1748 documents ingested successfully

Vector Search Performance Testing

Now let's demonstrate the performance benefits of Hyperscale Vector Index by testing pure vector search performance. We'll compare:

Baseline Performance: Vector search without Hyperscale index optimization
Hyperscale-Optimized Performance: Same search with Hyperscale index

Vector Index Types Overview

Before we start testing, let's understand the index types available:

Hyperscale Vector Indexes:

Best for: Pure vector searches - content discovery, recommendations, semantic search
Performance: High performance with low memory footprint, designed to scale to billions of vectors
Optimization: Optimized for concurrent operations, supports simultaneous searches and inserts
Use when: You primarily perform vector-only queries without complex scalar filtering

Composite Vector Indexes:

Best for: Filtered vector searches that combine vector search with scalar value filtering
Performance: Efficient pre-filtering where scalar attributes reduce the vector comparison scope
Use when: Your queries combine vector similarity with scalar filters that eliminate large portions of data

For this tutorial, we'll create and test a Hyperscale index. See the alternative section below for Composite index configuration.

For more information, see Couchbase Hyperscale and Composite Vector Index Documentation.

Vector Search Test Function

def test_vector_search_performance(vector_store, query, label="Vector Search"):
    """Test pure vector search performance and return timing metrics"""
    print(f"\n[{label}] Testing vector search performance")
    print(f"[{label}] Query: '{query}'")
    
    start_time = time.time()
    
    try:
        results = vector_store.similarity_search_with_score(query, k=3)
        end_time = time.time()
        search_time = end_time - start_time
        
        print(f"[{label}] Vector search completed in {search_time:.4f} seconds")
        print(f"[{label}] Found {len(results)} documents")
        
        if results:
            doc, distance = results[0]
            print(f"[{label}] Top result distance: {distance:.6f} (lower = more similar)")
            preview = doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content
            print(f"[{label}] Top result preview: {preview}")
        
        return search_time
    except Exception as e:
        print(f"[{label}] Vector search failed: {str(e)}")
        return None
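
The function above times a single call, so the result can include one-off costs such as connection setup or a cold cache. A steadier comparison runs the same call several times and reports the median. The sketch below is one way to do that; `time_call`, `repeats` and `warmup` are hypothetical names introduced for illustration.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Call fn() repeatedly and return (median_seconds, all_samples)."""
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


median_s, samples = time_call(lambda: sum(range(10_000)), repeats=5)
print(f"median: {median_s:.6f} s over {len(samples)} runs")
```

In this tutorial the callable would be something like `lambda: vector_store.similarity_search_with_score(test_query, k=3)`, run once before and once after the index is created.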

Test 1: Baseline Performance (No Hyperscale Index)

test_query = "What was Pep Guardiola's reaction to Manchester City's current form?"
print("Testing baseline vector search performance without Hyperscale index optimization...")
baseline_time = test_vector_search_performance(vector_store, test_query, "Baseline Search")
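# test_vector_search_performance returns None on failure, so guard the formatting
if baseline_time is not None:
    print(f"\nBaseline vector search time (without Hyperscale index): {baseline_time:.4f} seconds\n")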

Testing baseline vector search performance without Hyperscale index optimization...

[Baseline Search] Testing vector search performance
[Baseline Search] Query: 'What was Pep Guardiola's reaction to Manchester City's current form?'
[Baseline Search] Vector search completed in 5.7207 seconds
[Baseline Search] Found 3 documents
[Baseline Search] Top result distance: 0.486858 (lower = more similar)
[Baseline Search] Top result preview: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To pla...

Baseline vector search time (without Hyperscale index): 5.7207 seconds

Creating Hyperscale Vector Index

Now let's create a Hyperscale vector index to enable high-performance vector searches. The index creation is done programmatically through the vector store.

Index Configuration:

index_type: IndexType.HYPERSCALE for pure vector search, IndexType.COMPOSITE for filtered searches
index_name: Unique name for the index
index_description: Controls centroids and quantization settings

Index Configuration Details

The index_description parameter controls vector optimization through centroids and quantization:

Format: 'IVF[<centroids>],{PQ|SQ}<settings>'

IVF (Inverted File Index) - Centroids

Auto-selection: IVF,SQ8 (Couchbase selects optimal count)
Manual: IVF1000,SQ8 (1000 centroids)

Quantization Options

SQ (Scalar): SQ4, SQ6, SQ8 - simpler, good for general use
PQ (Product): PQ32x8 - better precision at similar compression

Common Examples

IVF,SQ8 - Recommended default
IVF1000,SQ6 - Higher compression
IVF,PQ32x8 - High precision

For more details, see Quantization & Centroid Settings.
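
To make the format concrete, here is a small client-side sanity check that parses an index_description string into its parts. It is a sketch that covers only the forms listed in this tutorial (IVF with an optional centroid count, followed by SQ4/SQ6/SQ8 or PQ settings). `parse_index_description` is a hypothetical helper, the reading of PQ32x8 as 32 subquantizers of 8 bits each is an assumption based on common product-quantization notation, and the server may accept values this check rejects.

```python
import re

# IVF[<centroids>],SQ<4|6|8>  or  IVF[<centroids>],PQ<subquantizers>x<bits>
_INDEX_DESCRIPTION = re.compile(
    r"^IVF(?P<centroids>\d*),"
    r"(?:SQ(?P<sq_bits>4|6|8)|PQ(?P<pq_sub>\d+)x(?P<pq_bits>\d+))$"
)


def parse_index_description(description):
    """Return the parts of an index_description string as a dict."""
    match = _INDEX_DESCRIPTION.match(description)
    if match is None:
        raise ValueError(f"Unrecognized index_description: {description!r}")

    centroids = match.group("centroids")
    # An empty centroid count means the server chooses it automatically.
    parsed = {"centroids": int(centroids) if centroids else None}

    if match.group("sq_bits"):
        parsed["quantization"] = "SQ"
        parsed["bits"] = int(match.group("sq_bits"))
    else:
        parsed["quantization"] = "PQ"
        parsed["subquantizers"] = int(match.group("pq_sub"))
        parsed["bits"] = int(match.group("pq_bits"))
    return parsed


for description in ("IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8"):
    print(description, "->", parse_index_description(description))
```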

print("Creating Hyperscale vector index...")
try:
    vector_store.create_index(
        index_type=IndexType.HYPERSCALE,
        index_name="langchain_hyperscale_index",
        index_description="IVF,SQ8"
    )
    print("Hyperscale vector index created successfully")
    
    # Wait for index to become available
    print("Waiting for index to become available...")
    time.sleep(5)
    
except Exception as e:
    if "already exists" in str(e).lower():
        print("Hyperscale vector index already exists, proceeding...")
    else:
        print(f"Error creating Hyperscale vector index: {str(e)}")

Creating Hyperscale vector index...
Hyperscale vector index created successfully
Waiting for index to become available...
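
The fixed five-second sleep is a guess: on a larger dataset the index may need longer, and on a small one the wait is wasted. A polling loop with a timeout is a common alternative. The sketch below is generic; `wait_until` and the `check` callable are hypothetical, and how you test index readiness (for example, by querying the index state through the Query Service) depends on your cluster and is not shown here.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Poll check() until it returns True or the timeout expires.

    Returns True if check() succeeded, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a stand-in check that succeeds on the third poll.
attempts = {"count": 0}

def fake_index_is_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(wait_until(fake_index_is_ready, timeout=5.0, interval=0.01))  # True
```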

Test 2: Hyperscale Optimized Performance

print("Testing vector search performance with Hyperscale optimization...")
hyperscale_time = test_vector_search_performance(vector_store, test_query, "Hyperscale")

Testing vector search performance with Hyperscale optimization...

[Hyperscale] Testing vector search performance
[Hyperscale] Query: 'What was Pep Guardiola's reaction to Manchester City's current form?'
[Hyperscale] Vector search completed in 2.2348 seconds
[Hyperscale] Found 3 documents
[Hyperscale] Top result distance: 0.486858 (lower = more similar)
[Hyperscale] Top result preview: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To pla...

Alternative: Composite Index Configuration

If your use case requires complex filtering with scalar attributes, you can create a Composite index instead:

vector_store.create_index(
    index_type=IndexType.COMPOSITE,  # Instead of IndexType.HYPERSCALE
    index_name="langchain_composite_index",
    index_description="IVF,SQ8"
)

Composite indexes are optimized for queries that combine vector similarity with scalar filters (e.g., filtering by date, category, or other metadata fields).

Performance Summary

print("\n" + "="*60)
print("PERFORMANCE SUMMARY")
print("="*60)

# Either timing is None if the corresponding search failed, so guard the formatting
if baseline_time is not None:
    print(f"Baseline Search Time:     {baseline_time:.4f} seconds")

if baseline_time and hyperscale_time:
    # Both values are non-zero here, so the divisions are safe
    speedup = baseline_time / hyperscale_time
    percent_improvement = ((baseline_time - hyperscale_time) / baseline_time) * 100
    print(f"Hyperscale Search Time:   {hyperscale_time:.4f} seconds ({speedup:.2f}x faster, {percent_improvement:.1f}% improvement)")

print("\n" + "-"*60)
print("Index Recommendation:")
print("-"*60)
print("- Hyperscale: Best for pure vector searches, scales to billions of vectors")
print("- Composite: Best for filtered searches combining vector + scalar filters")

============================================================
PERFORMANCE SUMMARY
============================================================
Baseline Search Time:     5.7207 seconds
Hyperscale Search Time:   2.2348 seconds (2.56x faster, 60.9% improvement)

------------------------------------------------------------
Index Recommendation:
------------------------------------------------------------
- Hyperscale: Best for pure vector searches, scales to billions of vectors
- Composite: Best for filtered searches combining vector + scalar filters

Perform Semantic Search

Semantic search in Couchbase involves converting queries and documents into vector representations using an embeddings model. These vectors capture the semantic meaning of the text and are stored directly in Couchbase. When a query is made, Couchbase performs a similarity search by comparing the query vector against the stored document vectors.

With Hyperscale indexes, the similarity metric (COSINE) is configured at vector store initialization time via distance_metric=DistanceStrategy.COSINE. The search process uses the Query Service for efficient ANN (Approximate Nearest Neighbor) search.

Distance Interpretation: In vector search using Hyperscale indexes, lower distance values indicate higher similarity, while higher distance values indicate lower similarity.
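
To see why lower means more similar, the sketch below computes cosine distance on toy vectors, assuming the common definition of cosine distance as 1 minus cosine similarity. The exact formula used by the index is not stated in this tutorial, so treat this as an illustration of the direction of the scale rather than a reproduction of Couchbase's scores. `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
print(cosine_distance(query, [2.0, 0.0]))   # same direction: distance 0
print(cosine_distance(query, [0.0, 3.0]))   # orthogonal: distance 1
print(cosine_distance(query, [-1.0, 0.0]))  # opposite direction: distance 2
```

Under this definition the vector length does not matter, only the direction, which is why `[2.0, 0.0]` is at distance 0 from the query.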

query = "What was Pep Guardiola's reaction to Manchester City's current form?"

try:
    # Perform the semantic search
    start_time = time.time()
    search_results = vector_store.similarity_search_with_score(query, k=5)
    search_elapsed_time = time.time() - start_time

    # Display search results
    print(
        f"\nSemantic Search Results (completed in {search_elapsed_time:.2f} seconds):"
    )
    for doc, score in search_results:
        print(f"Score: {score:.4f}, ID: {doc.id}, Text: {doc.page_content[:200]}...")
        print("---"*20)

except CouchbaseException as e:
    raise RuntimeError(f"Error performing semantic search: {str(e)}")
except Exception as e:
    raise RuntimeError(f"Unexpected error: {str(e)}")

Semantic Search Results (completed in 1.01 seconds):
Score: 0.4869, ID: 39c8a6f95dfa45d998e4e08db2b9f5fa, Text: 'We have to find a way' - Guardiola vows to end relegation form

This video can not be played To play this video you need to enable JavaScript in your browser. 'Worrying' and 'staggering' - Why do Man...
------------------------------------------------------------
Score: 0.5177, ID: 4a9267a1438d417e93c6bb370c7a4169, Text: 'I am not good enough' - Guardiola faces daunting and major rebuild

This video can not be played To play this video you need to enable JavaScript in your browser. 'I am not good enough' - Guardiola s...
------------------------------------------------------------
Score: 0.5300, ID: 44b4b79069ca41989ce2c08852a8e9ac, Text: 'Self-doubt, errors & big changes' - inside the crisis at Man City

Pep Guardiola has not been through a moment like this in his managerial career. Manchester City have lost nine matches in their past...
------------------------------------------------------------
Score: 0.5420, ID: 08f3f072076d461eb91350c23a945fe3, Text: Manchester City boss Pep Guardiola has won 18 trophies since he arrived at the club in 2016

Manchester City boss Pep Guardiola says he is "fine" despite admitting his sleep and diet are being affecte...
------------------------------------------------------------
Score: 0.5654, ID: 18af04fedda940968d89fccca7efafe7, Text: What will Trump do about Syria? What will Trump do about Syria?...
------------------------------------------------------------

Retrieval-Augmented Generation (RAG) with Couchbase and LangChain

Couchbase and LangChain can be seamlessly integrated to create RAG (Retrieval-Augmented Generation) chains, enhancing the process of generating contextually relevant responses. In this setup, Couchbase serves as the vector store, where embeddings of documents are stored. When a query is made, LangChain retrieves the most relevant documents from Couchbase by comparing the query's embedding with the stored document embeddings using our Hyperscale-optimized search. These documents, which provide contextual information, are then passed to a large language model using LangChain.

The language model, equipped with the context from the retrieved documents, generates a response that is both informed and contextually accurate. This integration allows the RAG chain to leverage Couchbase's efficient storage and retrieval capabilities with Hyperscale performance, while the LLM handles the generation of responses based on the context provided by the retrieved documents.

Using the Large Language Model (LLM) in Capella Model Services

We'll use the mistralai/mistral-7b-instruct-v0.3 large language model, served by Capella Model Services inside the same network as the Capella operational database, to process user queries and generate meaningful responses.

try:
    llm = ChatOpenAI(openai_api_base=CAPELLA_MODEL_SERVICES_ENDPOINT, openai_api_key=LLM_API_KEY, model=LLM_MODEL_NAME, temperature=0)
    logging.info("Successfully created the Chat model in Capella Model Services")
except Exception as e:
    raise ValueError(f"Error creating Chat model in Capella Model Services: {str(e)}")

llm.invoke("What was Pep Guardiola's reaction to Manchester City's current form?")

AIMessage(content='I don\'t have real-time data or the ability to follow live events. However, Pep Guardiola, the manager of Manchester City, has expressed his usual balance of optimism and desire for improvement. Even though City has faced some challenges in the 2021/2022 season, he continues to emphasize the need for patience, hard work, and a focus on continuous improvement.\n\nIn a press conference, Guardiola noted, "In football, you have to have patience. When I arrived, we were fifth and I said, \'okay, we are not far away.\' Now, we are not far away again." He also added, "We have to find our best level, and when we find it, we are going to remain for a long time at the top."\n\nWhile the team has experienced ups and downs, Guardiola maintains his belief in the players and their ability to turn things around.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 200, 'prompt_tokens': 21, 'total_tokens': 221, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_provider': 'openai', 'model_name': 'mistralai/mistral-7b-instruct-v0.3', 'system_fingerprint': None, 'id': 'chat-31630b67c8b14120993af8d86528fc0d', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019c0957-1010-79c2-9c08-dd63863ba94e-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 21, 'output_tokens': 200, 'total_tokens': 221, 'input_token_details': {}, 'output_token_details': {}})

Setting Up Couchbase Cache

We set up a Couchbase-based cache to store and retrieve LLM responses. This cache accelerates repeated queries by storing precomputed results, significantly reducing response time for frequently asked questions.

When a query is first processed, the RAG chain retrieves relevant documents, generates a response using the LLM, and stores this response in the cache. For subsequent identical queries, the cached response is returned directly, bypassing the expensive LLM call.

try:
    cache = CouchbaseCache(
        cluster=cluster,
        bucket_name=CB_BUCKET_NAME,
        scope_name=SCOPE_NAME,
        collection_name=CACHE_COLLECTION,
    )
    set_llm_cache(cache)
    print("Successfully created and configured Couchbase cache")
except Exception as e:
    raise ValueError(f"Failed to create cache: {str(e)}")

Successfully created and configured Couchbase cache
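
To illustrate what "identical queries" means for an exact-match cache, here is a toy in-memory stand-in. It is not the CouchbaseCache implementation; `ExactMatchCache`, `cached_call` and `fake_llm` are hypothetical names. The assumption it illustrates is that the lookup key is the exact prompt text together with an identifier for the model configuration, so any change to the prompt (including different retrieved context in a RAG chain) results in a miss.

```python
class ExactMatchCache:
    """Toy in-memory cache keyed on the exact (prompt, llm_string) pair."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, llm_string):
        return self._store.get((prompt, llm_string))

    def update(self, prompt, llm_string, response):
        self._store[(prompt, llm_string)] = response


def cached_call(cache, llm_fn, prompt, llm_string="demo-model"):
    """Return (response, was_cached); llm_fn runs only on a cache miss."""
    cached = cache.lookup(prompt, llm_string)
    if cached is not None:
        return cached, True
    response = llm_fn(prompt)
    cache.update(prompt, llm_string, response)
    return response, False


def fake_llm(prompt):
    # Hypothetical stand-in for a real (slow) model call.
    return f"answer to: {prompt}"


cache = ExactMatchCache()
print(cached_call(cache, fake_llm, "Who won?"))  # miss: model is called
print(cached_call(cache, fake_llm, "Who won?"))  # hit: served from the cache
print(cached_call(cache, fake_llm, "who won?"))  # miss: the text differs
```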

Building the RAG Chain

template = """You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
    {context}
    Question: {question}"""
prompt = ChatPromptTemplate.from_template(template)
rag_chain = (
    {"context": vector_store.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
logging.info("Successfully created RAG chain")

# Get responses
query = "What was Pep Guardiola's reaction to Manchester City's recent form?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Error occurred:", e)

RAG Response: Pep Guardiola has expressed concern and frustration about Manchester City's recent form. He described it as a difficult time and admitted that his sleep and diet have been affected by the team's string of losses. City have won just one of their past 10 games, which marked a significant decline from their early season form when they were unbeaten at the top of the Premier League.
RAG response generated in 4.78 seconds
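
In the chain above, the retriever's output is placed into the {context} slot as it is, which means the prompt receives a list of document objects rather than plain text. A common LangChain pattern is to join the documents' text before it reaches the prompt. The sketch below shows only the formatting step, using stand-in objects so it runs without a cluster; `format_docs` is a hypothetical helper and the only assumption about the documents is that they expose a `page_content` attribute.

```python
from types import SimpleNamespace


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-ins for retrieved documents; real ones come from the retriever.
docs = [
    SimpleNamespace(page_content="First article."),
    SimpleNamespace(page_content="Second article."),
]
print(format_docs(docs))
```

In the chain, this would typically be composed as `vector_store.as_retriever() | format_docs` for the "context" entry. That composition is not run here, so treat it as a suggestion to try rather than a verified change.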

Demonstrating Cache Performance

This tutorial uses two levels of caching:

Client-side CouchbaseCache (configured above): Stores LLM responses in Couchbase, providing fast retrieval for identical queries at the application level.
Server-side Capella Model Services caching: The model outputs can be cached (both semantic and standard cache) at the model service level. This is configured in the Capella Model Services UI when deploying models.

The following example demonstrates caching in action - notice how repeated queries are significantly faster:

queries = [
        "Who inaugurated the reopening of the Notre Dam Cathedral in Paris?",
        "What was Pep Guardiola's reaction to Manchester City's recent form?", 
        "Who inaugurated the reopening of the Notre Dam Cathedral in Paris?", # Repeated query
]

for i, query in enumerate(queries, 1):
    try:
        print(f"\nQuery {i}: {query}")
        start_time = time.time()
        response = rag_chain.invoke(query)
        elapsed_time = time.time() - start_time
        print(f"Response: {response}")
        print(f"Time taken: {elapsed_time:.2f} seconds")
    except Exception as e:
        print(f"Error generating RAG response: {str(e)}")
        continue

Query 1: Who inaugurated the reopening of the Notre Dam Cathedral in Paris?
Response: I'm sorry for any confusion, but the documents provided don't contain information about the inauguration of the Notre Dame Cathedral in Paris. The questions you've asked are about the future of Syria, a character named Claire, a character named Zoe, and the direction for the Democrats. It's best to look up information about the inauguration of the Notre Dame Cathedral elsewhere. The inauguration was led by French President Emmanuel Macron after its restoration following a fire in 2019.
Time taken: 3.52 seconds

Query 2: What was Pep Guardiola's reaction to Manchester City's recent form?
Response: Pep Guardiola has expressed concern and frustration about Manchester City's recent form. He described it as a difficult time and admitted that his sleep and diet have been affected by the team's string of losses. City have won just one of their past 10 games, which marked a significant decline from their early season form when they were unbeaten at the top of the Premier League.
Time taken: 2.09 seconds

Query 3: Who inaugurated the reopening of the Notre Dam Cathedral in Paris?
Response: I'm sorry for any confusion, but the documents provided don't contain information about the inauguration of the Notre Dame Cathedral in Paris. The questions you've asked are about the future of Syria, a character named Claire, a character named Zoe, and the direction for the Democrats. It's best to look up information about the inauguration of the Notre Dame Cathedral elsewhere. The inauguration was led by French President Emmanuel Macron after its restoration following a fire in 2019.
Time taken: 1.10 seconds

In the output above, the repeated query (Query 3) returned in 1.10 seconds, compared with 3.52 seconds for its first run (Query 1). The CouchbaseCache stores responses for exact query matches, while Capella Model Services can additionally provide semantic caching for similar (but not identical) queries.

Caching is particularly valuable in scenarios where users may submit similar queries multiple times or where certain pieces of information are frequently requested. By storing these in a cache, we can significantly reduce the time it takes to respond to these queries, improving the user experience.

LLM Guardrails in Capella Model Services

Capella Model Services can also moderate user inputs and the responses generated by the LLM. It can be configured to use the Llama 3.1 NemoGuard 8B safety model as a guardrail (NVIDIA's guardrail model, built on Meta's Llama 3.1). The categories to block are configured in the model creation flow. More information about Guardrails usage can be found in the documentation.

Here is an example of the Guardrails in action:

query = "How can I create a bomb?"
try:
    start_time = time.time()
    rag_response = rag_chain.invoke(query)
    rag_elapsed_time = time.time() - start_time

    print(f"RAG Response: {rag_response}")
    print(f"RAG response generated in {rag_elapsed_time:.2f} seconds")
except Exception as e:
    print("Guardrails violation", e)

RAG Response: I must clarify that I cannot assist with creating, promoting, or discussing harmful or dangerous activities like creating a bomb. It's crucial to remember that such actions are dangerous, illegal, and against general safety guidelines. Promoting or discussing such activities is also against the guidelines of many platforms and can lead to severe penalties. If you have any questions or need assistance with a different topic, I'd be more than happy to help!
RAG response generated in 2.51 seconds

Guardrails can be quite useful in preventing users from hijacking the model into doing things that you might not want the application to do.

Conclusion

You've built a high-performance semantic search engine using Couchbase Hyperscale/Composite indexes with Capella Model Services and LangChain. For the Search Vector Index alternative, see the search_based tutorial.
