Welcome to this comprehensive guide on constructing an AI-enhanced Chat Application. We will create a dynamic chat interface capable of delving into PDF documents to extract and provide summaries, key facts, and answers to your queries. By the end of this tutorial, you’ll have a powerful tool at your disposal, transforming the way you interact with and utilize the information contained within PDFs.
This tutorial will demonstrate how to:
- Connect to a Couchbase cluster and use it as a vector store with Haystack
- Create a vector-capable Search index for PDF embeddings
- Build a Haystack indexing pipeline that converts, cleans, splits, and embeds PDF content
- Build a Retrieval-Augmented Generation (RAG) pipeline with OpenAI
- Tie everything together in a Streamlit chat interface
Note that this tutorial is designed to work with the latest Python SDK version (4.2.0+) for Couchbase. It will not work with the older Python SDK versions.
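If you are unsure which version of the Couchbase Python SDK is installed in your environment, you can check it from Python; a quick sketch (it assumes the couchbase package has already been installed with pip):
# Print the installed Couchbase Python SDK version; it should be 4.2.0 or newer.
from importlib.metadata import version

print(version("couchbase"))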
git clone https://github.com/couchbase-examples/haystack-demo.git
Any dependencies should be installed through pip, the default package manager for Python. You may use a virtual environment as well.
python -m pip install -r requirements.txt
To learn more about connecting to your Capella cluster, please follow the instructions.
Specifically, you need to create a bucket named pdf-chat. We will use the _default scope and _default collection of this bucket.
We need to create the Search Index on the Full Text Service in Couchbase. For this demo, you can import the following index using the instructions.
You may also create a vector index using the Search UI on both Couchbase Capella and Couchbase Self-Managed Server.
Here, we are creating the index pdf_search on the documents. The vector field is set to embedding with 1536 dimensions, and the text field is set to text. We are also indexing and storing all the fields under metadata in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to dot_product. If there is a change in these parameters, please adapt the index accordingly.
{
"name": "pdf_search",
"type": "fulltext-index",
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "scope.collection.type_field",
"type_field": "type"
},
"mapping": {
"default_analyzer": "standard",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"dynamic": true,
"enabled": false
},
"default_type": "_default",
"docvalues_dynamic": false,
"index_dynamic": true,
"store_dynamic": false,
"type_field": "_type",
"types": {
"_default._default": {
"dynamic": true,
"enabled": true,
"properties": {
"embedding": {
"enabled": true,
"dynamic": false,
"fields": [
{
"dims": 1536,
"index": true,
"name": "embedding",
"similarity": "dot_product",
"type": "vector",
"vector_index_optimized_for": "recall"
}
]
},
"metadata": {
"dynamic": true,
"enabled": true
},
"text": {
"enabled": true,
"dynamic": false,
"fields": [
{
"index": true,
"name": "text",
"store": true,
"type": "text"
}
]
}
}
}
}
},
"store": {
"indexType": "scorch",
"segmentVersion": 16
}
},
"sourceType": "gocbcore",
"sourceName": "pdf-chat",
"sourceParams": {},
"planParams": {
"maxPartitionsPerPIndex": 64,
"indexPartitions": 16,
"numReplicas": 0
}
}
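If you are working with a self-managed cluster, one way to import this index definition is through the Search (FTS) REST API; the following is a hedged sketch where the host, port, credentials, and the saved pdf_search_index.json file name are assumptions (on Capella, you can import the definition through the Search UI instead):
# Create the pdf_search index from a saved JSON definition via the Search REST API.
import json
import requests

with open("pdf_search_index.json") as f:
    index_definition = json.load(f)

response = requests.put(
    "http://localhost:8094/api/index/pdf_search",
    auth=("Administrator", "password"),
    headers={"Content-Type": "application/json"},
    json=index_definition,
)
print(response.status_code, response.text)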
Copy the secrets.example.toml file in the .streamlit folder, rename it to secrets.toml, and replace the placeholders with the actual values for your environment. All configuration for communication with the database is read from the environment variables.
DB_CONN_STR = "<couchbase_cluster_connection_string>"
DB_USERNAME = "<couchbase_username>"
DB_PASSWORD = "<couchbase_password>"
DB_BUCKET = "<bucket_name>"
DB_SCOPE = "<scope_name>"
DB_COLLECTION = "<collection_name>"
INDEX_NAME = "<vector_capable_fts_index_name>"
OPENAI_API_KEY = "<openai_api_key>"
The OpenAI API key is required for generating embeddings and querying the LLM.
The connection string expects the couchbases:// or couchbase:// prefix.
For this tutorial, DB_BUCKET = pdf-chat, DB_SCOPE = _default, DB_COLLECTION = _default, and INDEX_NAME = pdf_search.
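Note that Streamlit exposes top-level entries from .streamlit/secrets.toml as environment variables when the app is launched with streamlit run, which is why the rest of the code can read them with plain os.getenv; a minimal sketch of what that looks like:
import os

# Top-level keys from .streamlit/secrets.toml are available as environment variables.
DB_BUCKET = os.getenv("DB_BUCKET")          # e.g. pdf-chat
DB_SCOPE = os.getenv("DB_SCOPE")            # e.g. _default
DB_COLLECTION = os.getenv("DB_COLLECTION")  # e.g. _default
INDEX_NAME = os.getenv("INDEX_NAME")        # e.g. pdf_search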
After starting the Couchbase server, adding the vector index, and installing the dependencies, our application is ready to run.
In the project's root directory, run the following command:
streamlit run chat_with_pdf.py
The application will run on your local machine at http://localhost:8501.
On the left sidebar, you'll find an option to upload a PDF document you want to use with this PDF Chat App. Depending on the size of the PDF, the upload process may take some time.
In the main area, there's a chat screen where you can ask questions about the uploaded PDF document. You will receive two responses: one with context from the PDF (Couchbase logo), and one without the PDF context (bot logo, 🤖). This demonstrates how the Retrieval Augmented Generation (RAG) model enhances the answers provided by the language model using the PDF content.
The PDF Chat application leverages two powerful concepts: Retrieval-Augmented Generation (RAG) and Vector Search. Together, these techniques enable efficient and context-aware interactions with PDF documents.
RAG is like having two helpers:
- A retriever that searches the vector store and finds the parts of the PDF most relevant to your question.
- A generator (the LLM) that reads those parts and writes an answer in natural language.
Here's how RAG works:
- Your question is converted into an embedding.
- The embedding is used to search the vector store for the most similar chunks of the PDF.
- The retrieved chunks are added to the prompt as context.
- The LLM answers the question using that context.
This enriches the prompt with context from the PDF, so the LLM can give answers grounded in the document rather than generalized results.
Couchbase is a NoSQL database that provides a powerful Vector Search capability. It allows you to store and search through high-dimensional vector representations (embeddings) of textual data, such as PDF content.
The PDF Chat app uses Haystack to convert the text from the PDF documents into embeddings. These embeddings are then stored in a Couchbase bucket, along with the corresponding text.
When a user asks a question or provides a prompt:
- The app embeds the question using the same embedding model used for the PDF chunks.
- Couchbase Vector Search finds the stored chunks whose embeddings are most similar to the question.
- The matching chunks are passed to the LLM as context for generating the answer.
Haystack is a powerful library that simplifies the process of building applications with large language models (LLMs) and vector stores like Couchbase.
In the PDF Chat app, Haystack is used for several tasks:
- Converting the uploaded PDF into documents and cleaning them.
- Splitting the documents into chunks and generating embeddings with OpenAI.
- Writing the chunks and their embeddings into the Couchbase vector store through an indexing pipeline.
- Retrieving relevant chunks and generating answers through a RAG pipeline.
By combining Vector Search with Couchbase, RAG, and Haystack, the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models.
To begin this tutorial, clone the repo and open it up in the IDE of your choice. Now you can learn how to create the PDF Chat App. The whole code is written in the chat_with_pdf.py file.
The fundamental workflow of the application is as follows: the user initiates the process from the main page's sidebar by uploading a PDF. This action triggers the save_to_vector_store function, which uploads the PDF into the Couchbase vector store. Following this, the user can chat with the LLM.
In the chat area, the user can pose questions. These inquiries are processed by the Chat API, which consults the LLM for responses, aided by the context provided by RAG. The assistant then delivers the answer, and the user has the option to ask additional questions.
The first step is connecting to Couchbase. Couchbase Vector Search is required for the PDF upload as well as during chat (for retrieval). We will use the Haystack CouchbaseDocumentStore to connect to the Couchbase cluster. The connection is established in the get_document_store function.
The connection string and credentials are read from the environment variables. We perform some basic checks for required environment variables that are not set in secrets.toml, and then proceed to connect to the Couchbase cluster by creating the CouchbaseDocumentStore.
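For reference, the environment-variable check mentioned above could look like the following minimal sketch; the helper name check_environment_variable is an assumption and may differ from the actual demo code.
import os
import streamlit as st

def check_environment_variable(variable_name):
    """Stop the app early if a required setting is missing from secrets.toml / the environment."""
    if variable_name not in os.environ:
        st.error(f"{variable_name} environment variable is not set in .streamlit/secrets.toml")
        st.stop()
With a guard like this in place for each required variable, we can safely create the document store.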
import os

import streamlit as st
from haystack.utils import Secret
from couchbase_haystack import CouchbaseDocumentStore, CouchbasePasswordAuthenticator, CouchbaseClusterOptions
@st.cache_resource(show_spinner="Connecting to Vector Store")
def get_document_store():
    """Return the Couchbase document store"""
    return CouchbaseDocumentStore(
        cluster_connection_string=Secret.from_env_var("DB_CONN_STR"),
        authenticator=CouchbasePasswordAuthenticator(
            username=Secret.from_env_var("DB_USERNAME"),
            password=Secret.from_env_var("DB_PASSWORD"),
        ),
        cluster_options=CouchbaseClusterOptions(profile='wan_development'),
        bucket=os.getenv("DB_BUCKET"),
        scope=os.getenv("DB_SCOPE"),
        collection=os.getenv("DB_COLLECTION"),
        vector_search_index=os.getenv("INDEX_NAME"),
    )
We will define the bucket, scope, collection, and index names from environment variables. We will now initialize the CouchbaseDocumentStore, which will be used for storing and retrieving document embeddings.
# Initialize document store
document_store = get_document_store()
The save_to_vector_store function takes care of uploading the PDF file in vector format to the Couchbase database using the Haystack indexing pipeline. It converts the PDF to documents, cleans them, splits the text into small chunks, generates embeddings for those chunks, and ingests the chunks and their embeddings into the Couchbase vector store.
This part of the code creates a file uploader in the sidebar using the Streamlit library. After the PDF is uploaded, the save_to_vector_store function is called to further process the PDF.
with st.form("upload pdf"):
    uploaded_file = st.file_uploader("Choose a PDF.", help="The document will be deleted after one hour of inactivity (TTL).", type="pdf")
    submitted = st.form_submit_button("Upload")
    if submitted:
        save_to_vector_store(uploaded_file, indexing_pipeline)
The save_to_vector_store function ensures that the uploaded PDF file is properly handled, loaded, and prepared for storage in the vector store. It first checks if a file was actually uploaded. Then the uploaded file is saved to a temporary file.
The indexing pipeline takes care of converting the PDF to documents, cleaning them, splitting them into chunks, generating embeddings, and storing them in the Couchbase vector store.
def save_to_vector_store(uploaded_file, indexing_pipeline):
    """Process the PDF & store it in Couchbase Vector Store"""
    if uploaded_file is not None:
        temp_dir = tempfile.TemporaryDirectory()
        temp_file_path = os.path.join(temp_dir.name, uploaded_file.name)
        with open(temp_file_path, "wb") as f:
            f.write(uploaded_file.getvalue())
        result = indexing_pipeline.run({"converter": {"sources": [temp_file_path]}})
        st.info(f"PDF loaded into vector store: {result['writer']['documents_written']} documents indexed")
The indexing pipeline is created to handle the entire process of ingesting PDFs into the vector store. It includes the following components:
- converter (PyPDFToDocument): converts the PDF file into Haystack documents.
- cleaner (DocumentCleaner): cleans up the extracted text.
- splitter (DocumentSplitter): splits the documents into smaller chunks (split by sentence, with a length of 250 and an overlap of 30).
- embedder (OpenAIDocumentEmbedder): generates embeddings for each chunk using OpenAI.
- writer (DocumentWriter): writes the chunks and their embeddings into the Couchbase document store.
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=250, split_overlap=30))
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")
After uploading the PDF into Couchbase, we are ready to utilize the power of Couchbase Vector Search, RAG, and the LLM to get context-based answers to our questions. When the user asks a question, the assistant (LLM) is called with the RAG context, and the response from the assistant is sent back to the user.
We create a RAG (Retrieval-Augmented Generation) pipeline using Haystack components. This pipeline handles the entire process of retrieving relevant context from the vector store and generating responses using the LLM.
The OpenAIGenerator is a crucial component in our RAG pipeline, responsible for generating human-like responses based on the retrieved context and user questions. Here's a more detailed explanation of its configuration and role:
- It is configured with the OpenAI API key and the gpt-4o model.
- It receives the prompt assembled by the PromptBuilder, which combines the retrieved document chunks with the user's question.
- Its replies and metadata are passed to the AnswerBuilder, which produces the final answer returned to the user.
By using the OpenAIGenerator, we leverage state-of-the-art natural language processing capabilities to provide accurate, context-aware answers to user queries about the uploaded PDF content.
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.utils import Secret
from couchbase_haystack import CouchbaseEmbeddingRetriever

rag_pipeline = Pipeline()
rag_pipeline.add_component("query_embedder", OpenAITextEmbedder())
rag_pipeline.add_component("retriever", CouchbaseEmbeddingRetriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", PromptBuilder(template="""
You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{question}}
"""))
rag_pipeline.add_component(
    "llm",
    OpenAIGenerator(
        api_key=Secret.from_env_var("OPENAI_API_KEY"),
        model="gpt-4o",
    ),
)
rag_pipeline.add_component("answer_builder", AnswerBuilder())
rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
This section creates an interactive chat interface where users can ask questions based on the uploaded PDF. The key steps are:
- Capture the user's question with st.chat_input, display it, and append it to the session's message history.
- Run the RAG pipeline with the question to embed it, retrieve the top matching chunks, build the prompt, and generate an answer.
- Display the generated answer in the chat and append it to the message history.
if question := st.chat_input("Ask a question based on the PDF"):
    st.chat_message("user").markdown(question)
    st.session_state.messages.append({"role": "user", "content": question, "avatar": "👤"})

    # RAG response
    with st.chat_message("assistant", avatar=couchbase_logo):
        message_placeholder = st.empty()
        rag_result = rag_pipeline.run(
            {
                "query_embedder": {"text": question},
                "retriever": {"top_k": 3},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )
        rag_response = rag_result["answer_builder"]["answers"][0].data
        message_placeholder.markdown(rag_response)
        st.session_state.messages.append({"role": "assistant", "content": rag_response, "avatar": couchbase_logo})
This setup allows users to have a conversational experience, asking questions related to the uploaded PDF, with responses generated by the RAG pipeline. Both the user's questions and the assistant's responses are displayed in the chat interface, along with their respective roles and avatars.
Similar to the last section, we also get an answer to the user's question directly from the LLM, without the RAG context. These answers are also shown in the UI to showcase the difference and how using RAG gives better, more context-aware results.
# stream the response from the pure LLM
# Add placeholder for streaming the response
with st.chat_message("ai", avatar="🤖"):
    message_placeholder_pure_llm = st.empty()
    pure_llm_result = rag_pipeline.run(
        {
            "prompt_builder": {"question": question},
            "llm": {},
            "answer_builder": {"query": question},
            "query_embedder": {"text": question},
        }
    )
    pure_llm_response = pure_llm_result["answer_builder"]["answers"][0].data
    message_placeholder_pure_llm.markdown(pure_llm_response)
    st.session_state.messages.append({"role": "assistant", "content": pure_llm_response, "avatar": "🤖"})
Congratulations! You've successfully built a PDF Chat application using Couchbase, Haystack, and OpenAI. This application demonstrates the power of combining vector search capabilities with large language models to create an intelligent, context-aware chat interface for PDF documents.
Through this tutorial, you've learned how to:
- Connect to a Couchbase cluster as a vector store using the Haystack CouchbaseDocumentStore.
- Create a vector-capable Search index for storing and querying embeddings.
- Build an indexing pipeline that converts, cleans, splits, and embeds PDF content.
- Build a RAG pipeline that retrieves relevant chunks and generates context-aware answers with OpenAI.
- Tie everything together in a Streamlit chat interface.
We hope this tutorial has given you a solid foundation for building AI-powered document interaction systems with Couchbase and Haystack. Happy coding!