This tutorial will show how to use a Couchbase database cluster as a Langchain4j embedding store.
Example source code for this tutorial can be obtained from the Langchain4j examples GitHub project. To run the example project, you will need Git, a JDK, Maven, and a running Docker daemon (the sample starts its own Couchbase server with Testcontainers). Clone the repository using git:
git clone https://github.com/langchain4j/langchain4j-examples.git
cd langchain4j-examples/couchbase-example
Langchain4j is a framework that simplifies integrating LLM-based services into Java applications. Additional information about the framework and its usage can be found on the Langchain4j documentation website.
In Langchain4j, embedding stores are used to store vector embeddings, which represent coordinates in an embedding space. The topology and dimensionality of the embedding space are defined by the selected language model. Each coordinate in the space represents some kind of sentiment or idea, and the closer any two embedding vectors are to each other, the more similar the ideas they represent. By storing embeddings acquired from a pretrained model in dedicated storage, developers can greatly improve the performance of their AI-based applications.
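To make the notion of "closeness" concrete, here is a small self-contained sketch (not part of the Langchain4j API) that compares two embedding vectors using cosine similarity, a common similarity measure in embedding spaces:
// Standalone illustration: cosine similarity between two embedding vectors.
// The closer the result is to 1, the more similar the ideas the vectors represent.
static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}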
The Couchbase Langchain4j integration stores each embedding in a separate document and uses an FTS vector index to perform queries against stored vectors. Currently, it supports storing embeddings and their metadata, as well as removing embeddings. Filtering vector search results by embedding metadata was not supported at the time of writing this tutorial. Please note that the embedding store integration is still under active development, and the default configurations it ships with are not recommended for production usage.
A builder class can be used to initialize the Couchbase embedding store. The following parameters are required for initialization:
- the connection string of the Couchbase cluster
- a username and password
- the names of the bucket, scope, and collection in which embedding documents will be stored
- the name of the FTS vector index to use for similarity lookups
- the dimensionality of stored vectors, which must match the output of the selected embedding model
The following sample code illustrates how to initialize an embedding store that connects to a locally running Couchbase server:
CouchbaseEmbeddingStore embeddingStore = new CouchbaseEmbeddingStore.Builder("localhost:8091")
.username("Administrator")
.password("password")
.bucketName("langchain4j")
.scopeName("_default")
.collectionName("_default")
.searchIndexName("test")
.dimensions(512)
.build();
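Note that the dimensions value must match the dimensionality of the vectors produced by the selected embedding model. Rather than hard-coding it, the value can be obtained from the model itself (a sketch, assuming a Langchain4j version in which EmbeddingModel exposes dimension()):
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
int dimensions = embeddingModel.dimension(); // 384 for all-MiniLM-L6-v2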
The sample source code provided with this tutorial uses a different approach and starts a dedicated Couchbase server using the Testcontainers library:
// Define the bucket that the container will create on startup
BucketDefinition testBucketDefinition = new BucketDefinition("langchain4j");

CouchbaseContainer couchbaseContainer =
        new CouchbaseContainer(DockerImageName.parse("couchbase:enterprise").asCompatibleSubstituteFor("couchbase/server"))
                .withCredentials("Administrator", "password")
                .withBucket(testBucketDefinition)
                .withStartupTimeout(Duration.ofMinutes(1));
couchbaseContainer.start(); // the container must be running before its connection details can be read
CouchbaseEmbeddingStore embeddingStore = new CouchbaseEmbeddingStore.Builder(couchbaseContainer.getConnectionString())
        .username(couchbaseContainer.getUsername())
        .password(couchbaseContainer.getPassword())
        .bucketName(testBucketDefinition.getName())
        .scopeName("_default")
        .collectionName("_default")
        .searchIndexName("test")
        .dimensions(384) // the AllMiniLmL6V2 model used below produces 384-dimensional vectors
        .build();
The embedding store uses an FTS vector index to perform vector similarity lookups. If provided with the name of a vector index that does not exist on the cluster, the store will attempt to create a new index with a default configuration based on the provided initialization settings. It is recommended to manually review the settings of the created index and adjust them for your specific use case. More information about vector search and FTS index configuration can be found in the Couchbase documentation.
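For example, the definition of an auto-created index can be fetched for review with the Couchbase Java SDK's search index manager. This is a minimal sketch; the connection string, credentials, and index name are assumptions carried over from the earlier examples:
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.manager.search.SearchIndex;

// Sketch: fetch the auto-created FTS index definition for manual review.
Cluster cluster = Cluster.connect("couchbase://localhost", "Administrator", "password");
SearchIndex index = cluster.searchIndexes().getIndex("test");
System.out.println(index); // dump the definition so it can be reviewed and tuned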
The integration automatically assigns unique UUID-based identifiers to all stored embeddings. Here is an example embedding document (with vector field values truncated for readability):
{
  "id": "f4831648-07ca-4c77-a031-75acb6c1cf2f",
  "vector": [
    ...
    0.037255168,
    -0.001608681
  ],
  "text": "text",
  "metadata": {
    "some": "value"
  },
  "score": 0
}
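Because each embedding is stored as a separate document, stored embeddings can also be inspected directly with the Couchbase Java SDK. In the following sketch, the bucket, scope, and collection names are the ones used in the examples above, and the document id is the sample UUID shown in the example document:
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.kv.GetResult;

// Sketch: fetch a stored embedding document by its UUID-based id.
Cluster cluster = Cluster.connect("couchbase://localhost", "Administrator", "password");
GetResult result = cluster.bucket("langchain4j")
        .scope("_default")
        .collection("_default")
        .get("f4831648-07ca-4c77-a031-75acb6c1cf2f");
System.out.println(result.contentAsObject());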
These embeddings are generated with an embedding model selected by the developer, and the resulting vector values are model-specific. Embeddings generated with a language model can be stored in Couchbase using the add method of a CouchbaseEmbeddingStore instance:
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
TextSegment segment1 = TextSegment.from("I like football.");
Embedding embedding1 = embeddingModel.embed(segment1).content();
embeddingStore.add(embedding1, segment1);
TextSegment segment2 = TextSegment.from("The weather is good today.");
Embedding embedding2 = embeddingModel.embed(segment2).content();
embeddingStore.add(embedding2, segment2);
Thread.sleep(1000); // to be sure that embeddings were persisted
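Several embeddings can also be added in a single call using addAll, which returns the generated identifiers. This is a sketch, assuming the Couchbase store supports the standard EmbeddingStore batch methods:
// Batch variant of the example above.
List<TextSegment> segments = List.of(segment1, segment2);
List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
List<String> ids = embeddingStore.addAll(embeddings, segments);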
After adding some embeddings to the store, a query vector can be used to find relevant embeddings in the store. Here, we're using the embedding model to generate a vector for the phrase "What is your favourite sport?". The obtained vector is then used to find the most relevant answer in the database:
Embedding queryEmbedding = embeddingModel.embed("What is your favourite sport?").content();
List<EmbeddingMatch<TextSegment>> relevant = embeddingStore.findRelevant(queryEmbedding, 1);
EmbeddingMatch<TextSegment> embeddingMatch = relevant.get(0);
The relevancy score and text of the selected answer can then be printed to the application output:
System.out.println(embeddingMatch.score()); // 0.81442887
System.out.println(embeddingMatch.embedded().text()); // I like football.
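In newer Langchain4j versions, the same lookup can also be expressed through the EmbeddingSearchRequest API. This is a sketch; availability of these types depends on the Langchain4j version in use:
// Sketch: equivalent lookup via the newer search API.
EmbeddingSearchRequest request = EmbeddingSearchRequest.builder()
        .queryEmbedding(queryEmbedding)
        .maxResults(1)
        .build();
List<EmbeddingMatch<TextSegment>> matches = embeddingStore.search(request).matches();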
The Couchbase embedding store also supports removing embeddings by their identifiers, for example:
embeddingStore.remove(embeddingMatch.embeddingId());
Or, to remove all embeddings:
embeddingStore.removeAll();
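A specific subset of embeddings can also be removed by passing a collection of identifiers, assuming the store implements the removeAll(Collection) overload of the EmbeddingStore interface:
// Sketch: remove several embeddings by id in one call.
embeddingStore.removeAll(List.of(embeddingMatch.embeddingId()));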