How to Store and Query Embeddings in ChromaDB Collections

What is a Collection in chromadb
Purpose of Using a Collection

What is a Collection in chromadb?

A collection in chromadb is a container or namespace that organizes and stores related data (e.g., documents, embeddings). It allows you to:

Group Related Data:

Documents and their corresponding embeddings are stored within a specific collection.
You can have multiple collections, each for different datasets or projects.

Facilitate Similarity Search:

Collections are designed to support vector search operations, where you query with a vector and retrieve the most similar documents in the collection.

Manage Metadata:

Collections can store additional information (e.g., document IDs, metadata) to make queries more meaningful.
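
To make the three roles above concrete, here is a minimal conceptual sketch of what a collection holds. ToyCollection is a hypothetical stand-in written for illustration, not Chroma's implementation, and the 2-d vectors are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not Chroma's code)."""

    name: str
    ids: list = field(default_factory=list)
    documents: list = field(default_factory=list)
    embeddings: list = field(default_factory=list)
    metadatas: list = field(default_factory=list)

    def add(self, ids, documents, embeddings, metadatas=None):
        # Parallel lists: position i in each list describes the same record
        if metadatas is None:
            metadatas = [None] * len(ids)
        if not (len(ids) == len(documents) == len(embeddings) == len(metadatas)):
            raise ValueError("all arguments must have the same length")
        if len(set(ids)) != len(ids) or set(ids) & set(self.ids):
            raise ValueError("ids must be unique within the collection")
        self.ids.extend(ids)
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)
        self.metadatas.extend(metadatas)

    def get(self):
        # Return the stored records without the (usually large) embeddings
        return {
            "ids": list(self.ids),
            "documents": list(self.documents),
            "metadatas": list(self.metadatas),
        }


greek = ToyCollection("greek_letters")
greek.add(
    ids=["0", "1"],
    documents=[
        "Alpha is the first letter of the Greek alphabet.",
        "Beta is the second letter of the Greek alphabet.",
    ],
    embeddings=[[0.9, 0.1], [0.1, 0.9]],  # made-up 2-d vectors
    metadatas=[{"position": 1}, {"position": 2}],
)
print(greek.get()["ids"])  # ['0', '1']
```

The point of the sketch is the record layout: one ID, one document, one embedding, and optional metadata per entry, all kept under a named container.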

Purpose of Using a Collection

Storing Embeddings:

You compute embeddings for documents using a model such as OpenAI’s text-embedding-ada-002 and store them in the collection.

Querying for Similarity:

Once embeddings are stored, you can query the collection with a new embedding (e.g., an embedded search query) and retrieve similar documents ranked by a distance metric. Chroma's default is squared L2 distance; cosine distance can be selected when the collection is created.

Efficient Retrieval:

Collections index the stored embeddings, which makes similarity search faster and more scalable than comparing the query against every stored vector.
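
As a rough illustration of what "distance" means here, the sketch below computes two common metrics, cosine distance and squared L2 distance, in plain Python. The function names and the toy vectors are assumptions made for this example; real embeddings have hundreds or thousands of dimensions.

```python
import math


def squared_l2(a, b):
    # Sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means same direction, 1 means orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
close = [2.0, 0.0]  # same direction as the query, different length
far = [0.0, 1.0]    # orthogonal to the query

print(cosine_distance(query, close))  # 0.0
print(cosine_distance(query, far))    # 1.0
print(squared_l2(query, close))       # 1.0
print(squared_l2(query, far))         # 2.0
```

With both metrics a lower value means a closer match, but they can disagree: cosine distance ignores vector length, while L2 does not.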

Adding Documents to the Collection

import chromadb
from openai import OpenAI

# Create an OpenAI client (openai>=1.0; the older openai.Embedding.create call was removed in 1.0)
openai_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

# Initialize Chroma client (in-memory)
client = chromadb.Client()

# Create the collection, or reuse it if it already exists
collection = client.get_or_create_collection("documents")

# Documents to add
documents = [
    "Alpha is the first letter of the Greek alphabet.",
    "Beta is the second letter of the Greek alphabet.",
    "Gamma is the third letter."
]

# Compute an embedding for one piece of text using the OpenAI API
def get_embedding(text):
    response = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding

# Add documents and their embeddings to the collection
for idx, doc in enumerate(documents):
    embedding = get_embedding(doc)
    collection.add(documents=[doc], embeddings=[embedding], ids=[str(idx)])

print("Documents added to the collection.")
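
If you want to exercise the add and query steps without an API key, one option is a deterministic stand-in for get_embedding. The sketch below is an assumption made for testing only: fake_embedding is a hypothetical helper, and its vectors carry no semantic meaning, so any similarity ranking it produces is arbitrary.

```python
import hashlib
import math


def fake_embedding(text, dim=8):
    """Hypothetical offline stand-in for get_embedding (deterministic, not semantic)."""
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first `dim` bytes to floats in [-1, 1]
    raw = [(byte / 127.5) - 1.0 for byte in digest[:dim]]
    # Normalize to unit length, as many embedding models do
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


documents = [
    "Alpha is the first letter of the Greek alphabet.",
    "Beta is the second letter of the Greek alphabet.",
    "Gamma is the third letter.",
]
ids = [str(idx) for idx in range(len(documents))]
embeddings = [fake_embedding(doc) for doc in documents]
print(len(embeddings), "vectors of length", len(embeddings[0]))
```

The resulting ids, documents, and embeddings lists have the same shape as the arguments passed to collection.add, so they can be swapped in while testing the plumbing.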

Retrieve All Documents in the Collection

# Get all documents in the collection
all_documents = collection.get()
print("Documents in the collection:", all_documents)

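Output: a dictionary whose keys include ids, documents, and metadatas. Embeddings are not returned by default; pass include=["embeddings"] to collection.get() to retrieve them.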

Querying the Collection

# Query a similar document
query = "What is the first letter of the Greek alphabet?"
query_embedding = get_embedding(query)

# Perform a similarity search
results = collection.query(query_embeddings=[query_embedding], n_results=2)

print("Query Results:")
for doc, score in zip(results["documents"][0], results["distances"][0]):
    print(f"Document: {doc}, Similarity Score: {score}")

Expected Output

After Adding Documents:

Documents added to the collection.

Query Results:

For the query "What is the first letter of the Greek alphabet?", the output might look like this (the printed score is a distance, so lower values mean closer matches):

Query Results:
Document: Alpha is the first letter of the Greek alphabet., Similarity Score: 0.12
Document: Beta is the second letter of the Greek alphabet., Similarity Score: 0.45
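
Conceptually, the query above ranks the stored vectors by their distance to the query vector and keeps the n_results closest ones. The brute-force sketch below illustrates that idea in plain Python. It is not how Chroma works internally (Chroma uses an approximate HNSW index), and brute_force_query and the 3-d vectors are hypothetical.

```python
def brute_force_query(query_embeddings, ids, documents, embeddings, n_results=2):
    """Hypothetical helper: exact nearest-neighbour search by squared L2 distance."""
    results = {"ids": [], "documents": [], "distances": []}
    for query in query_embeddings:
        scored = []
        for record_id, doc, emb in zip(ids, documents, embeddings):
            dist = sum((q - e) ** 2 for q, e in zip(query, emb))
            scored.append((dist, record_id, doc))
        # Smallest distance first, then keep the top n_results
        scored.sort(key=lambda item: item[0])
        top = scored[:n_results]
        # One inner list per query, mirroring the nested shape used above
        results["ids"].append([record_id for _, record_id, _ in top])
        results["documents"].append([doc for _, _, doc in top])
        results["distances"].append([dist for dist, _, _ in top])
    return results


toy_ids = ["0", "1", "2"]
toy_docs = ["Alpha", "Beta", "Gamma"]
toy_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # made-up vectors

hits = brute_force_query([[1.0, 0.0, 0.0]], toy_ids, toy_docs, toy_vectors, n_results=2)
for doc, dist in zip(hits["documents"][0], hits["distances"][0]):
    print(f"Document: {doc}, Distance: {dist}")
```

Comparing against every stored vector is fine for three documents but scales linearly with collection size, which is the cost an index is meant to avoid.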

Query the Collection for Similar Documents
To retrieve only the single most similar document for a given text, set n_results=1. Printing the result directly shows the raw dictionary that query returns.

# Query for similar documents
query = "What is the first letter of the Greek alphabet?"
query_embedding = get_embedding(query)

# Perform similarity search
results = collection.query(query_embeddings=[query_embedding], n_results=1)

print("Query Result:", results)

Output:

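Query Result: {'ids': [['0']],
               'documents': [['Alpha is the first letter of the Greek alphabet.']],
               'distances': [[0.12]]}

Here (other keys such as metadatas are omitted for brevity, and each value holds one inner list per query embedding):

ids: The ID of the most similar document.
documents: The text of the most similar document.
distances: The distance between the query and that document (lower values indicate higher similarity).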
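
Because query accepts a list of query embeddings, each value in the result is a list with one inner list per query, which is why the earlier loop indexes [0]. A small helper can flatten that shape into one row per hit. flatten_results is a hypothetical name, and the sample dictionary reuses the illustrative values from this article rather than real output.

```python
def flatten_results(results):
    """Hypothetical helper: turn nested query results into one row per hit."""
    rows = []
    for query_index, hit_ids in enumerate(results["ids"]):
        docs = results["documents"][query_index]
        dists = results["distances"][query_index]
        for rank, (record_id, doc, dist) in enumerate(zip(hit_ids, docs, dists), start=1):
            rows.append(
                {
                    "query": query_index,  # which query embedding produced this hit
                    "rank": rank,          # 1 = closest match
                    "id": record_id,
                    "document": doc,
                    "distance": dist,
                }
            )
    return rows


sample = {
    "ids": [["0"]],
    "documents": [["Alpha is the first letter of the Greek alphabet."]],
    "distances": [[0.12]],
}
rows = flatten_results(sample)
for row in rows:
    print(row)
```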

Check Collection Metadata

You can inspect the collection to see how many documents and embeddings are stored.

# Check metadata
print("Number of documents in the collection:", len(collection.get()['documents']))

Output:

Number of documents in the collection: 3

SUMMARY

What is a collection in chromadb? ==> a container or namespace that stores documents, embeddings, IDs, and metadata
It lets you ==> group related data || facilitate similarity search || manage metadata
Purpose of using a collection ==> storing embeddings || querying for similarity || efficient retrieval
Adding text and vector (embedded) data ==> create a collection named "documents" with the Chroma client, where documents is a list of text strings
Embedding each document ==> for idx, doc in enumerate(documents): embedding = get_embedding(doc)
Storing ==> collection.add(documents=[doc], embeddings=[embedding], ids=[str(idx)])
Retrieving everything ==> collection.get()
Querying the collection ==> collection.query(query_embeddings=[query_embedding], n_results=2) returns ids, documents, distances
Reading the results ==> for doc, score in zip(results["documents"][0], results["distances"][0]): print(f"Document: {doc}, Similarity Score: {score}")
Most similar document only ==> collection.query(query_embeddings=[query_embedding], n_results=1)
Counting documents ==> len(collection.get()['documents'])

Debug School
