What is a Collection in chromadb
Purpose of Using a Collection
What is a Collection in chromadb?
A collection in chromadb is a container or namespace that organizes and stores related data (e.g., documents, embeddings). It allows you to:
Group Related Data:
Documents and their corresponding embeddings are stored within a specific collection.
You can have multiple collections, each for different datasets or projects.
Facilitate Similarity Search:
Collections are designed to support vector search operations, where you query a vector and retrieve the most similar documents in the collection.
Manage Metadata:
Collections can store additional information (e.g., document IDs, metadata) to make queries more meaningful.
Purpose of Using a Collection
Storing Embeddings:
You compute embeddings for documents using a model like OpenAI’s text-embedding-ada-002 and store them in the collection.
Querying for Similarity:
Once embeddings are stored, you can query the collection with a new embedding (e.g., a search query) and retrieve similar documents based on cosine similarity or other distance metrics.
Efficient Retrieval:
Collections optimize the storage and retrieval of embeddings, making similarity search faster and scalable.
Adding Documents to the Collection
import openai
import chromadb
# Set your OpenAI API key
openai.api_key = "YOUR_OPENAI_API_KEY"
# Initialize Chroma client
client = chromadb.Client()
# Create a Chroma collection
collection = client.create_collection("documents")
# Documents to add
documents = [
"Alpha is the first letter of the Greek alphabet.",
"Beta is the second letter of the Greek alphabet.",
"Gamma is the third letter."
]
# Compute embeddings using OpenAI API
def get_embedding(text):
response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
return response['data'][0]['embedding']
# Add documents and their embeddings to the collection
for idx, doc in enumerate(documents):
embedding = get_embedding(doc)
collection.add(documents=[doc], embeddings=[embedding], ids=[str(idx)])
print("Documents added to the collection.")
Retrieve All Documents in the Collection
# Get all documents in the collection
all_documents = collection.get()
print("Documents in the collection:", all_documents)
Output:
Querying the Collection
# Query a similar document
query = "What is the first letter of the Greek alphabet?"
query_embedding = get_embedding(query)
# Perform a similarity search
results = collection.query(query_embeddings=[query_embedding], n_results=2)
print("Query Results:")
for doc, score in zip(results["documents"][0], results["distances"][0]):
print(f"Document: {doc}, Similarity Score: {score}")
Expected Output
After Adding Documents:
Documents added to the collection.
Query Results:
For the query "What is the first letter of the Greek alphabet?", the output might look like:
Query Results:
Document: Alpha is the first letter of the Greek alphabet., Similarity Score: 0.12
Document: Beta is the second letter of the Greek alphabet., Similarity Score: 0.45
Query the Collection for Similar Documents
You can query the collection for the most similar document to a given text.
# Query for similar documents
query = "What is the first letter of the Greek alphabet?"
query_embedding = get_embedding(query)
# Perform similarity search
results = collection.query(query_embeddings=[query_embedding], n_results=1)
print("Query Result:", results)
Output
:
Query Result: {'ids': ['0'],
'documents': ['Alpha is the first letter of the Greek alphabet.'],
'distances': [0.12]}
Here:
ids: The ID of the most similar document.
documents: The most similar document.
distances: The similarity score (lower values indicate higher similarity).
- Check Collection Metadata You can inspect the collection metadata to see how many documents and embeddings are stored.
# Check metadata
print("Number of documents in the collection:", len(collection.get()['documents']))
Output
:
Number of documents in the collection: 3
SUMMARY
What is a Collection in chromadb?===>chromadb is a container or namespace ==>stores documents, embeddings
Group Related Data||Facilitate Similarity Search||Manage Metadata
Purpose of Using a Collection===>Storing Embeddings||Querying for Similarity||Efficient Retrieval:
Adding text data and vectordata(embedded data) to chroma collection
adding text===>chromadb.Client().create_collection("documents") where document contain list of text data
adding embedded data====>for idx, doc in enumerate(documents):===>embedding = get_embedding(doc)
collection.add(documents=[doc], embeddings=[embedding], ids=[str(idx)])==>collection outputs--> ids,documents,distances
Querying the Collection==>collection.query(query_embeddings=[query_embedding], n_results=2)
for doc, score in zip(results["documents"][0], results["distances"][0]):==>print(f"Document: {doc}, Similarity Score: {score}")
Query the Collection for Similar Documents===>collection.query(query_embeddings=[query_embedding], n_results=1) where n_results=1 most similar==>len(collection.get()['documents'])
Top comments (0)