Debug School

rakesh kumar
rakesh kumar

Posted on • Updated on

How to embed complex data into vector using langchain framework

How to embediing dynamic collection of lists

from langchain.embeddings import OpenAIEmbeddings

# Initialize the OpenAIEmbeddings wrapper
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Example dynamic collection of lists
documents = [
    ["Alpha is the first letter of the Greek alphabet", "It is used in physics as a constant"],
    ["Beta is the second letter of the Greek alphabet", "It is often associated with testing"],
    ["Gamma is the third letter", "It is used in gamma rays"],
]

# Function to embed a collection of lists dynamically
def embed_dynamic_collection(collection):
    all_embeddings = []  # To store embeddings for all lists
    for doc_list in collection:
        embeddings_batch = embeddings.embed_documents(doc_list)  # Embed each sublist
        all_embeddings.append(embeddings_batch)
    return all_embeddings

# Generate embeddings
result = embed_dynamic_collection(documents)

# Print embeddings for the second document in the second list
print("Embedding for second document in the second list:", result[1][1])
Enter fullscreen mode Exit fullscreen mode

output

Example: result
Enter fullscreen mode Exit fullscreen mode

For simplicity, we'll represent embeddings with shortened values:

[
    [
        [0.001, -0.002, 0.003, ..., 0.0012],  # Embedding for "Alpha is the first letter..."
        [0.004, -0.001, 0.002, ..., 0.0023]   # Embedding for "It is used in physics..."
    ],
    [
        [0.002, -0.003, 0.001, ..., 0.0017],  # Embedding for "Beta is the second letter..."
        [0.003, -0.004, 0.002, ..., 0.0028]   # Embedding for "It is often associated..."
    ],
    [
        [0.001, -0.005, 0.002, ..., 0.0014],  # Embedding for "Gamma is the third letter..."
        [0.002, -0.006, 0.003, ..., 0.0031]   # Embedding for "It is used in gamma rays"
    ]
]
Enter fullscreen mode Exit fullscreen mode
Accessing result[1][1]
Enter fullscreen mode Exit fullscreen mode

This retrieves the embedding for the second document in the second sublist:

[0.003, -0.004, 0.002, ..., 0.0028]
Enter fullscreen mode Exit fullscreen mode

How to embediing dynamic collection of lists after flattening

Flattening the Collection
If the original documents collection looks like this:

documents = [
    ["Alpha is the first letter of the Greek alphabet", "It is used in physics as a constant"],
    ["Beta is the second letter of the Greek alphabet", "It is often associated with testing"],
    ["Gamma is the third letter", "It is used in gamma rays"],
]
Enter fullscreen mode Exit fullscreen mode

After flattening, it will become:

flattened_documents = [
    "Alpha is the first letter of the Greek alphabet",
    "It is used in physics as a constant",
    "Beta is the second letter of the Greek alphabet",
    "It is often associated with testing",
    "Gamma is the third letter",
    "It is used in gamma rays",
]
Enter fullscreen mode Exit fullscreen mode

This allows you to embed all the text in one batch.

from langchain.embeddings import OpenAIEmbeddings

# Initialize the OpenAIEmbeddings wrapper
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Original collection of documents (nested list)
documents = [
    ["Alpha is the first letter of the Greek alphabet", "It is used in physics as a constant"],
    ["Beta is the second letter of the Greek alphabet", "It is often associated with testing"],
    ["Gamma is the third letter", "It is used in gamma rays"],
]

# Flatten the collection
flattened_documents = [doc for sublist in documents for doc in sublist]

# Generate embeddings for the flattened list
all_embeddings = embeddings.embed_documents(flattened_documents)

# Print information about the embeddings
print("Number of documents:", len(flattened_documents))
print("Embedding for the first document:", all_embeddings[0][:5])  # Print first 5 values of first embedding
print("Embedding for the second document:", all_embeddings[1][:5])  # Print first 5 values of sec
Enter fullscreen mode Exit fullscreen mode

Sample Output

Number of documents: 6
Embedding for the first document: [0.0012, -0.0023, 0.0034, 0.0045, -0.0011]
Embedding for the second document: [0.0021, -0.0034, 0.0015, 0.0032, -0.0008]
Enter fullscreen mode Exit fullscreen mode

Image description

How to embed a dynamic collection of JSON data

from langchain.embeddings import OpenAIEmbeddings

# Initialize OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Example JSON data
data = [
    {"id": 1, "title": "Alpha is the first letter", "description": "Used in physics as a constant"},
    {"id": 2, "title": "Beta is the second letter", "description": "Associated with software testing"},
    {"id": 3, "title": "Gamma is the third letter", "description": "Used in gamma rays"},
]

# Function to embed JSON data dynamically
def embed_json_data(json_data, fields_to_embed):
    embeddings_result = {}  # To store embeddings by ID or a unique key
    for item in json_data:
        combined_text = " ".join([str(item[field]) for field in fields_to_embed if field in item])
        embeddings_result[item["id"]] = embeddings.embed_documents([combined_text])[0]
    return embeddings_result

# Specify fields to embed
fields = ["title", "description"]

# Generate embeddings
embedded_data = embed_json_data(data, fields)

# Output embeddings for each JSON item
for id, embedding in embedded_data.items():
    print(f"ID: {id}, Embedding: {embedding[:5]}...")  # Show first 5 values for brevity
Enter fullscreen mode Exit fullscreen mode

How to handling nested JSON and Batch Processing

from langchain.embeddings import OpenAIEmbeddings

# Initialize OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Example Nested JSON Data
data = [
    {
        "id": 1,
        "title": "Alpha is the first letter",
        "description": "Used in physics as a constant",
        "extra": {"category": "Greek alphabet", "importance": "High"}
    },
    {
        "id": 2,
        "title": "Beta is the second letter",
        "description": "Associated with software testing",
        "extra": {"category": "Greek alphabet", "importance": "Medium"}
    },
    {
        "id": 3,
        "title": "Gamma is the third letter",
        "description": "Used in gamma rays",
        "extra": {"category": "Physics", "importance": "High"}
    },
]

# Recursive function to extract and combine text fields from nested JSON
def extract_text_from_json(json_obj, fields_to_embed):
    combined_text = []
    for field in fields_to_embed:
        if isinstance(json_obj, dict):
            # Check for nested dictionaries
            if field in json_obj and isinstance(json_obj[field], dict):
                combined_text.append(extract_text_from_json(json_obj[field], json_obj[field].keys()))
            elif field in json_obj:
                combined_text.append(str(json_obj[field]))
        elif isinstance(json_obj, list):
            # Handle lists by applying the function recursively
            for item in json_obj:
                combined_text.append(extract_text_from_json(item, fields_to_embed))
    return " ".join(combined_text)

# Function to process JSON data in batches
def embed_json_data_in_batches(json_data, fields_to_embed, batch_size=2):
    embeddings_result = {}
    batch = []
    for item in json_data:
        combined_text = extract_text_from_json(item, fields_to_embed)
        batch.append((item["id"], combined_text))

        # Process batch when it reaches batch_size
        if len(batch) == batch_size:
            ids, texts = zip(*batch)
            batch_embeddings = embeddings.embed_documents(texts)
            embeddings_result.update({id_: embedding for id_, embedding in zip(ids, batch_embeddings)})
            batch = []  # Reset the batch

    # Process any remaining items
    if batch:
        ids, texts = zip(*batch)
        batch_embeddings = embeddings.embed_documents(texts)
        embeddings_result.update({id_: embedding for id_, embedding in zip(ids, batch_embeddings)})

    return embeddings_result

# Specify fields to embed
fields = ["title", "description", "extra"]

# Generate embeddings in batches
embedded_data = embed_json_data_in_batches(data, fields, batch_size=2)

# Output embeddings for each JSON item
for id, embedding in embedded_data.items():
    print(f"ID: {id}, Embedding: {embedding[:5]}...")  # Show first 5 values for brevity
Enter fullscreen mode Exit fullscreen mode

Example Output
For the provided nested JSON, the output might look like this:

ID: 1, Embedding: [0.0012, -0.0023, 0.0031, 0.0045, -0.0011]...
ID: 2, Embedding: [0.0024, -0.0034, 0.0017, 0.0039, -0.0009]...
ID: 3, Embedding: [0.0036, -0.0018, 0.0028, 0.0051, -0.0013]...
Enter fullscreen mode Exit fullscreen mode

Function: extract_text_from_json
This function recursively extracts text from the specified fields in a JSON object, including nested dictionaries and lists.

Input Example:

json_obj = {
    "id": 1,
    "title": "Alpha is the first letter",
    "description": "Used in physics as a constant",
    "extra": {"category": "Greek alphabet", "importance": "High"}
}
fields_to_embed = ["title", "description", "extra"]
Enter fullscreen mode Exit fullscreen mode

Image description

SUMMARY

How to embediing dynamic collection of lists using OpenAIEmbeddings through class
Initialize the OpenAIEmbeddings wrapper
declare or define dynamic collection of lists
Function to embed a collection of lists dynamically==>create empty list To store embeddings for all lists==>apply for loop==
inside for loop Embed each sublist
How to embediing dynamic collection of lists after flattening
flattened_documents = [doc for sublist in documents for doc in sublist]
embeddings.embed_documents(flattened_documents)==> len(flattened_documents)===>all_embeddings[0][:5]
How to embed a dynamic collection of JSON data

Top comments (0)