How to get a list of text strings using LangChain
text = """
LangChain provides various utilities for splitting large text documents.
TextSplitter is a powerful tool that allows splitting text into manageable chunks.
This helps in natural language processing tasks by limiting the input length.
Each chunk generated can then be used individually for processing.
You can customize the splitting behavior with options like separators and chunk sizes.
LangChain also supports recursive splitting for more control.
The library is designed to handle complex NLP workflows with ease.
LangChain is flexible, modular, and easy to integrate with other systems.
"""
Basic Character Text Splitter
Explanation: This splits a long text on a character separator (\n), creating chunks of up to 100 characters with a 20-character overlap between chunks.
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(text)
Output (first three chunks shown):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"This helps in natural language processing tasks by limiting the input length."
]
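To sanity-check the result, you can print each chunk with its length; a quick inspection loop (plain Python, not a LangChain API):
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} chars -> {chunk[:40]!r}")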
Splitting by Words
Explanation: LangChain has no WordTextSplitter; word-level splitting can be approximated with CharacterTextSplitter using a space separator. Note that chunk_size and chunk_overlap are still measured in characters (here 100-character chunks with a 20-character overlap).
from langchain.text_splitter import CharacterTextSplitter

# No WordTextSplitter exists; splitting on spaces approximates word-level splitting
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(text)
Output (illustrative, showing the overlap between consecutive chunks):
[
"LangChain provides various utilities for splitting large text documents TextSplitter",
"provides various utilities for splitting large text documents TextSplitter is a powerful tool",
"large text documents TextSplitter is a powerful tool that allows splitting text into manageable chunks."
]
Splitting by Sentence
Explanation: LangChain has no SentenceTextSplitter; sentence-level splitting is provided by NLTKTextSplitter (or SpacyTextSplitter). With a chunk_size close to one sentence's length, each chunk ends up holding a single sentence. An NLTK-free alternative is sketched after the output below.
from langchain.text_splitter import NLTKTextSplitter

# Requires nltk: pip install nltk, then nltk.download("punkt")
text_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=0)
chunks = text_splitter.split_text(text)
Output (first three chunks shown):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"This helps in natural language processing tasks by limiting the input length."
]
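If you would rather avoid the NLTK dependency, sentence-ish splitting can be approximated with RecursiveCharacterTextSplitter by putting the sentence terminator first in the separator list; a sketch (boundaries are approximate, since splitting falls back to newlines and spaces when a piece is still too long):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ". " first means the splitter prefers sentence ends, then newlines, then spaces
sentence_ish = RecursiveCharacterTextSplitter(separators=[". ", "\n", " ", ""], chunk_size=100, chunk_overlap=0)
chunks = sentence_ish.split_text(text)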
Splitting by Token Count
Assuming each word is roughly one token.
Explanation: Splits text by token count, creating chunks of up to 20 tokens with a 5-token overlap.
from langchain.text_splitter import TokenTextSplitter

# Token counting uses tiktoken: pip install tiktoken
text_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
chunks = text_splitter.split_text(text)
Output (illustrative; exact boundaries depend on the tokenizer):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"This helps in natural language processing tasks by limiting the input length."
]
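TokenTextSplitter counts tokens with tiktoken under the hood; to verify chunk sizes yourself (assuming tiktoken is installed), encode each chunk and count:
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # TokenTextSplitter's default encoding
for chunk in chunks:
    print(len(enc.encode(chunk)), repr(chunk[:40]))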
Recursive Character Text Splitter (multiple separators)
Explanation: Splits text recursively based on a priority of separators (\n\n, \n, space, and no separator), creating chunks of up to 100 characters with a 20-character overlap.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(text)
Output (first three chunks shown):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"This helps in natural language processing tasks by limiting the input length."
]
Customized Recursive Character Text Splitter with Custom Separators
Explanation: Uses a custom order of separators, prioritizing periods (.) and double newlines (\n\n), with chunks of up to 80 characters and a 20-character overlap.
text_splitter = RecursiveCharacterTextSplitter(separators=[".", "\n\n", " ", ""], chunk_size=80, chunk_overlap=20)
chunks = text_splitter.split_text(text)
Output (illustrative):
[
"LangChain provides various utilities for splitting large text documents. TextSplitter",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks. This helps",
"in natural language processing tasks by limiting the input length."
]
Splitting for Summarization (Shorter Chunks)
Explanation: Creates shorter chunks for summarization: 50-character chunks with a 10-character overlap. A summarization sketch follows the output below.
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=50, chunk_overlap=10)
chunks = text_splitter.split_text(text)
Output (first three chunks shown; lines longer than chunk_size stay whole and trigger a warning):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"This helps in natural language processing tasks by limiting the input length."
]
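Each short chunk can then be summarized on its own and the partial summaries combined; a sketch where summarize() is a hypothetical stand-in for your LLM or summarization chain:
def summarize(chunk: str) -> str:
    # Hypothetical placeholder: call your LLM / summarization chain here
    return chunk[:30] + "..."

summaries = [summarize(chunk) for chunk in chunks]
combined = " ".join(summaries)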
Splitting JSON Documents by Key (For NLP Tasks)
Explanation: LangChain has no JsonTextSplitter. To split JSON documents by a specific key (e.g., content), extract the string stored under that key, split it with an ordinary text splitter, and rewrap each chunk. (Structural splitting of nested JSON is covered by RecursiveJsonSplitter, shown near the end of this post.)
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
json_data = [{"content": text}]
chunks = [{"content": piece} for item in json_data for piece in text_splitter.split_text(item["content"])]
Output (illustrative):
[
{"content": "LangChain provides various utilities for splitting large text documents."},
{"content": "TextSplitter is a powerful tool that allows splitting text into manageable chunks."}
]
Splitting Based on Paragraphs (For Large Text Blocks)
Explanation: Splits text by paragraphs (\n\n) with a 100-character chunk size and no overlap; this only has an effect when the text actually contains blank-line paragraph breaks.
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=0)
chunks = text_splitter.split_text(text)
Output (illustrative):
[
"LangChain provides various utilities for splitting large text documents.",
"TextSplitter is a powerful tool that allows splitting text into manageable chunks."
]
Dynamic Chunk Size and Overlap Based on Text Length
Explanation: Dynamically adjusts chunk size to the text length: a 50-character chunk size for shorter texts (under 500 characters) and a 100-character chunk size otherwise, with a 10-character overlap.
chunk_size = 50 if len(text) < 500 else 100
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=chunk_size, chunk_overlap=10)
chunks = text_splitter.split_text(text)
Output (the sample text here is just over 500 characters, so the 100-character chunk size applies):
[
"LangChain provides various utilities for splitting large text documents. TextSplitter is a powerful tool.",
"This helps in natural language processing tasks by limiting the input length. Each chunk generated can",
"then be used individually for processing. You can customize the splitting behavior with options like",
"separators and chunk sizes."
]
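The same idea can be wrapped in a small helper so the size thresholds live in one place; a sketch (the thresholds are arbitrary examples):
def make_splitter(text: str) -> CharacterTextSplitter:
    # Shorter texts get smaller chunks
    chunk_size = 50 if len(text) < 500 else 100
    return CharacterTextSplitter(separator=" ", chunk_size=chunk_size, chunk_overlap=10)

chunks = make_splitter(text).split_text(text)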
from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
docs = loader.load()
docs
Output:
[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make.happiness and the peace which she has treasured. God helping her, she can do no other.', metadata={'source': 'speech.txt'})]
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(docs)
Output:
[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind..', metadata={'source': 'speech.txt'})]
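Note that split_documents carries the source metadata onto every resulting chunk, which you can confirm directly:
split_docs = text_splitter.split_documents(docs)
print(split_docs[0].metadata)  # {'source': 'speech.txt'} is preserved on each chunk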
How to print the first element of the list after splitting
speech = ""
with open("speech.txt") as f:
    speech = f.read()

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])
Document(page_content="LangChain is a library for building applications with large language models. It provides utilities for text splitting, text processing, and vari")
How to get a list of text strings on a specific filtering condition
from langchain.text_splitter import CharacterTextSplitter
# Step 1: Load your long text
text = """
LangChain provides various utilities for splitting large text documents.
TextSplitter is a powerful tool that allows splitting text into manageable chunks.
This helps in natural language processing tasks by limiting the input length.
Each chunk generated can then be used individually for processing.
You can customize the splitting behavior with options like separators and chunk sizes.
LangChain also supports recursive splitting for more control.
The library is designed to handle complex NLP workflows with ease.
LangChain is flexible, modular, and easy to integrate with other systems.
"""
# Step 2: Split the text into chunks
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=10)
chunks = text_splitter.split_text(text)
# Step 3: Extract specific parts from chunks
# 1. Get First Chunk After Splitting
first_chunk = chunks[0] if chunks else None
print("First Chunk:", first_chunk)
# 2. Get Last Chunk After Splitting
last_chunk = chunks[-1] if chunks else None
print("Last Chunk:", last_chunk)
# 3. Get Chunks with Specific Word (e.g., "LangChain")
specific_chunks = [chunk for chunk in chunks if "LangChain" in chunk]
print("Chunks Containing 'LangChain':", specific_chunks)
# 4. Get Chunks Longer than a Certain Length (e.g., 100 characters)
long_chunks = [chunk for chunk in chunks if len(chunk) > 100]
print("Chunks Longer than 100 Characters:", long_chunks)
# 5. Get Every Other Chunk (Odd-Indexed Chunks)
odd_chunks = [chunk for i, chunk in enumerate(chunks) if i % 2 == 1]
print("Odd-Indexed Chunks:", odd_chunks)
# 6. Get Only the First 5 Chunks
first_five_chunks = chunks[:5]
print("First 5 Chunks:", first_five_chunks)
# 7. Get Chunks Containing a Specific Phrase (e.g., "AI Model")
phrase_chunks = [chunk for chunk in chunks if "AI Model" in chunk]
print("Chunks Containing 'AI Model':", phrase_chunks)
# 8. Get All Chunks Except the First and Last
middle_chunks = chunks[1:-1] if len(chunks) > 2 else []
print("Middle Chunks (excluding first and last):", middle_chunks)
# 9. Get Chunks with a Certain Range of Indices (e.g., Indices 2 to 6)
range_chunks = chunks[2:7]
print("Chunks from Index 2 to 6:", range_chunks)
# 10. Extract Specific Words from Each Chunk (e.g., First Word in Each Chunk)
first_words = [chunk.split()[0] for chunk in chunks if chunk.split()]
print("First Words from Each Chunk:", first_words)
Output (the expected outputs below are illustrative; exact chunk boundaries depend on how the splitter merges the \n-separated lines):
- Get First Chunk After Splitting: After splitting, the first chunk is the first portion of the text that fits within chunk_size (100 characters), typically the first line or two.
Expected Output:
First Chunk: "LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks."
- Get Last Chunk After Splitting: The last chunk will be the remaining part of the text after all other chunks have been split, starting from the chunk overlap and ending with the final portion of the text.
Expected Output:
Last Chunk: "LangChain is flexible, modular, and easy to integrate with other systems."
- Get Chunks with Specific Word (e.g., "LangChain"): This will filter out chunks that contain the word "LangChain".
Expected Output:
Chunks Containing 'LangChain': [
"LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"LangChain also supports recursive splitting for more control.",
"LangChain is flexible, modular, and easy to integrate with other systems."
]
- Get Chunks Longer than a Certain Length (e.g., 100 characters): This returns chunks longer than 100 characters. Although chunk_size is 100, a chunk can exceed the limit when an indivisible split is longer than chunk_size, so some chunks may still pass this filter.
Expected Output:
Chunks Longer than 100 Characters: [
"LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control."
]
- Get Every Other Chunk (Odd-Indexed Chunks): This will return every second chunk, starting from the second chunk (index 1).
Expected Output:
Odd-Indexed Chunks: [
"LangChain also supports recursive splitting for more control.",
"LangChain is flexible, modular, and easy to integrate with other systems."
]
- Get Only the First 5 Chunks: This will return the first 5 chunks from the split text.
Expected Output:
First 5 Chunks: [
"LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
"You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control.",
"The library is designed to handle complex NLP workflows with ease.",
"LangChain is flexible, modular, and easy to integrate with other systems."
]
- Get Chunks Containing a Specific Phrase (e.g., "AI Model"): Since "AI Model" is not present in the provided text, this will return an empty list.
Expected Output:
Chunks Containing 'AI Model': []
- Get All Chunks Except the First and Last: This will return all chunks except the first and last one. If there are fewer than 3 chunks, this will return an empty list.
Expected Output:
Middle Chunks (excluding first and last): [
"You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control.",
"The library is designed to handle complex NLP workflows with ease."
]
- Get Chunks with a Certain Range of Indices (e.g., Indices 2 to 6): chunks[2:7] returns the chunks at indices 2 through 6, inclusive. Since there are fewer chunks than that here, it returns whatever is available from index 2 onward.
Expected Output:
Chunks from Index 2 to 6: [
"The library is designed to handle complex NLP workflows with ease.",
"LangChain is flexible, modular, and easy to integrate with other systems."
]
- Extract Specific Words from Each Chunk (e.g., First Word in Each Chunk): This will extract the first word from each chunk. The first word is obtained by splitting each chunk and taking the first element.
Expected Output:
First Words from Each Chunk: [
'LangChain', 'You', 'The', 'LangChain'
]
Explanation of Each Step
Step 1: Load the text to be split.
Step 2: Use CharacterTextSplitter with parameters to split the text based on new lines (\n) and define the chunk size and overlap.
Step 3: Use list comprehensions or indexing to extract specific parts of the chunks list (a reusable helper is sketched after this list), such as:
The first and last chunk
Chunks with specific keywords
Chunks with a certain length
Odd-indexed chunks
First 5 chunks
Chunks containing specific phrases
All chunks except the first and last
Chunks within a specified range
The first word from each chunk
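All of these filters share the same shape, so they can be folded into one helper that takes a predicate; a sketch (plain Python, not a LangChain API):
from typing import Callable, List

def filter_chunks(chunks: List[str], keep: Callable[[str], bool]) -> List[str]:
    # Return only the chunks for which keep(chunk) is True
    return [chunk for chunk in chunks if keep(chunk)]

langchain_chunks = filter_chunks(chunks, lambda c: "LangChain" in c)
long_chunks = filter_chunks(chunks, lambda c: len(c) > 100)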
Basic Text Splitting by Character Count
from langchain.text_splitter import CharacterTextSplitter

sample_text = ("LangChain is a powerful library for building applications with language models. "
               "It supports text splitting, tokenization, metadata handling, and more. "
               "With LangChain, you can process large documents and work with structured text effectively.")

# separator="" makes the splitter cut on raw characters at chunk_size
splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(sample_text)
print(chunks)
Output (illustrative):
['LangChain is a powerful library for building applicatio', 'ns with language models. It supports text splitting, tokeniz', 'ation, metadata handling, and more. With LangChain, you can p', 'rocess large documents and work with structured text effectiv', 'ely.']
- Recursive Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(separators=["\n", ".", " "], chunk_size=40, chunk_overlap=0)
chunks = recursive_splitter.split_text(sample_text)
print(chunks)
Output (illustrative):
['LangChain is a powerful library for building applications with language models', 'It supports text splitting, tokenization, metadata handling, and more', 'With LangChain, you can process large documents and work with structured text effectively']
- Token-Based Text Splitting
from langchain.text_splitter import TokenTextSplitter
token_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
chunks = token_splitter.split_text(sample_text)
print(chunks)
Output (illustrative; TokenTextSplitter returns chunks of roughly ten tokens each, not single words, and exact boundaries depend on the tokenizer):
['LangChain is a powerful library for building', 'applications with language models. It supports', ...]
- Sentence-Based Text Splitting
from langchain.text_splitter import NLTKTextSplitter

# LangChain has no SentenceTextSplitter; NLTKTextSplitter splits on sentence
# boundaries (pip install nltk, then nltk.download("punkt"))
sentence_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=0)  # ~one sentence per chunk
chunks = sentence_splitter.split_text(sample_text)
print(chunks)
Output:
['LangChain is a powerful library for building applications with language models.',
'It supports text splitting, tokenization, metadata handling, and more.',
'With LangChain, you can process large documents and work with structured text effectively.']
- Text Splitting with Metadata Addition
splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
docs = [{"text": chunk, "metadata": {"source": "sample"}} for chunk in texts]
print(docs)
Output (illustrative):
[{'text': 'LangChain is a powerful library for building applicatio', 'metadata': {'source': 'sample'}}, {'text': 'ns with language models. It supports text splitting, tokeniz', 'metadata': {'source': 'sample'}}, {'text': 'ation, metadata handling, and more. With LangChain, you can p', 'metadata': {'source': 'sample'}}, {'text': 'rocess large documents and work with structured text effectiv', 'metadata': {'source': 'sample'}}, {'text': 'ely.', 'metadata': {'source': 'sample'}}]
- HTML-Based Header Splitting
from langchain.text_splitter import HTMLHeaderTextSplitter
html_text = "<h1>Title</h1><p>This is the first paragraph.</p><h2>Subheading</h2><p>Another paragraph.</p>"
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")])  # parsing requires lxml
chunks = html_splitter.split_text(html_text)
print(chunks)
Output (HTMLHeaderTextSplitter returns Document objects; the headers above each piece land in metadata), e.g.:
[Document(page_content='This is the first paragraph.', metadata={'Header 1': 'Title'}),
Document(page_content='Another paragraph.', metadata={'Header 1': 'Title', 'Header 2': 'Subheading'})]
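Each returned item is a Document; the header hierarchy lands in its metadata, which you can inspect like this:
for doc in chunks:
    print(doc.metadata, "->", doc.page_content)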
- Create Documents with Custom Metadata
from datetime import datetime
splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
docs = [{"text": chunk, "metadata": {"timestamp": datetime.now().isoformat(), "author": "Admin"}} for chunk in texts]
print(docs)
Output (illustrative; timestamps reflect the actual run time):
[{'text': 'LangChain is a powerful library for building applicatio', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'ns with language models. It supports text splitting, tokeniz', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'ation, metadata handling, and more. With LangChain, you can p', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'rocess large documents and work with structured text effectiv', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'ely.', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}]
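LangChain can also attach metadata for you: create_documents accepts a metadatas list parallel to the input texts and returns Document objects instead of plain dicts:
docs = splitter.create_documents([sample_text], metadatas=[{"author": "Admin"}])  # one metadata dict per input text
print(docs[0].metadata)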
- JSON-Based Text Splitting
from langchain.text_splitter import CharacterTextSplitter

# LangChain has no JsonTextSplitter; extract the string under the key,
# split it, and rewrap each piece
json_data = [{"content": sample_text}]
json_splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
chunks = [{"content": piece} for item in json_data for piece in json_splitter.split_text(item["content"])]
print(chunks)
Output (illustrative):
[{'content': 'LangChain is a powerful library for building applicatio'}, {'content': 'ns with language models. It supports text splitting, tokeniz'}, {'content': 'ation, metadata handling, and more. With LangChain, you can p'}, {'content': 'rocess large documents and work with structured text effectiv'}, {'content': 'ely.'}]
- Sequential Numbering for Chunks
splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
numbered_chunks = [f"{i+1}: {chunk}" for i, chunk in enumerate(texts)]
print(numbered_chunks)
Output (illustrative):
['1: LangChain is a powerful library for building applicatio',
'2: ns with language models. It supports text splitting, tokeniz',
'3: ation, metadata handling, and more. With LangChain, you can p',
'4: rocess large documents and work with structured text effectiv',
'5: ely.']
Flatten Nested JSON Before Splitting
from langchain.text_splitter import CharacterTextSplitter

# LangChain has no flatten_json option; walk the nested structure yourself,
# collect the strings, then split them
nested_json = {"data": {"sections": [{"content": sample_text}]}}
contents = [section["content"] for section in nested_json["data"]["sections"]]
json_splitter = CharacterTextSplitter(separator="", chunk_size=50, chunk_overlap=0)
chunks = [{"content": piece} for c in contents for piece in json_splitter.split_text(c)]
print(chunks)
Output (illustrative):
[{'content': 'LangChain is a powerful library for building applicatio'}, {'content': 'ns with language models. It supports text splitting, tokeniz'}, {'content': 'ation, metadata handling, and more. With LangChain, you can p'}, {'content': 'rocess large documents and work with structured text effectiv'}, {'content': 'ely.'}]
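For genuinely nested JSON, LangChain does provide RecursiveJsonSplitter, which splits by structure rather than by string length; a minimal sketch (note it splits dicts by keys and will not cut a long string value):
from langchain_text_splitters import RecursiveJsonSplitter

nested = {"data": {"sections": {"a": "short text", "b": "another short text"}}}
json_splitter = RecursiveJsonSplitter(max_chunk_size=60)
json_chunks = json_splitter.split_json(json_data=nested)
print(json_chunks)  # list of dicts, each serialized to at most ~60 characters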
SUMMARY
How to get a list of text strings using LangChain
Basic Character Text Splitter==CharacterTextSplitter ||
Splitting by Words==CharacterTextSplitter(separator=" ")||Splitting by Sentence==NLTKTextSplitter
Splitting by Token Count==TokenTextSplitter||
Customized Recursive Character Text Splitter with Custom Separators==RecursiveCharacterTextSplitter
How to print first element of list after split==text[0],text[1]
How to get list of string text on specific filtering condition
====
First Chunk/first element of list(chunks[0])====Last Chunk After Splitting(chunks[-1])
Chunks with Specific Word==chunk for chunk in chunks if "LangChain" in chunk
Chunks Longer than a Certain Length==[chunk for chunk in chunks if len(chunk) > 100]
First 5 Chunks(chunks[:5])
Extract Specific Words from Each Chunk (e.g., First Word in Each Chunk)==[chunk.split()[0] for chunk in chunks if chunk.split()]
Sequential Numbering for Chunks==numbered_chunks = [f"{i+1}: {chunk}" for i, chunk in enumerate(texts)]
jsonsplitting==extract the key's string, split, and rewrap||nested JSON==RecursiveJsonSplitter(max_chunk_size=...)