Debug School

rakesh kumar

List of Data Transformer Commands Using the LangChain Framework

How to get a list of text strings using LangChain

text= """
LangChain provides various utilities for splitting large text documents.
TextSplitter is a powerful tool that allows splitting text into manageable chunks.
This helps in natural language processing tasks by limiting the input length.
Each chunk generated can then be used individually for processing.
You can customize the splitting behavior with options like separators and chunk sizes.
LangChain also supports recursive splitting for more control.
The library is designed to handle complex NLP workflows with ease.
LangChain is flexible, modular, and easy to integrate with other systems.
"""

Basic Character Text Splitter
Explanation: This splits a long text on a character separator (\n), creating chunks of up to 100 characters with a 20-character overlap between chunks.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "This helps in natural language processing tasks by limiting the input length."
]
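
To see how chunk_size and chunk_overlap interact in practice, it helps to print each chunk's index and length; a small sketch reusing the splitter above:

# Inspect the chunks produced above: index, character length, and a preview.
for i, chunk in enumerate(chunks):
    print(i, len(chunk), repr(chunk[:40]))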

Splitting by Words
Explanation: Splits text by words, creating chunks of up to 20 words with a 5-word overlap between chunks. LangChain does not ship a WordTextSplitter; word-based splitting can be approximated with CharacterTextSplitter by splitting on spaces and measuring chunk length in words.

from langchain.text_splitter import CharacterTextSplitter

# Approximate word-based splitting: split on spaces, count length in words.
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=5,
                                      length_function=lambda t: len(t.split()))
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents TextSplitter",
    "provides various utilities for splitting large text documents TextSplitter is a powerful tool",
    "large text documents TextSplitter is a powerful tool that allows splitting text into manageable chunks."
]

Splitting by Sentence
Explanation: Splits text on sentence boundaries, packing whole sentences into chunks of up to roughly 100 characters with no overlap. LangChain does not ship a SentenceTextSplitter; NLTKTextSplitter (or SpacyTextSplitter) provides sentence-aware splitting.

from langchain.text_splitter import NLTKTextSplitter

# Sentence-aware splitting: sentences are never cut in the middle
# (requires nltk and its "punkt" tokenizer data).
text_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=0)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "This helps in natural language processing tasks by limiting the input length."
]
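
If nltk's sentence tokenizer data is not installed yet, download it once first (standard nltk setup; on newer nltk versions the resource may be named punkt_tab):

import nltk
nltk.download("punkt")  # one-time download of the sentence tokenizer data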

Splitting by Token Count
Explanation: Splits text by token count (assuming each short word is roughly one token), creating chunks of up to 20 tokens with a 5-token overlap.

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "This helps in natural language processing tasks by limiting the input length."
]
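
If chunk sizes must be measured in the tokens of a specific model, TokenTextSplitter accepts an encoding name, and the tiktoken package can verify the result. A minimal sketch, assuming tiktoken is installed:

from langchain.text_splitter import TokenTextSplitter
import tiktoken

# Split using the cl100k_base encoding (used by many OpenAI models).
token_splitter = TokenTextSplitter(encoding_name="cl100k_base", chunk_size=20, chunk_overlap=5)
token_chunks = token_splitter.split_text(text)

# Verify that no chunk exceeds the 20-token budget.
enc = tiktoken.get_encoding("cl100k_base")
assert all(len(enc.encode(c)) <= 20 for c in token_chunks)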

Recursive Character Text Splitter (multiple separators)
Explanation: Splits text recursively based on a priority of separators (\n\n, \n, space, and no separator), creating chunks of up to 100 characters with a 20-character overlap.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=100, chunk_overlap=20)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "This helps in natural language processing tasks by limiting the input length."
]
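
The same splitter can also emit Document objects directly, which is useful when each chunk should carry metadata. A minimal sketch (the "inline" source label is just an example value):

docs = text_splitter.create_documents([text], metadatas=[{"source": "inline"}])
print(docs[0].page_content)
print(docs[0].metadata)  # {'source': 'inline'}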

Customized Recursive Character Text Splitter with Custom Separators
Explanation: Uses a custom order of separators, prioritizing periods (.) over double newlines (\n\n), with chunks of up to 80 characters and a 20-character overlap.

text_splitter = RecursiveCharacterTextSplitter(separators=[".", "\n\n", " ", ""], chunk_size=80, chunk_overlap=20)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents. TextSplitter",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks. This helps",
    "in natural language processing tasks by limiting the input length."
]

Splitting for Summarization (Shorter Chunks)
Explanation: Creates shorter chunks suited to summarization, with 50-character chunks and a 10-character overlap.

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=50, chunk_overlap=10)
chunks = text_splitter.split_text(text)

Output:

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "This helps in natural language processing tasks by limiting the input length."
]

Splitting JSON Documents by Key (For NLP Tasks)
Explanation: Splits the text stored under a specific key (e.g., content) in a list of JSON records, with 100-character chunks and a 20-character overlap. LangChain does not ship a JsonTextSplitter; a simple approach is to split each record's text with a regular splitter and wrap the chunks back under the same key.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=20)
json_data = [{"content": text}]
# Split each record's "content" and re-wrap the chunks under the key.
chunks = [{"content": c} for record in json_data
          for c in text_splitter.split_text(record["content"])]

Output (illustrative):

[
    {"content": "LangChain provides various utilities for splitting large text documents."},
    {"content": "TextSplitter is a powerful tool that allows splitting text into manageable chunks."}
]
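
For genuinely nested JSON, LangChain does provide RecursiveJsonSplitter (in langchain_text_splitters), which splits a dict by keys so that each piece stays under a size budget. A minimal sketch with a toy dict:

from langchain_text_splitters import RecursiveJsonSplitter

nested = {"a": {"b": "some text", "c": "more text"}, "d": "other text"}
json_splitter = RecursiveJsonSplitter(max_chunk_size=50)
json_chunks = json_splitter.split_json(json_data=nested)  # list of sub-dicts
print(json_chunks)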

Splitting Based on Paragraphs (For Large Text Blocks)
Explanation: Splits text by paragraphs (\n\n) with a 100-character chunk size and no overlap. Since the sample text above has no blank lines, it would come back as a single (oversized) chunk; the output below assumes a text with paragraph breaks.

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=0)
chunks = text_splitter.split_text(text)

Output (illustrative, assuming a text with paragraph breaks):

[
    "LangChain provides various utilities for splitting large text documents.",
    "TextSplitter is a powerful tool that allows splitting text into manageable chunks."
]

Dynamic Chunk Size and Overlap Based on Text Length
Explanation: Dynamically adjusts chunk size based on text length: a 50-character chunk size for shorter text (under 500 characters) and a 100-character chunk size for longer text, with a 10-character overlap.

chunk_size = 50 if len(text) < 500 else 100
text_splitter = CharacterTextSplitter(separator=" ", chunk_size=chunk_size, chunk_overlap=10)
chunks = text_splitter.split_text(text)

Output (the sample text is over 500 characters, so the 100-character chunk size applies):

[
    "LangChain provides various utilities for splitting large text documents. TextSplitter is a powerful tool.",
    "This helps in natural language processing tasks by limiting the input length. Each chunk generated can",
    "then be used individually for processing. You can customize the splitting behavior with options like",
    "separators and chunk sizes."
]
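
The same idea can be wrapped in a small helper so the thresholds live in one place. A sketch (the choose_splitter helper and its thresholds are illustrative, not part of LangChain):

from langchain.text_splitter import CharacterTextSplitter

def choose_splitter(text: str) -> CharacterTextSplitter:
    # Illustrative rule: shorter texts get smaller chunks.
    chunk_size = 50 if len(text) < 500 else 100
    return CharacterTextSplitter(separator=" ", chunk_size=chunk_size, chunk_overlap=10)

chunks = choose_splitter(text).split_text(text)
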
Loading Text from a File

from langchain_community.document_loaders import TextLoader

loader = TextLoader('speech.txt')
docs = loader.load()
docs

Output:

[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make.happiness and the peace which she has treasured. God helping her, she can do no other.', metadata={'source': 'speech.txt'})]
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(docs)

Output:

[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind..', metadata={'source': 'speech.txt'})]

How to print the first elements of the list

speech=""
with open("speech.txt") as f:
    speech=f.read()


text_splitter=CharacterTextSplitter(chunk_size=100,chunk_overlap=20)
text=text_splitter.create_documents([speech])
print(text[0])
print(text[1])
Document(page_content="LangChain is a library for building applications with large language models. It provides utilities for text splitting, text processing, and vari")
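
Each element returned by create_documents is a Document object, so the chunk text and metadata are available as attributes:

for doc in text[:2]:
    print(doc.page_content)  # the chunk text
    print(doc.metadata)      # empty here; loaders typically fill in a source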

How to get a list of text strings matching a specific filtering condition

from langchain.text_splitter import CharacterTextSplitter

# Step 1: Load your long text
text = """
LangChain provides various utilities for splitting large text documents.
TextSplitter is a powerful tool that allows splitting text into manageable chunks.
This helps in natural language processing tasks by limiting the input length.
Each chunk generated can then be used individually for processing.
You can customize the splitting behavior with options like separators and chunk sizes.
LangChain also supports recursive splitting for more control.
The library is designed to handle complex NLP workflows with ease.
LangChain is flexible, modular, and easy to integrate with other systems.
"""

# Step 2: Split the text into chunks
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=100, chunk_overlap=10)
chunks = text_splitter.split_text(text)

# Step 3: Extract specific parts from chunks

# 1. Get First Chunk After Splitting
first_chunk = chunks[0] if chunks else None
print("First Chunk:", first_chunk)

# 2. Get Last Chunk After Splitting
last_chunk = chunks[-1] if chunks else None
print("Last Chunk:", last_chunk)

# 3. Get Chunks with Specific Word (e.g., "LangChain")
specific_chunks = [chunk for chunk in chunks if "LangChain" in chunk]
print("Chunks Containing 'LangChain':", specific_chunks)

# 4. Get Chunks Longer than a Certain Length (e.g., 100 characters)
long_chunks = [chunk for chunk in chunks if len(chunk) > 100]
print("Chunks Longer than 100 Characters:", long_chunks)

# 5. Get Every Other Chunk (Odd-Indexed Chunks)
odd_chunks = [chunk for i, chunk in enumerate(chunks) if i % 2 == 1]
print("Odd-Indexed Chunks:", odd_chunks)

# 6. Get Only the First 5 Chunks
first_five_chunks = chunks[:5]
print("First 5 Chunks:", first_five_chunks)

# 7. Get Chunks Containing a Specific Phrase (e.g., "AI Model")
phrase_chunks = [chunk for chunk in chunks if "AI Model" in chunk]
print("Chunks Containing 'AI Model':", phrase_chunks)

# 8. Get All Chunks Except the First and Last
middle_chunks = chunks[1:-1] if len(chunks) > 2 else []
print("Middle Chunks (excluding first and last):", middle_chunks)

# 9. Get Chunks with a Certain Range of Indices (e.g., Indices 2 to 6)
range_chunks = chunks[2:7]
print("Chunks from Index 2 to 6:", range_chunks)

# 10. Extract Specific Words from Each Chunk (e.g., First Word in Each Chunk)
first_words = [chunk.split()[0] for chunk in chunks if chunk.split()]
print("First Words from Each Chunk:", first_words)

Output

1. Get First Chunk After Splitting: After splitting, the first chunk is the first portion of the text, up to the chunk_size (100 characters); it will typically contain the first sentence or part of the text.

Expected Output:

First Chunk: "LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks."

2. Get Last Chunk After Splitting: The last chunk is whatever text remains after the earlier chunks, beginning with the chunk overlap and running to the end of the text.

Expected Output:

Last Chunk: "LangChain is flexible, modular, and easy to integrate with other systems."

3. Get Chunks with a Specific Word (e.g., "LangChain"): Filters for chunks that contain the word "LangChain".

Expected Output:

Chunks Containing 'LangChain': [
    "LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "LangChain also supports recursive splitting for more control.",
    "LangChain is flexible, modular, and easy to integrate with other systems."
]

4. Get Chunks Longer than a Certain Length (e.g., 100 characters): Returns chunks longer than 100 characters. Since the chunk size is set to 100 with an overlap of 10, this may return chunks that ended up slightly longer than the limit.

Expected Output:

Chunks Longer than 100 Characters: [
    "LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control."
]

5. Get Every Other Chunk (Odd-Indexed Chunks): Returns every second chunk, starting from the second chunk (index 1).

Expected Output:

Odd-Indexed Chunks: [
    "LangChain also supports recursive splitting for more control.",
    "LangChain is flexible, modular, and easy to integrate with other systems."
]

6. Get Only the First 5 Chunks: Returns the first 5 chunks of the split text.

Expected Output:

First 5 Chunks: [
    "LangChain provides various utilities for splitting large text documents.\nTextSplitter is a powerful tool that allows splitting text into manageable chunks.",
    "You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control.",
    "The library is designed to handle complex NLP workflows with ease.",
    "LangChain is flexible, modular, and easy to integrate with other systems."
]

7. Get Chunks Containing a Specific Phrase (e.g., "AI Model"): Since "AI Model" does not appear in the provided text, this returns an empty list.

Expected Output:

Chunks Containing 'AI Model': []

8. Get All Chunks Except the First and Last: Returns all chunks except the first and last one. With fewer than 3 chunks, this returns an empty list.

Expected Output:

Middle Chunks (excluding first and last): [
    "You can customize the splitting behavior with options like separators and chunk sizes.\nLangChain also supports recursive splitting for more control.",
    "The library is designed to handle complex NLP workflows with ease."
]

9. Get Chunks in a Certain Range of Indices (e.g., 2 to 6): Returns the chunks at indices 2 through 6, inclusive. Since there are fewer than 7 chunks here, it returns what is available from index 2 onward.

Expected Output:

Chunks from Index 2 to 6: [
    "The library is designed to handle complex NLP workflows with ease.",
    "LangChain is flexible, modular, and easy to integrate with other systems."
]

10. Extract Specific Words from Each Chunk (e.g., the First Word): Extracts the first word from each chunk by splitting on whitespace and taking the first element.

Expected Output:

First Words from Each Chunk: [
    'LangChain', 'You', 'The', 'LangChain'
]

Explanation of Each Step

Step 1: Load the text to be split.
Step 2: Use CharacterTextSplitter with parameters to split the text based on new lines (\n) and define the chunk size and overlap.
Step 3: Use list comprehensions or indexing to extract specific parts of the chunks list, such as:
The first and last chunk
Chunks with specific keywords
Chunks with a certain length
Odd-indexed chunks
First 5 chunks
Chunks containing specific phrases
All chunks except the first and last
Chunks within a specified range
The first word from each chunk
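
These extraction patterns compose naturally; for example, a keyword filter can be combined with an index range (a small illustrative sketch):

# Chunks at indices 1-4 that mention "LangChain"
selected = [c for c in chunks[1:5] if "LangChain" in c]
print(selected)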

1. Basic Text Splitting by Character Count

from langchain.text_splitter import CharacterTextSplitter

# sample_text reconstructed from the outputs shown below.
sample_text = (
    "LangChain is a powerful library for building applications with language models. "
    "It supports text splitting, tokenization, metadata handling, and more. "
    "With LangChain, you can process large documents and work with structured text effectively."
)

# Split on spaces so chunks break at word boundaries.
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(sample_text)
print(chunks)

Output (illustrative; chunks break on spaces):

['LangChain is a powerful library for building', 'applications with language models. It supports', 'text splitting, tokenization, metadata handling,', 'and more. With LangChain, you can process large', 'documents and work with structured text effectively.']

2. Recursive Text Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(separators=["\n", ".", " "], chunk_size=40)
chunks = recursive_splitter.split_text(sample_text)
print(chunks)

Output (illustrative; with chunk_size=40 the splitter recurses from sentences down to word boundaries):

['LangChain is a powerful library for', 'building applications with language models', 'It supports text splitting, tokenization,', 'metadata handling, and more', 'With LangChain, you can process large', 'documents and work with structured text', 'effectively']

3. Token-Based Text Splitting
from langchain.text_splitter import TokenTextSplitter

token_splitter = TokenTextSplitter(chunk_size=10)
chunks = token_splitter.split_text(sample_text)
print(chunks)

Output (illustrative; chunks of roughly 10 tokens each, with boundaries depending on the tokenizer):

['LangChain is a powerful library for building applications with', ' language models. It supports text splitting, tokenization,', ' metadata handling, and more. With LangChain, you can', ' process large documents and work with structured text effectively.']

4. Sentence-Based Text Splitting
from langchain.text_splitter import NLTKTextSplitter  # LangChain has no SentenceTextSplitter

# Sentence-aware splitting: whole sentences are packed into ~100-character
# chunks (requires nltk and its "punkt" tokenizer data).
sentence_splitter = NLTKTextSplitter(chunk_size=100, chunk_overlap=0)
chunks = sentence_splitter.split_text(sample_text)
print(chunks)

Output:

['LangChain is a powerful library for building applications with language models.',
 'It supports text splitting, tokenization, metadata handling, and more.',
 'With LangChain, you can process large documents and work with structured text effectively.']

5. Text Splitting with Metadata Addition
# Explicit separator and overlap: with the defaults, this one-paragraph text
# would not split, and the default chunk_overlap of 200 would exceed
# chunk_size=50 and raise an error.
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
docs = [{"text": chunk, "metadata": {"source": "sample"}} for chunk in texts]
print(docs)

Output (illustrative):

[{'text': 'LangChain is a powerful library for building', 'metadata': {'source': 'sample'}}, {'text': 'applications with language models. It supports', 'metadata': {'source': 'sample'}}, {'text': 'text splitting, tokenization, metadata handling,', 'metadata': {'source': 'sample'}}, {'text': 'and more. With LangChain, you can process large', 'metadata': {'source': 'sample'}}, {'text': 'documents and work with structured text effectively.', 'metadata': {'source': 'sample'}}]

6. HTML-Based Header Splitting
from langchain.text_splitter import HTMLHeaderTextSplitter

html_text = "<h1>Title</h1><p>This is the first paragraph.</p><h2>Subheading</h2><p>Another paragraph.</p>"
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")])
chunks = html_splitter.split_text(html_text)
print(chunks)

Output (illustrative; HTMLHeaderTextSplitter returns Document objects whose metadata records the matched headers):

[Document(page_content='This is the first paragraph.', metadata={'Header 1': 'Title'}),
 Document(page_content='Another paragraph.', metadata={'Header 1': 'Title', 'Header 2': 'Subheading'})]
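
Each returned Document carries the matched header values in its metadata, which makes it easy to route or filter sections; a small sketch based on the splitter above:

for doc in chunks:
    print(doc.metadata, "->", doc.page_content)
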
7. Create Documents with Custom Metadata
from datetime import datetime

# Explicit separator and overlap: the default chunk_overlap of 200 would
# exceed chunk_size=50 and raise an error.
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
docs = [{"text": chunk, "metadata": {"timestamp": datetime.now().isoformat(), "author": "Admin"}} for chunk in texts]
print(docs)

Output (illustrative):

[{'text': 'LangChain is a powerful library for building', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'applications with language models. It supports', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'text splitting, tokenization, metadata handling,', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'and more. With LangChain, you can process large', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}, {'text': 'documents and work with structured text effectively.', 'metadata': {'timestamp': '2024-11-14T13:30:00', 'author': 'Admin'}}]

8. JSON-Based Text Splitting
from langchain.text_splitter import CharacterTextSplitter

# LangChain has no JsonTextSplitter; split each record's text with a regular
# splitter and wrap the chunks back under the original key.
json_data = [{"content": sample_text}]
json_splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)
chunks = [{"content": c} for record in json_data
          for c in json_splitter.split_text(record["content"])]
print(chunks)

Output (illustrative):

[{'content': 'LangChain is a powerful library for building'}, {'content': 'applications with language models. It supports'}, {'content': 'text splitting, tokenization, metadata handling,'}, {'content': 'and more. With LangChain, you can process large'}, {'content': 'documents and work with structured text effectively.'}]

9. Sequential Numbering for Chunks
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)
texts = splitter.split_text(sample_text)
numbered_chunks = [f"{i+1}: {chunk}" for i, chunk in enumerate(texts)]
print(numbered_chunks)

Output (illustrative):

['1: LangChain is a powerful library for building',
 '2: applications with language models. It supports',
 '3: text splitting, tokenization, metadata handling,',
 '4: and more. With LangChain, you can process large',
 '5: documents and work with structured text effectively.']

10. Flatten Nested JSON Before Splitting

from langchain.text_splitter import CharacterTextSplitter

# LangChain has no JsonTextSplitter; flatten the nested structure by hand and
# split the extracted text. (langchain_text_splitters.RecursiveJsonSplitter is
# an alternative for key-wise splitting of nested JSON.)
nested_json = {"data": {"sections": [{"content": sample_text}]}}
contents = [section["content"] for section in nested_json["data"]["sections"]]
splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)
chunks = [{"content": c} for content in contents for c in splitter.split_text(content)]
print(chunks)

Output (illustrative):

[{'content': 'LangChain is a powerful library for building'}, {'content': 'applications with language models. It supports'}, {'content': 'text splitting, tokenization, metadata handling,'}, {'content': 'and more. With LangChain, you can process large'}, {'content': 'documents and work with structured text effectively.'}]

SUMMARY

How to get a list of text strings using LangChain:

Basic character text splitting: CharacterTextSplitter
Splitting by words: CharacterTextSplitter with a space separator and a word-count length function (LangChain has no WordTextSplitter)
Splitting by sentence: NLTKTextSplitter or SpacyTextSplitter (LangChain has no SentenceTextSplitter)
Splitting by token count: TokenTextSplitter
Recursive splitting with custom separators: RecursiveCharacterTextSplitter

How to print the first elements of the list after splitting: text[0], text[1]

How to get a list of text strings matching a specific filtering condition:

First chunk: chunks[0]; last chunk after splitting: chunks[-1]
Chunks with a specific word: [chunk for chunk in chunks if "LangChain" in chunk]
Chunks longer than a certain length: [chunk for chunk in chunks if len(chunk) > 100]
First 5 chunks: chunks[:5]
First word of each chunk: [chunk.split()[0] for chunk in chunks if chunk.split()]
Sequential numbering for chunks: [f"{i+1}: {chunk}" for i, chunk in enumerate(texts)]
JSON splitting: split each record's text and re-wrap it under its key, e.g. [{"content": c} for c in splitter.split_text(record["content"])]
