Text Splitter Example
This example splits a long text into sentence chunks using a simple separator-based splitter. (A note on naming: several splitter classes in these snippets, such as TextSplitter, ParagraphSplitter, and SentenceSplitter, are simplified for illustration; the published langchain-text-splitters package exposes classes such as CharacterTextSplitter, RecursiveCharacterTextSplitter, and RecursiveJsonSplitter.)
from langchain.text_splitters import TextSplitter
# Example long text
text = "This is the first sentence. This is the second sentence. This is the third sentence."
# Split the text into sentences
splitter = TextSplitter(separator=".")
text_chunks = splitter.split_text(text)
# Output each chunk
for chunk in text_chunks:
print(chunk)
Output:
This is the first sentence
This is the second sentence
This is the third sentence
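Separator-based splitting like this can be reproduced with plain Python, which is handy for checking what a given separator will produce. This is a minimal stdlib sketch, not the library's implementation:

```python
# Minimal sketch of separator-based splitting: split on the separator,
# strip whitespace, and drop empty chunks.
def split_on_separator(text: str, separator: str) -> list[str]:
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]

text = "This is the first sentence. This is the second sentence. This is the third sentence."
for chunk in split_on_separator(text, "."):
    print(chunk)
```

Note that splitting on "." consumes the periods, which is why they are missing from the output above.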
JSON Splitter Example
This example splits a JSON object into smaller parts based on a specified chunk size.
from langchain_text_splitters import RecursiveJsonSplitter
# Example JSON data
json_data = {
"item1": "This is item 1",
"item2": "This is item 2",
"item3": "This is item 3"
}
# Split the JSON data
json_splitter = RecursiveJsonSplitter(max_chunk_size=20)
json_chunks = json_splitter.split_json(json_data)
# Output each chunk
for chunk in json_chunks:
print(chunk)
Output:
{'item1': 'This is item 1'}
{'item2': 'This is item 2'}
{'item3': 'This is item 3'}
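The size-bounded behavior can be approximated in plain Python. This is a simplified sketch that packs top-level key/value pairs by serialized length; unlike the real RecursiveJsonSplitter, it does not recurse into nested objects:

```python
import json

# Simplified sketch: pack key/value pairs into dicts whose JSON
# serialization stays within max_chunk_size characters.
def split_json_by_size(data: dict, max_chunk_size: int) -> list[dict]:
    chunks, current = [], {}
    for key, value in data.items():
        candidate = {**current, key: value}
        if current and len(json.dumps(candidate)) > max_chunk_size:
            chunks.append(current)
            current = {key: value}
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

json_data = {
    "item1": "This is item 1",
    "item2": "This is item 2",
    "item3": "This is item 3"
}
for chunk in split_json_by_size(json_data, max_chunk_size=30):
    print(chunk)
```

With max_chunk_size=30, each item lands in its own chunk, matching the output shown above.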
Unstructured Text Splitter Example
This example splits unstructured text (e.g., paragraphs) into smaller chunks using a line-based separator.
from langchain.text_splitters import LineSplitter
# Example unstructured text
unstructured_text = """
This is a paragraph.
Here is another paragraph.
And here's the third paragraph.
"""
# Split text into chunks based on lines
line_splitter = LineSplitter(separator="\n")
line_chunks = line_splitter.split_text(unstructured_text)
# Output each chunk
for chunk in line_chunks:
print(chunk)
Output:
This is a paragraph.
Here is another paragraph.
And here's the third paragraph.
URL Splitter Example
This example splits a webpage's content (retrieved from a URL) into smaller chunks based on paragraphs.
import requests
from langchain.text_splitters import ParagraphSplitter
# Example URL (replace with a real URL)
url = "https://example.com"
# Fetch content from the URL
response = requests.get(url)
web_content = response.text
# Split web content into paragraphs
paragraph_splitter = ParagraphSplitter()
paragraph_chunks = paragraph_splitter.split_text(web_content)
# Output each chunk
for chunk in paragraph_chunks:
print(chunk)
Example Scenario:
If the webpage content (web_content) fetched looks like this:
<!DOCTYPE html>
<html>
<head>
<title>Example Page</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second paragraph.</p>
<div>
This is some content outside a paragraph tag.
</div>
</body>
</html>
Expected Output:
The ParagraphSplitter would process the above content and split it into paragraphs. The output could look something like:
This is the first paragraph.
This is the second paragraph.
This is some content outside a paragraph tag.
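ParagraphSplitter is a stand-in here; pulling paragraph text out of raw HTML really calls for an HTML parser rather than string splitting. A stdlib sketch using html.parser that collects only the <p> contents:

```python
from html.parser import HTMLParser

# Collect the text inside each <p>...</p> element.
class ParagraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

extractor = ParagraphExtractor()
extractor.feed("<p>This is the first paragraph.</p><p>This is the second paragraph.</p>")
for p in extractor.paragraphs:
    print(p)
```

This version deliberately skips text outside <p> tags, such as the <div> content in the page above; handling that too would need extra rules.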
PDF Splitter Example
This example splits content from a PDF file into smaller chunks based on paragraphs.
from langchain.text_splitters import ParagraphSplitter
from PyPDF2 import PdfReader
# Example PDF file path
pdf_file_path = "example.pdf"
# Read PDF file
reader = PdfReader(pdf_file_path)
pdf_text = ""
for page in reader.pages:
pdf_text += page.extract_text()
# Split PDF content into paragraphs
pdf_splitter = ParagraphSplitter()
pdf_chunks = pdf_splitter.split_text(pdf_text)
# Output each chunk
for chunk in pdf_chunks:
print(chunk)
Step-by-Step Execution
Step 1: Import Modules
from langchain.text_splitters import ParagraphSplitter
from PyPDF2 import PdfReader
Output: No visible output; this imports the necessary libraries.
Step 2: Example PDF File Path
pdf_file_path = "example.pdf"
Output: This simply sets the file path of the PDF (example.pdf) as a string. Ensure that this file exists in your working directory.
Step 3: Read PDF File
reader = PdfReader(pdf_file_path)
pdf_text = ""
for page in reader.pages:
pdf_text += page.extract_text()
What Happens:
PdfReader reads the entire PDF file.
The for loop iterates over each page of the PDF and extracts the text using page.extract_text().
All extracted text is concatenated into a single string stored in pdf_text.
Example Output of pdf_text: If the PDF contains the following text:
Page 1:
This is the first paragraph on page one.
This is the second paragraph on page one.
Page 2:
This is the first paragraph on page two.
Then, pdf_text will contain:
This is the first paragraph on page one.
This is the second paragraph on page one.
This is the first paragraph on page two.
Step 4: Split PDF Content into Paragraphs
pdf_splitter = ParagraphSplitter()
pdf_chunks = pdf_splitter.split_text(pdf_text)
What Happens:
ParagraphSplitter identifies paragraph boundaries, typically based on newline characters (\n\n) or other patterns in the text.
The split_text method processes pdf_text and splits it into individual paragraphs (chunks).
Example Output of pdf_chunks:
[
"This is the first paragraph on page one.",
"This is the second paragraph on page one.",
"This is the first paragraph on page two."
]
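The boundary detection described above (splitting on blank lines) can be sketched directly with a regex; this is an assumption about how such a splitter typically works, not LangChain's actual code:

```python
import re

# Split on one or more blank lines, dropping empty chunks.
def split_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

pdf_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(split_paragraphs(pdf_text))
# → ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Be aware that extract_text() often loses blank lines between paragraphs, so in practice the separator may need tuning per document.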
Step 5: Output Each Chunk
for chunk in pdf_chunks:
print(chunk)
What Happens:
Each paragraph in pdf_chunks is printed line by line.
Example Output:
This is the first paragraph on page one.
This is the second paragraph on page one.
This is the first paragraph on page two.
Full Example Output: Assuming the PDF contains the two pages shown in Step 3, the complete script prints the three paragraphs listed above.
Character-based Splitter Example
This example splits text into chunks based on a specified character length.
from langchain_text_splitters import CharacterTextSplitter
# Example long text
long_text = "This is a long text that we want to split into smaller parts."
# Split text into chunks of 10 characters each
char_splitter = CharacterTextSplitter(separator="", chunk_size=10, chunk_overlap=0)
char_chunks = char_splitter.split_text(long_text)
# Output each chunk
for chunk in char_chunks:
print(chunk)
Output:
This is a
long text
that we wa
nt to spli
t into sma
ller parts
.
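With no overlap, fixed-size character chunking reduces to slicing; this stdlib sketch mirrors the chunk_size=10 run above:

```python
# Slice the text into consecutive chunk_size-character pieces.
def chunk_by_chars(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

for chunk in chunk_by_chars("This is a long text that we want to split into smaller parts.", 10):
    print(chunk)
```

Adding overlap would mean stepping by chunk_size - overlap instead of chunk_size, so the tail of each chunk repeats at the head of the next.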
Sentence Splitter Example
This example splits text into individual sentences using the SentenceSplitter.
from langchain.text_splitters import SentenceSplitter
# Example text
text = "This is the first sentence. This is the second sentence."
# Split text into sentences
sentence_splitter = SentenceSplitter()
sentence_chunks = sentence_splitter.split_text(text)
# Output each sentence
for chunk in sentence_chunks:
print(chunk)
Output:
This is the first sentence.
This is the second sentence.
Paragraph-based Splitter Example
This example splits a long text into paragraphs.
from langchain.text_splitters import ParagraphSplitter
# Example text with paragraphs
text = "This is paragraph one. It talks about something. \n\nThis is paragraph two. It talks about another thing."
# Split text into paragraphs
paragraph_splitter = ParagraphSplitter()
paragraph_chunks = paragraph_splitter.split_text(text)
# Output each paragraph
for chunk in paragraph_chunks:
print(chunk)
Output:
This is paragraph one. It talks about something.
This is paragraph two. It talks about another thing.
Markdown File Splitter Example
This example splits content from a Markdown file into sections.
from langchain.text_splitters import MarkdownSplitter
# Example markdown text
markdown_text = "# Section 1\nContent for section 1.\n\n# Section 2\nContent for section 2."
# Split markdown text into sections
markdown_splitter = MarkdownSplitter()
markdown_chunks = markdown_splitter.split_text(markdown_text)
# Output each section
for chunk in markdown_chunks:
print(chunk)
Output:
# Section 1
Content for section 1.
# Section 2
Content for section 2.
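MarkdownSplitter is illustrative (the published package offers MarkdownHeaderTextSplitter, whose output format differs); heading-based sectioning can be sketched with a regex that keeps each "#" heading attached to its body:

```python
import re

# Split before each top-level "# " heading that follows a blank line.
def split_markdown_sections(text: str) -> list[str]:
    sections = re.split(r"\n\s*\n(?=# )", text)
    return [s.strip() for s in sections if s.strip()]

markdown_text = "# Section 1\nContent for section 1.\n\n# Section 2\nContent for section 2."
for section in split_markdown_sections(markdown_text):
    print(section)
    print()
```

The lookahead (?=# ) keeps the heading inside its own section instead of discarding it with the separator.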
CSV Splitter Example
This example splits CSV content into rows and columns.
from langchain.text_splitters import CSVSplitter
# Example CSV content
csv_data = "Name, Age, Location\nAlice, 30, New York\nBob, 25, London"
# Split CSV content into rows and columns
csv_splitter = CSVSplitter()
csv_chunks = csv_splitter.split_text(csv_data)
# Output each row
for chunk in csv_chunks:
print(chunk)
Output:
['Name', 'Age', 'Location']
['Alice', '30', 'New York']
['Bob', '25', 'London']
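CSVSplitter is hypothetical; the stdlib csv module already yields exactly this row/column structure, and skipinitialspace handles the spaces after the commas:

```python
import csv
import io

csv_data = "Name, Age, Location\nAlice, 30, New York\nBob, 25, London"

# csv.reader parses each line into a list of fields;
# skipinitialspace=True drops the space following each delimiter.
reader = csv.reader(io.StringIO(csv_data), skipinitialspace=True)
for row in reader:
    print(row)
```

Using a real parser instead of str.split also handles quoted fields that contain commas.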
Text Chunking Example
This example splits a text based on a predefined number of words.
from langchain.text_splitters import WordTextSplitter
# Example text
text = "This is the first part of the text. This is the second part."
# Split text into chunks of 4 words
word_splitter = WordTextSplitter(chunk_size=4)
word_chunks = word_splitter.split_text(text)
# Output each chunk
for chunk in word_chunks:
print(chunk)
Output:
This is the first
part of the text.
This is the second
part.
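WordTextSplitter is illustrative; chunking by word count is a short stdlib exercise: split on whitespace, group n tokens at a time, and rejoin:

```python
# Group whitespace-split tokens n at a time and rejoin with spaces.
def chunk_by_words(text: str, n: int) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

for chunk in chunk_by_words("This is the first part of the text. This is the second part.", 4):
    print(chunk)
```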
HTML Splitter Example
This example splits HTML content based on specific tags.
from langchain.text_splitters import HTMLSplitter
# Example HTML content
html_content = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
# Split HTML content into paragraphs
html_splitter = HTMLSplitter(tag="p")
html_chunks = html_splitter.split_text(html_content)
# Output each chunk
for chunk in html_chunks:
print(chunk)
Output:
This is a paragraph.
This is another paragraph.
XML Splitter Example
This example splits XML content based on specific tags.
from langchain.text_splitters import XMLSplitter
# Example XML content
xml_content = "<item><name>Item 1</name><price>10</price></item><item><name>Item 2</name><price>20</price></item>"
# Split XML content into items
xml_splitter = XMLSplitter(tag="item")
xml_chunks = xml_splitter.split_text(xml_content)
# Output each chunk
for chunk in xml_chunks:
print(chunk)
Output:
<item><name>Item 1</name><price>10</price></item>
<item><name>Item 2</name><price>20</price></item>
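XMLSplitter is a stand-in; stdlib ElementTree can extract each <item> element once the fragment is wrapped in a single root (as given, the fragment has two top-level elements and would not parse on its own):

```python
import xml.etree.ElementTree as ET

xml_content = ("<item><name>Item 1</name><price>10</price></item>"
               "<item><name>Item 2</name><price>20</price></item>")

# Wrap the fragment so it has one root, then pull out each <item>.
root = ET.fromstring(f"<root>{xml_content}</root>")
for item in root.findall("item"):
    print(ET.tostring(item, encoding="unicode"))
```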
Log File Splitter Example
This example splits a log file into individual log entries.
from langchain.text_splitters import LogFileSplitter
# Example log content
log_content = "2024-11-12 10:00:00 INFO User logged in.\n2024-11-12 10:05:00 ERROR Database connection failed."
# Split log content into entries
log_splitter = LogFileSplitter(separator="\n")
log_chunks = log_splitter.split_text(log_content)
# Output each log entry
for chunk in log_chunks:
print(chunk)
Output:
2024-11-12 10:00:00 INFO User logged in.
2024-11-12 10:05:00 ERROR Database connection failed.
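Splitting on "\n" works when every entry is one line; a timestamp lookahead is a common refinement that keeps multi-line entries (such as stack traces) attached to the line that starts them. A stdlib sketch, with the timestamp format assumed from the sample above:

```python
import re

# Split only at newlines that are followed by a "YYYY-MM-DD HH:MM:SS" stamp,
# so continuation lines stay inside their entry.
def split_log_entries(log: str) -> list[str]:
    pattern = r"\n(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    return [e.strip() for e in re.split(pattern, log) if e.strip()]

log_content = ("2024-11-12 10:00:00 INFO User logged in.\n"
               "2024-11-12 10:05:00 ERROR Database connection failed.\n"
               "Traceback (most recent call last): ...")
for entry in split_log_entries(log_content):
    print(entry)
```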
CSV Row Splitter Example
This example splits CSV content row-wise into individual document entries.
from langchain.text_splitters import CSVRowSplitter
# Example CSV content
csv_data = "Name, Age\nAlice, 30\nBob, 25"
# Split CSV data into rows
csv_row_splitter = CSVRowSplitter()
csv_rows = csv_row_splitter.split_text(csv_data)
# Output each row
for row in csv_rows:
print(row)
Output:
['Alice', '30']
['Bob', '25']
Text Chunking by Paragraph Example
This example demonstrates splitting large text into paragraphs based on a chunk size.
from langchain.text_splitters import ParagraphTextSplitter
# Example text
large_text = "First paragraph. Second paragraph. Third paragraph."
# Split text into paragraphs
paragraph_splitter = ParagraphTextSplitter(chunk_size=20)
paragraph_chunks = paragraph_splitter.split_text(large_text)
# Output each paragraph chunk
for chunk in paragraph_chunks:
print(chunk)
Speech-to-Text Splitter Example
This example shows how to split transcribed speech data into sentences.
from langchain.text_splitters import SentenceSplitter
# Example transcribed speech text
speech_text = "Hello. How are you? I am good."
# Split speech text into sentences
speech_splitter = SentenceSplitter()
speech_chunks = speech_splitter.split_text(speech_text)
# Output each sentence
for chunk in speech_chunks:
print(chunk)
Sentiment Analysis Text Chunk Example
This example demonstrates splitting text for sentiment analysis based on sentence boundaries.
from langchain.text_splitters import SentenceSplitter
# Example text for sentiment analysis
text = "I love this product. It works great!"
# Split into sentences
sentence_splitter = SentenceSplitter()
sentences = sentence_splitter.split_text(text)
# Output sentences for sentiment analysis
for sentence in sentences:
print(sentence)
Redaction Text Splitter Example
This example shows splitting sensitive data for redaction.
from langchain.text_splitters import SentenceSplitter
# Example text with sensitive information
sensitive_text = "John Doe lives at 1234 Elm St. His phone number is 123-456-7890."
# Split into sentences for redaction
splitter = SentenceSplitter()
sentences = splitter.split_text(sensitive_text)
# Output sentences for redaction
for sentence in sentences:
print(sentence)
Text Preprocessing Splitter Example
This example splits raw text into small word chunks as a preprocessing step for NLP tasks.
from langchain.text_splitters import WordTextSplitter
# Example raw data
raw_data = "Text: This is some raw data; we need to clean it up!"
# Split text into chunks of 3 words
preprocessor = WordTextSplitter(chunk_size=3)
clean_chunks = preprocessor.split_text(raw_data)
# Output clean chunks
for chunk in clean_chunks:
print(chunk)