Efficiently Convert JSON, XML, and CSV Data into Documents for NLP with LangChain

Advantages of document-form conversion

Converting your data into document form for use with LangChain offers several advantages, particularly when dealing with large datasets or performing tasks like information retrieval, summarization, question answering, and content generation. Here are the key benefits:

  1. Modular Processing
     Chunking: Breaking your data into smaller, manageable pieces (documents) lets LangChain process the information more effectively. Large JSON payloads or long texts are hard to handle in one pass; dividing them into chunks lets you process each part separately.
     Memory Efficiency: Working with smaller chunks reduces the load on system resources, which matters most with large datasets.
  2. Improved Text Retrieval
     Optimized Search: When your data is structured as documents, LangChain can efficiently search and retrieve relevant chunks based on keywords, context, or queries, and you can index the chunks to make searches faster and more accurate.
     Relevance: Each document is more likely to be relevant to a specific question, so LangChain can pull contextually relevant chunks and generate a more accurate response.
  3. Fine-Tuned Language Models
     Focused Processing: LangChain can process each chunk independently, producing more focused, context-aware responses. With all the data in one large text block, the model can struggle to keep track of which piece of data belongs to which context.
     Contextual Understanding: Smaller documents let the model grasp the context of each chunk, improving performance on tasks like summarization, question answering, and content generation.
  4. Ease of Manipulation and Transformation
     Document-Level Metadata: Each document can carry metadata such as its source, chunk number, or a timestamp, which you can use to track transformations, filter documents, or score relevance to specific queries (see the Document sketch after this section).
     Flexibility: After splitting, you can manipulate each chunk individually: summarize it, extract specific details, or apply different models depending on its content or structure.
     Custom Transformations: Different chunks may need different NLP treatments; one might require entity recognition while another calls for sentiment analysis.
  5. Scalability
     Parallel Processing: Smaller documents can be processed in parallel, which speeds up operations on large datasets in distributed environments.
     Efficient Storage: Each document can be stored and indexed separately in a database, allowing more efficient querying, especially in systems built around large language models or search engines.
  6. Better Handling of Long Texts
     Avoiding Length Limitations: Most large language models (including those used with LangChain) have a maximum token limit. Splitting your data into smaller documents ensures no individual piece exceeds that limit, so nothing is truncated.
  7. Enhanced Fine-Grained Control
     Detailed Responses: LangChain can retrieve the specific chunks most relevant to a user's query, leading to more granular and relevant answers.
     Fine-Grained Adjustments: Individual documents can be refined or reprocessed without touching the rest of the dataset.
  8. Easier Integration with Other Systems
     Interoperability: Documents integrate easily with existing text-processing systems such as document search engines, databases, and analytics platforms; LangChain's document-based structure aligns with much of the NLP ecosystem.
     Custom Pipelines: You can design document-level pipelines for summarization, extraction, or transformation, applying different models or processing steps to individual documents as needed.

Example Use Cases with LangChain

Question-Answering: If you want to ask specific questions about a large dataset, converting the data into documents makes it easier to retrieve the relevant pieces of information.

Example: For a question like "What are the features of Item 2?", LangChain can search through the documents to find the chunk related to "Item 2" and provide an accurate answer.
Summarization: If your dataset is extensive and you need to summarize parts of it, chunking it into documents allows you to summarize each chunk individually before combining them into a final summary.

Text Classification: When you need to classify or tag parts of the data, you can use the document form to classify each chunk and later aggregate the results.

Content Generation: If you're generating new content based on structured data (such as JSON), chunking the original data into smaller documents makes it easier to generate content that is contextually relevant and coherent.

In summary, converting data into a document format with LangChain gives you greater flexibility, efficiency, and control when performing tasks such as information retrieval, summarization, and manipulation. It ensures that each chunk of your data can be processed and handled independently, improving the overall performance and scalability of your system.
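
Before moving to the splitter examples, here is a minimal sketch of what "document form" actually looks like in code. It hand-builds LangChain Document objects and attaches chunk-level metadata; the metadata keys (source, chunk) are illustrative choices, not a LangChain requirement.

# Minimal sketch: hand-building LangChain Documents with metadata.
# Assumes `pip install langchain-core`; the metadata keys are arbitrary.
from langchain_core.documents import Document

chunks = [
    "LangChain is a powerful framework.",
    "It simplifies NLP tasks.",
]

documents = [
    Document(page_content=chunk, metadata={"source": "intro.txt", "chunk": i})
    for i, chunk in enumerate(chunks)
]

for doc in documents:
    print(doc.page_content, doc.metadata)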

Text Splitting (Sentence-based Splitter)

from langchain.text_splitters import SentenceSplitter

text = "LangChain is a powerful framework. It simplifies NLP tasks. Learn it with examples."
splitter = SentenceSplitter()
sentences = splitter.split(text)
documents = splitter.create_documents(texts=sentences)

for doc in documents[:3]:
    print(doc)

Output: a list of Document objects, one per sentence:
Document: LangChain is a powerful framework.
Document: It simplifies NLP tasks.
Document: Learn it with examples.
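
A caveat before the remaining examples: splitter class names such as SentenceSplitter, WordSplitter, and FixedSizeSplitter are used here to illustrate the split-then-create_documents pattern, and may not match the classes your installed LangChain version actually exposes. A runnable equivalent of sentence-level splitting, using the real CharacterTextSplitter, might look like this (assuming the langchain-text-splitters package is installed):

# Runnable sketch with the real CharacterTextSplitter.
from langchain_text_splitters import CharacterTextSplitter

text = "LangChain is a powerful framework. It simplifies NLP tasks. Learn it with examples."

# Split on sentence boundaries (". ") and keep chunks small enough that
# each sentence lands in its own Document.
splitter = CharacterTextSplitter(separator=". ", chunk_size=40, chunk_overlap=0)
documents = splitter.create_documents([text])

for doc in documents:
    print(doc.page_content)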

Text Splitting (Word-based Splitter)

from langchain.text_splitters import WordSplitter

text = "LangChain helps you handle NLP tasks efficiently."
splitter = WordSplitter()
words = splitter.split(text)
documents = splitter.create_documents(texts=words)

for doc in documents[:3]:
    print(doc)

Output:

Document: LangChain
Document: helps
Document: you

Text Splitting (Character-based Splitter)

from langchain.text_splitters import CharacterSplitter

text = "LangChain Framework"
splitter = CharacterSplitter()
characters = splitter.split(text)
documents = splitter.create_documents(texts=characters)

for doc in documents[:3]:
    print(doc)

Output:

Document: L
Document: a
Document: n

JSON Splitting (Recursive JSON Splitter)

from langchain_text_splitters import RecursiveJsonSplitter

json_data = {
    "title": "LangChain Tutorial",
    "sections": [
        {"heading": "Introduction", "content": "LangChain simplifies NLP."},
        {"heading": "Installation", "content": "Install LangChain with pip."}
    ]
}
splitter = RecursiveJsonSplitter(max_chunk_size=100)
json_chunks = splitter.split_json(json_data)
documents = splitter.create_documents(texts=json_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: {"title": "LangChain Tutorial"}
Document: {"heading": "Introduction", "content": "LangChain simplifies NLP."}
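
A note on this example: in the real langchain-text-splitters package, RecursiveJsonSplitter.split_json returns a list of dicts, and create_documents serializes each chunk to a JSON string as the Document's page_content, which is why the printed documents look like JSON fragments. If you need plain strings yourself, serialization is a one-liner:

# Each chunk is a plain dict; serialize it yourself if you need strings.
import json
chunk_strings = [json.dumps(chunk) for chunk in json_chunks]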

Unstructured Text Splitting (Line-based Splitter)

from langchain.text_splitters import LineSplitter

text = """LangChain is an NLP framework.
It helps with task automation.
You can learn it via tutorials."""
splitter = LineSplitter()
lines = splitter.split(text)
documents = splitter.create_documents(texts=lines)

for doc in documents[:2]:
    print(doc)

Output:

Document: LangChain is an NLP framework.
Document: It helps with task automation.

Unstructured Text Splitting (Paragraph-based Splitter)

from langchain.text_splitters import ParagraphSplitter

text = """LangChain is an NLP framework.
It helps you automate tasks in NLP applications.

It supports various NLP models."""
splitter = ParagraphSplitter()
paragraphs = splitter.split(text)
documents = splitter.create_documents(texts=paragraphs)

for doc in documents[:2]:
    print(doc)

Output:

Document: LangChain is an NLP framework. It helps you automate tasks in NLP applications.
Document: It supports various NLP models.
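
If you want the same paragraph-level behavior from the real API, a reasonable sketch (assuming paragraphs are separated by blank lines) uses CharacterTextSplitter with "\n\n" as the separator:

# Hedged sketch: paragraph splitting with the real CharacterTextSplitter.
from langchain_text_splitters import CharacterTextSplitter

text = """LangChain is an NLP framework.
It helps you automate tasks in NLP applications.

It supports various NLP models."""

# "\n\n" treats blank lines as paragraph boundaries.
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=200, chunk_overlap=0)
documents = splitter.create_documents([text])

for doc in documents:
    print(doc.page_content)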

URL Splitting (Using BeautifulSoup)

import requests
from langchain.text_splitters import HtmlSplitter
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
splitter = HtmlSplitter()
html_chunks = splitter.split(str(soup))
documents = splitter.create_documents(texts=html_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: <html><head><title>Example Domain</title></head><body>...</body></html>
Document: <p>This domain is established to be used for illustrative examples...</p>
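
For a version of the HTML example that runs against the real library, LangChain ships an HTMLHeaderTextSplitter that splits on heading tags and returns Documents directly. The headers_to_split_on list below is just an example choice:

# Hedged sketch using the real HTMLHeaderTextSplitter
# (from the langchain-text-splitters package).
import requests
from langchain_text_splitters import HTMLHeaderTextSplitter

url = "https://example.com"
html = requests.get(url).text

# Start a new chunk whenever an <h1> or <h2> opens a new section.
splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
)
documents = splitter.split_text(html)  # returns a list of Documents

for doc in documents[:2]:
    print(doc.page_content, doc.metadata)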

PDF Splitting (Using PyMuPDF)

import fitz  # PyMuPDF
from langchain.text_splitters import TextSplitter

pdf_file = "sample.pdf"
doc = fitz.open(pdf_file)
text = ""
for page in doc:
    text += page.get_text()

splitter = TextSplitter()
pdf_chunks = splitter.split(text)
documents = splitter.create_documents(texts=pdf_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: This is the first page of the PDF document.
Document: Here's the second page with more content.
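
To make the PDF example runnable end to end, the illustrative TextSplitter can be swapped for the real RecursiveCharacterTextSplitter; the chunk sizes below are arbitrary choices:

# Hedged sketch: chunking extracted PDF text with the real
# RecursiveCharacterTextSplitter ("sample.pdf" is a placeholder path).
import fitz  # PyMuPDF
from langchain_text_splitters import RecursiveCharacterTextSplitter

pdf_text = ""
with fitz.open("sample.pdf") as pdf:
    for page in pdf:
        pdf_text += page.get_text()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.create_documents([pdf_text])

for doc in documents[:2]:
    print(doc.page_content[:80])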

Text Splitting (Paragraph-based Splitter with Custom Separator)

from langchain.text_splitters import ParagraphSplitter

text = """First paragraph.
Second paragraph follows here.
Third paragraph continues."""
splitter = ParagraphSplitter(separator="\n")
paragraphs = splitter.split(text)
documents = splitter.create_documents(texts=paragraphs)

for doc in documents[:2]:
    print(doc)

Output:

Document: First paragraph.
Document: Second paragraph follows here.

Text Splitting (Custom Chunk Size Splitter)

from langchain.text_splitters import FixedSizeSplitter

text = "This is a sample text. LangChain makes NLP easy. Split text into chunks for processing."
splitter = FixedSizeSplitter(chunk_size=30)
chunks = splitter.split(text)
documents = splitter.create_documents(texts=chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: This is a sample text. LangCha
Document: in makes NLP easy. Split text
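
FixedSizeSplitter is another illustrative name; exact fixed-size chunking is simple enough to sketch in plain Python and wrap in Documents yourself:

# Hedged sketch: exact fixed-size chunking without any splitter class.
from langchain_core.documents import Document

text = "This is a sample text. LangChain makes NLP easy. Split text into chunks for processing."
chunk_size = 30

# Slice the string into consecutive 30-character pieces.
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
documents = [Document(page_content=c) for c in chunks]

for doc in documents[:2]:
    print(doc.page_content)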

CSV Data to Document (Using CSV Splitter)

import pandas as pd
from langchain.text_splitters import CsvSplitter

csv_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35]
})

splitter = CsvSplitter()
csv_chunks = splitter.split(csv_data)
documents = splitter.create_documents(texts=csv_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: {'name': 'Alice', 'age': 25}
Document: {'name': 'Bob', 'age': 30}
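
With the real library, the closest equivalent to a CSV splitter is a document loader. DataFrameLoader (from the langchain-community package) turns each DataFrame row into one Document, with the remaining columns stored as metadata; choosing "name" as page_content_column is an assumption for this data:

# Hedged sketch using the real DataFrameLoader.
import pandas as pd
from langchain_community.document_loaders import DataFrameLoader

csv_data = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# One Document per row: "name" becomes page_content, "age" goes to metadata.
loader = DataFrameLoader(csv_data, page_content_column="name")
documents = loader.load()

for doc in documents[:2]:
    print(doc.page_content, doc.metadata)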

Text Splitting (Paragraph-based Splitter with Regex)

import re
from langchain.text_splitters import ParagraphSplitter

text = """First paragraph with some content.
Second paragraph includes more details.
Third paragraph at the end of this text."""

splitter = ParagraphSplitter(separator=re.compile(r"\n"))
paragraphs = splitter.split(text)
documents = splitter.create_documents(texts=paragraphs)

for doc in documents[:2]:
    print(doc)

Output:

Document: First paragraph with some content.
Document: Second paragraph includes more details.

XML Data to Document (Using XML Splitter)

from langchain.text_splitters import XmlSplitter
from lxml import etree

xml_data = """
<root>
  <item><title>Item 1</title><description>Details of item 1</description></item>
  <item><title>Item 2</title><description>Details of item 2</description></item>
</root>
"""
tree = etree.fromstring(xml_data)
splitter = XmlSplitter()
xml_chunks = splitter.split(tree)
documents = splitter.create_documents(texts=xml_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: <item><title>Item 1</title><description>Details of item 1</description></item>
Document: <item><title>Item 2</title><description>Details of item 2</description></item>
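
XmlSplitter is likewise illustrative; plain lxml achieves the same per-item chunking if you serialize each element yourself:

# Hedged sketch: per-element XML chunking with plain lxml.
from langchain_core.documents import Document
from lxml import etree

xml_data = """
<root>
  <item><title>Item 1</title><description>Details of item 1</description></item>
  <item><title>Item 2</title><description>Details of item 2</description></item>
</root>
"""

tree = etree.fromstring(xml_data.strip())  # strip leading whitespace before parsing
documents = [
    Document(page_content=etree.tostring(item, encoding="unicode").strip())
    for item in tree.findall("item")
]

for doc in documents[:2]:
    print(doc.page_content)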


Text and JSON Combined (Nested Structure)

from langchain_text_splitters import RecursiveJsonSplitter

json_data = {
    "title": "LangChain Examples",
    "details": [
        {"section": "Introduction", "content": "LangChain simplifies NLP."},
        {"section": "Setup", "content": "Install LangChain easily."}
    ]
}

splitter = RecursiveJsonSplitter(max_chunk_size=100)
chunks = splitter.split_json(json_data)
documents = splitter.create_documents(texts=chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: {"title": "LangChain Examples"}
Document: {"section": "Introduction", "content": "LangChain simplifies NLP."}

Handling Large Text Files (Line Splitting)

from langchain.text_splitters import LineSplitter

text = "This is a long text.\nEach line will be a separate document."
splitter = LineSplitter()
lines = splitter.split(text)
documents = splitter.create_documents(texts=lines)

for doc in documents[:2]:
    print(doc)

Output:

Document: This is a long text.
Document: Each line will be a separate document.
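
For genuinely large files you usually do not want the whole text in memory at once. A plain-Python sketch that streams a file line by line into Documents (the file name is a placeholder):

# Hedged sketch: streaming a large file into per-line Documents.
from langchain_core.documents import Document

documents = []
with open("large_text_file.txt", encoding="utf-8") as f:
    for line_no, line in enumerate(f):  # reads one line at a time
        line = line.strip()
        if line:  # skip blank lines
            documents.append(Document(page_content=line, metadata={"line": line_no}))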

Splitting Based on Regex Patterns

from langchain.text_splitters import RegexSplitter

text = "Name: Alice Age: 25 Name: Bob Age: 30"
splitter = RegexSplitter(pattern=r"Name:\s*[\w]+")
chunks = splitter.split(text)
documents = splitter.create_documents(texts=chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: Age: 25
Document: Age: 30
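
The same behavior falls out of the standard library: re.split drops the matched "Name: ..." delimiters, leaving the "Age" fragments, which can then be wrapped in Documents:

# Hedged sketch: regex-based splitting with plain `re`.
import re
from langchain_core.documents import Document

text = "Name: Alice Age: 25 Name: Bob Age: 30"

# Split on each "Name: <word>" delimiter; empty fragments are discarded.
fragments = [part.strip() for part in re.split(r"Name:\s*\w+", text) if part.strip()]
documents = [Document(page_content=frag) for frag in fragments]

for doc in documents[:2]:
    print(doc.page_content)  # "Age: 25", then "Age: 30"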

PDF Content to Document (Text Splitting)

import PyPDF2
from langchain.text_splitters import TextSplitter

with open("document.pdf", "rb") as file:
    pdf_reader = PyPDF2.PdfReader(file)
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text() or ""  # extract_text() can return None

splitter = TextSplitter()
pdf_chunks = splitter.split(text)
documents = splitter.create_documents(texts=pdf_chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: Page 1 content...
Document: Page 2 content...

Splitting Text for NLP (Custom Chunk Sizes)

from langchain.text_splitters import FixedSizeSplitter

text = "LangChain allows you to split text into fixed-size chunks for NLP tasks."
splitter = FixedSizeSplitter(chunk_size=40)
chunks = splitter.split(text)
documents = splitter.create_documents(texts=chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: LangChain allows you to split text
Document: into fixed-size chunks for NLP tasks.

Text Splitting Using Character Length (Fixed Size)

from langchain.text_splitters import FixedSizeSplitter

text = "LangChain Framework is powerful. It simplifies NLP tasks."
splitter = FixedSizeSplitter(chunk_size=25)
chunks = splitter.split(text)
documents = splitter.create_documents(texts=chunks)

for doc in documents[:2]:
    print(doc)

Output:

Document: LangChain Framework is
Document: powerful. It simplifies

SUMMARY

Advantages of document-form conversion: modular processing, improved text retrieval, fine-tuned language models, ease of manipulation and transformation, scalability, better handling of long texts, fine-grained control, and easier integration with other systems.

Text Splitting (Sentence-based Splitter):
SentenceSplitter().create_documents(texts=SentenceSplitter().split(text))
Output: a list of Document objects, one per sentence.

Word- and character-based splitting:
WordSplitter().create_documents(texts=WordSplitter().split(text))
CharacterSplitter().create_documents(texts=CharacterSplitter().split(text))

JSON splitting: the sample JSON object holds one string field (title) plus an array of sections, each with a heading and a content field.
RecursiveJsonSplitter(max_chunk_size=100).create_documents(texts=RecursiveJsonSplitter(max_chunk_size=100).split_json(json_data))

Unstructured data:
LineSplitter().create_documents(texts=LineSplitter().split(text))
# Remove unnecessary spaces and tokenize first if needed:
cleaned_text = " ".join(word_tokenize(raw_text))
ParagraphSplitter().create_documents(texts=ParagraphSplitter().split(text))

URL Splitting (Using BeautifulSoup):
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
HtmlSplitter().create_documents(texts=HtmlSplitter().split(str(soup)))

PDF Splitting (Using PyMuPDF): open the file with fitz.open(pdf_file), accumulate page.get_text() across pages, then
TextSplitter().create_documents(texts=TextSplitter().split(text))

Paragraph splitter with a custom separator:
ParagraphSplitter(separator="\n").create_documents(texts=ParagraphSplitter(separator="\n").split(text))

CSV Data to Document (Using CSV Splitter):
csv_data = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]})
CsvSplitter().create_documents(texts=CsvSplitter().split(csv_data))

Text Splitting (Paragraph-based Splitter with Regex):
ParagraphSplitter(separator=re.compile(r"\n")).create_documents(texts=ParagraphSplitter(separator=re.compile(r"\n")).split(text))

Splitting Based on Regex Patterns:
RegexSplitter(pattern=r"Name:\s*[\w]+").create_documents(texts=RegexSplitter(pattern=r"Name:\s*[\w]+").split(text))

XML Data to Document (Using XML Splitter):
tree = etree.fromstring(xml_data)
XmlSplitter().create_documents(texts=XmlSplitter().split(tree))
