Debug School

rakesh kumar


Web Content Extraction with HTMLHeaderTextSplitter Using the LangChain Framework

Extract data from HTML data
Extract data from a web URL
How to extract metadata from a web URL

Extract data from HTML data

Example 1: Basic Header Splitting (h1, h2)

from langchain.text_splitter import HTMLHeaderTextSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Title</h1>
<p>This is the introduction.</p>
<h2>Section 1</h2>
<p>Details about section 1.</p>
<h2>Section 2</h2>
<p>Details about section 2.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

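Output (illustrative; split_text actually returns a list of LangChain Document objects, with the text in page_content and the matched header labels in metadata):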

[
    "<h1>Main Title</h1>\n<p>This is the introduction.</p>",
    "<h2>Section 1</h2>\n<p>Details about section 1.</p>",
    "<h2>Section 2</h2>\n<p>Details about section 2.</p>"
]
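To make the idea of header-based splitting concrete, here is a minimal standard-library sketch of the technique. It is not LangChain's implementation: the HeaderSplitter class and the split_html function are hypothetical names invented for this illustration, and the dictionaries it returns only mimic the page_content/metadata shape of a LangChain Document.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Toy header-aware splitter (illustration only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}       # header tag -> text of the header currently in effect
        self.in_header = None  # header tag being read right now, if any
        self.buffer = []       # body text collected for the current chunk
        self.chunks = []

    def flush(self):
        """Emit the buffered text as one chunk tagged with the active headers."""
        text = " ".join(self.buffer).strip()
        if text:
            metadata = {self.labels[t]: v for t, v in self.active.items()}
            self.chunks.append({"page_content": text, "metadata": metadata})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self.flush()
            # A new header replaces itself and every deeper level ("h1" < "h2" < ...).
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_header:
            self.active[self.in_header] = data
        else:
            self.buffer.append(data)


def split_html(html_text, headers_to_split_on):
    parser = HeaderSplitter(headers_to_split_on)
    parser.feed(html_text)
    parser.close()
    parser.flush()
    return parser.chunks


html_text = (
    "<h1>Main Title</h1><p>This is the introduction.</p>"
    "<h2>Section 1</h2><p>Details about section 1.</p>"
)
chunks = split_html(html_text, [("h1", "Header 1"), ("h2", "Header 2")])
for chunk in chunks:
    print(chunk)
```

Each chunk carries the headers it sits under, which is what makes header-based splits useful for retrieval: the section context travels with the text.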

Example 2: Nested Header Splitting (h1, h2, h3)

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Title</h1>
<p>Introductory paragraph.</p>
<h2>Subsection A</h2>
<p>Details about subsection A.</p>
<h3>Subtopic A1</h3>
<p>Information on subtopic A1.</p>
<h3>Subtopic A2</h3>
<p>Information on subtopic A2.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

Output:

[
    "<h1>Main Title</h1>\n<p>Introductory paragraph.</p>",
    "<h2>Subsection A</h2>\n<p>Details about subsection A.</p>",
    "<h3>Subtopic A1</h3>\n<p>Information on subtopic A1.</p>",
    "<h3>Subtopic A2</h3>\n<p>Information on subtopic A2.</p>"
]

Example 3: Large Chunks with Overlap (h1, h2, Overlap)

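from langchain.text_splitter import RecursiveCharacterTextSplitter

# HTMLHeaderTextSplitter does not accept chunk_size or chunk_overlap,
# so split on headers first and then apply a size-based splitter.
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

html_text = """
<h1>Main Topic</h1>
<p>This is a very long introductory paragraph...</p>
<h2>Part 1</h2>
<p>Content for part 1...</p>
<h2>Part 2</h2>
<p>Content for part 2...</p>
"""

html_header_splits = html_splitter.split_text(html_text)
# Enforce the size limit while keeping the header metadata on each chunk
html_header_splits = text_splitter.split_documents(html_header_splits)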

Output:


[
    "<h1>Main Topic</h1>\n<p>This is a very long introductory paragraph...</p>",
    "<h2>Part 1</h2>\n<p>Content for part 1...</p>",
    "<h2>Part 2</h2>\n<p>Content for part 2...</p>"
]
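The meaning of chunk size and overlap can be sketched with plain character windows. This is a simplified illustration under stated assumptions, not how LangChain's splitters choose boundaries (they prefer natural separators such as paragraphs and spaces); chunk_with_overlap is a hypothetical helper name.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.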

Example 4: Detailed Sectioning with h4 Headers

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Title</h1>
<p>Introduction text.</p>
<h2>Section 1</h2>
<h3>Subsection 1.1</h3>
<h4>Subsubsection 1.1.1</h4>
<p>Details on subsubsection 1.1.1.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

Output:

[
    "<h1>Main Title</h1>\n<p>Introduction text.</p>",
    "<h2>Section 1</h2>",
    "<h3>Subsection 1.1</h3>",
    "<h4>Subsubsection 1.1.1</h4>\n<p>Details on subsubsection 1.1.1.</p>"
]

Example 5: Excluding Specific Headers

headers_to_split_on = [("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Title</h1>
<p>This section is excluded as per header rules.</p>
<h2>Included Section</h2>
<p>Content for the included section.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

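Output (illustrative; headers missing from headers_to_split_on are not used as split points or metadata keys, so depending on the library version the text before the first listed header may still be returned rather than dropped):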

[
    "<h2>Included Section</h2>\n<p>Content for the included section.</p>"
]

Example 6: Split by Deeply Nested Headers (h1, h2, h3, h4, h5)

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4"), ("h5", "Header 5")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Title</h1>
<h2>Section 1</h2>
<h3>Subsection 1.1</h3>
<h4>Topic 1.1.1</h4>
<h5>Detail 1.1.1.1</h5>
<p>Final detailed content.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

Output:

[
    "<h1>Main Title</h1>",
    "<h2>Section 1</h2>",
    "<h3>Subsection 1.1</h3>",
    "<h4>Topic 1.1.1</h4>",
    "<h5>Detail 1.1.1.1</h5>\n<p>Final detailed content.</p>"
]

Example 7: Small Chunks with Short Overlap

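from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap belong to the size-based splitter, not to HTMLHeaderTextSplitter
# (headers_to_split_on is reused from the previous example)
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

html_text = """
<h1>Overview</h1>
<p>This is a very brief overview.</p>
<h2>Details</h2>
<p>Some additional details here.</p>
"""

# Split on headers first, then cap the size of each resulting chunk
html_header_splits = text_splitter.split_documents(html_splitter.split_text(html_text))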

Output:

[
    "<h1>Overview</h1>\n<p>This is a very brief overview.</p>",
    "<h2>Details</h2>\n<p>Some additional details here.</p>"
]

Example 8: Use with HTML from Web Page

url = "https://example.com"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Output:

[
    "<h1>Main Header from Web Page</h1>\n<p>Introductory paragraph.</p>",
    "<h2>Section 1</h2>\n<p>Details about section 1.</p>",
    "<h3>Subsection 1.1</h3>\n<p>Information on subsection 1.1.</p>"
]
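If you want control over timeouts and decoding, one option is to download the page yourself and pass the text to split_text. The following is a minimal sketch using only the standard library; fetch_html is a hypothetical helper name, and the data: URL is used purely so the demonstration runs without network access.

```python
import urllib.request
from urllib.parse import quote


def fetch_html(url, timeout=10):
    """Download a page and decode it to text, failing fast on slow servers."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Offline demonstration: urllib resolves data: URLs locally.
demo_url = "data:text/html," + quote("<h1>Main Title</h1><p>Intro.</p>")
html_text = fetch_html(demo_url)
print(html_text)
```

The returned string can then be handed to html_splitter.split_text(html_text) in place of calling split_text_from_url. Some sites reject requests without a browser-like User-Agent header, so a real page may need extra request headers.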

Example 9: Split by Only h2 Headers (Exclude h1)

headers_to_split_on = [("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Excluded Title</h1>
<p>This section is not included in output.</p>
<h2>Included Section</h2>
<p>Content for the included section.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

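Output (illustrative; because h1 is not in headers_to_split_on it is not used as a split point or metadata key, but depending on the library version its text may still appear in the returned content):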

[
    "<h2>Included Section</h2>\n<p>Content for the included section.</p>"
]

Example 10: Complex Document with All Headers

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_text = """
<h1>Main Topic</h1>
<p>Introductory paragraph.</p>
<h2>First Section</h2>
<p>Overview of the first section.</p>
<h3>Subsection 1.1</h3>
<p>Details of subsection 1.1</p>
<h4>Subtopic 1.1.1</h4>
<p>Additional content here.</p>
"""

html_header_splits = html_splitter.split_text(html_text)

Output:

[
    "<h1>Main Topic</h1>\n<p>Introductory paragraph.</p>",
    "<h2>First Section</h2>\n<p>Overview of the first section.</p>",
    "<h3>Subsection 1.1</h3>\n<p>Details of subsection 1.1</p>",
    "<h4>Subtopic 1.1.1</h4>\n<p>Additional content here.</p>"
]

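These examples show how HTMLHeaderTextSplitter can be adapted to different splitting needs by changing which headers it splits on. Chunk size and overlap are not settings of HTMLHeaderTextSplitter itself; they are controlled by chaining a size-based splitter such as RecursiveCharacterTextSplitter after it. Each configuration produces structured output organized by the HTML headings.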

Extract data from a web URL

Example 1: Basic Split with Main Headers (h1, h2)

from langchain.text_splitter import HTMLHeaderTextSplitter

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output (illustrative; the real content depends on the live page at the time of the request):

[
    "<h1>Artificial Intelligence</h1>\n<p>Introduction to AI...</p>",
    "<h2>History</h2>\n<p>Overview of AI's history...</p>",
    "<h2>Applications</h2>\n<p>AI applications include...</p>"
]

Example 2: Deep Split Including Subsections (h1, h2, h3)

url = "https://www.who.int/news-room/fact-sheets/detail/climate-change-and-health"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>Climate Change and Health</h1>\n<p>Introduction to climate change...</p>",
    "<h2>Overview</h2>\n<p>Key points on climate change and health...</p>",
    "<h3>Impact on Health</h3>\n<p>Details on health impacts...</p>"
]

Example 3: Full Document Structure with Detailed Subsections (h1, h2, h3, h4)

url = "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Introduction"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>JavaScript Guide</h1>\n<p>Introductory content on JavaScript...</p>",
    "<h2>Overview</h2>\n<p>Introduction to JavaScript basics...</p>",
    "<h3>Features</h3>\n<p>JavaScript's core features...</p>",
    "<h4>Data Types</h4>\n<p>Explanation of JavaScript data types...</p>"
]

Example 4: Large Chunks with Overlap (h1, h2, with Overlap)

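from langchain.text_splitter import RecursiveCharacterTextSplitter

url = "https://www.un.org/en/about-us/history-of-the-un"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
# Split on headers first; chunk_size and chunk_overlap are applied by a separate size-based splitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)

html_header_splits = text_splitter.split_documents(html_splitter.split_text_from_url(url))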

Expected Output:

[
    "<h1>History of the United Nations</h1>\n<p>Overview of UN's formation...</p>",
    "<h2>Founding and Objectives</h2>\n<p>Details on founding principles...</p>",
    "<h2>Major Events</h2>\n<p>Key events in UN history...</p>"
]

Example 5: Split by Selected Subsections (h2, h3 Only)

url = "https://www.nasa.gov/mission_pages/apollo/missions/index.html"
headers_to_split_on = [("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h2>Apollo Missions</h2>\n<p>Overview of Apollo program...</p>",
    "<h3>Apollo 11</h3>\n<p>Details of the Apollo 11 mission...</p>",
    "<h3>Apollo 13</h3>\n<p>Details of the Apollo 13 mission...</p>"
]

Example 6: Small Chunks for Specific Details (h2, h3, h4 with Small Chunk Size)

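from langchain.text_splitter import RecursiveCharacterTextSplitter

url = "https://www.fao.org/sustainable-development-goals"
headers_to_split_on = [("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4")]
# Split on headers first; the small chunk size is enforced by a separate size-based splitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

html_header_splits = text_splitter.split_documents(html_splitter.split_text_from_url(url))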

Expected Output:

[
    "<h2>FAO's Role in SDGs</h2>\n<p>FAO contributions to SDGs...</p>",
    "<h3>Goal 2: Zero Hunger</h3>\n<p>FAO efforts in reducing hunger...</p>",
    "<h4>Sub-goal 2.1</h4>\n<p>Targeting food security...</p>"
]

Example 7: Extract Only Main Topics (h1 Only)

url = "https://www.w3.org/standards/"
headers_to_split_on = [("h1", "Header 1")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>Web Standards</h1>\n<p>Overview of W3C standards...</p>",
    "<h1>W3C Technologies</h1>\n<p>Introduction to W3C's role...</p>"
]

Example 8: Multi-level Sections and Subsections (h1, h2, h3, h4, h5)

url = "https://www.nationalgeographic.com/environment/"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4"), ("h5", "Header 5")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>Environment</h1>\n<p>Introductory content on environmental topics...</p>",
    "<h2>Climate Change</h2>\n<p>Articles on climate change...</p>",
    "<h3>Impact on Oceans</h3>\n<p>Details on ocean impacts...</p>",
    "<h4>Sea Level Rise</h4>\n<p>Implications of sea level rise...</p>"
]

Example 9: Extract Headers with Specific Topics (h2, h3)

url = "https://data.unicef.org/topic/education/"
headers_to_split_on = [("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h2>Education</h2>\n<p>Overview of UNICEF's education initiatives...</p>",
    "<h3>Primary Education</h3>\n<p>Focus on primary education...</p>",
    "<h3>Secondary Education</h3>\n<p>Details on secondary education...</p>"
]

Example 10: All Headers for Full Document Parsing (h1, h2, h3, h4, h5, h6)

url = "https://www.health.harvard.edu/topics/mental-health"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4"), ("h5", "Header 5"), ("h6", "Header 6")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>Mental Health</h1>\n<p>Introduction to mental health topics...</p>",
    "<h2>Common Conditions</h2>\n<p>Overview of mental health conditions...</p>",
    "<h3>Anxiety Disorders</h3>\n<p>Details on anxiety disorders...</p>",
    "<h4>Treatment Options</h4>\n<p>Various treatment options explained...</p>"
]

Example 1: Splitting by Anchor Tags (<a>) to Extract Links

from langchain.text_splitter import HTMLHeaderTextSplitter

url = "https://en.wikipedia.org/wiki/Web_scraping"
headers_to_split_on = [("a", "Anchor")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<a href='/wiki/Data_scraping'>Data scraping</a>",
    "<a href='/wiki/Robotic_process_automation'>Robotic process automation</a>",
    "<a href='/wiki/Web_crawler'>Web crawler</a>"
]

Explanation: This example extracts all anchor tags (<a>) from the Wikipedia page on web scraping, allowing you to retrieve all internal and external links within the page.
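HTMLHeaderTextSplitter is designed around heading tags, so whether it accepts a non-heading tag such as a may depend on the installed version. As a hedged alternative, link extraction can be sketched with the standard library alone. LinkExtractor and extract_links are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (absolute_url, link_text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links such as /wiki/... against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = "<p><a href='/wiki/Web_crawler'>Web crawler</a> and <a>no target</a></p>"
links = extract_links(sample, "https://en.wikipedia.org/wiki/Web_scraping")
print(links)
```

Anchors without an href attribute are skipped, and relative paths are turned into absolute URLs so the results can be fetched directly.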

Example 2: Splitting by Table Rows (<tr>) for Structured Data Extraction

url = "https://www.worldometers.info/world-population/"
headers_to_split_on = [("tr", "Table Row")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<tr><td>World Population</td><td>7,800,000,000</td></tr>",
    "<tr><td>Asia</td><td>4,641,054,775</td></tr>",
    "<tr><td>Africa</td><td>1,340,598,147</td></tr>"
]

Explanation: This example splits each row of the world population table into a separate chunk, enabling easy access to structured population data by region.
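For structured data it is often more convenient to get each row as a list of cell values than as raw markup. A minimal standard-library sketch of that idea follows; it is an alternative illustration rather than HTMLHeaderTextSplitter behaviour, it assumes tables are not nested, and TableRowExtractor and extract_rows are hypothetical names.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collects every <tr> as a list of its <td>/<th> cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently open
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    parser = TableRowExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


table_html = (
    "<table>"
    "<tr><td>World Population</td><td>7,800,000,000</td></tr>"
    "<tr><td>Asia</td><td>4,641,054,775</td></tr>"
    "</table>"
)
rows = extract_rows(table_html)
print(rows)
```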

Example 3: Splitting by Paragraph Tags (<p>) for Detailed Content Segmentation


url = "https://www.bbc.com/news/world"
headers_to_split_on = [("p", "Paragraph")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<p>The latest global news...</p>",
    "<p>Updates on recent events around the world...</p>",
    "<p>COVID-19 cases have continued to decline...</p>"
]

Explanation: Each paragraph (<p>) is split individually, which is useful for summarizing or analyzing individual pieces of content, like news updates or article summaries.

Example 4: Extracting Hyperlinks from Specific Sections with Headers and Anchors (h2, a)

url = "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide"
headers_to_split_on = [("h2", "Header 2"), ("a", "Anchor")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h2>JavaScript Basics</h2>",
    "<a href='/docs/Web/JavaScript/Data_structures'>Data Structures</a>",
    "<a href='/docs/Web/JavaScript/Closures'>Closures</a>",
    "<h2>Control Flow</h2>",
    "<a href='/docs/Web/JavaScript/Control_flow'>Control Flow Guide</a>"
]

Explanation: This splits the content into sections by headers (h2) and captures links within each section (a), making it easier to create a navigable content structure.

Example 5: Splitting by Table Cells (<td>) for Tabular Data

url = "https://www.imdb.com/chart/top"
headers_to_split_on = [("td", "Table Cell")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<td>The Shawshank Redemption</td>",
    "<td>1994</td>",
    "<td>9.3</td>",
    "<td>The Godfather</td>",
    "<td>1972</td>",
    "<td>9.2</td>"
]

Explanation: This example parses individual table cells (<td>) from IMDb's Top 250 list, making it easy to isolate movie titles, release years, and ratings for further processing.

Example 6: Combining Headers and Paragraphs (h1, p) for Article Summaries

url = "https://www.scientificamerican.com/article/what-is-climate-change/"
headers_to_split_on = [("h1", "Header 1"), ("p", "Paragraph")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h1>What is Climate Change?</h1>",
    "<p>Climate change refers to...</p>",
    "<p>The effects of climate change...</p>"
]

Explanation: This example captures the main title (h1) and each paragraph (p), which is useful for summarizing articles by retrieving only the key content.

Example 7: Extracting Data from a Mixture of Tags (h2, td)

url = "https://www.worldbank.org/en/news/factsheet"
headers_to_split_on = [("h2", "Header 2"), ("td", "Table Cell")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_header_splits = html_splitter.split_text_from_url(url)

Expected Output:

[
    "<h2>Global Development Indicators</h2>",
    "<td>GDP Growth</td>",
    "<td>2.5%</td>",
    "<h2>Environmental Statistics</h2>",
    "<td>CO2 Emissions</td>",
    "<td>5.1 Metric Tons</td>"
]

Explanation: By combining headers (h2) with table cells (td), this example retrieves a mix of section headers and key data points, which can be valuable for presenting facts or statistics.

Example 8: Splitting by List Items (<li>) for Bullet Point Summaries

    url = "https://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/prevention.html"
    headers_to_split_on = [("li", "List Item")]
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    html_header_splits = html_splitter.split_text_from_url(url)
    

    Expected Output:

    [
        "<li>Wash your hands often...</li>",
        "<li>Maintain a safe distance...</li>",
        "<li>Wear a mask...</li>"
    ]
    

    Explanation: This example captures each list item (<li>) from the CDC's COVID-19 prevention page, making it easy to create a checklist or summary of prevention guidelines.

    Example 9: Extracting Content from Sections and Links (h2, a)

    url = "https://www.python.org/about/apps/"
    headers_to_split_on = [("h2", "Header 2"), ("a", "Anchor")]
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    html_header_splits = html_splitter.split_text_from_url(url)
    

    Expected Output:

    [
        "<h2>Web Development</h2>",
        "<a href='/about/apps/#web-development'>Learn More</a>",
        "<h2>Data Science</h2>",
        "<a href='/about/apps/#data-science'>Learn More</a>"
    ]
    

    Explanation: This example pulls out headers (h2) and links (a), making it suitable for collecting information on Python applications across different domains.

    Example 10: Extracting Headings and Links for Navigation Structure (h1, a)

    url = "https://www.fda.gov/food"
    headers_to_split_on = [("h1", "Header 1"), ("a", "Anchor")]
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    html_header_splits = html_splitter.split_text_from_url(url)
    

    Expected Output:

    [
        "<h1>Food</h1>",
        "<a href='/food/food-safety'>Food Safety</a>",
        "<a href='/food/nutrition'>Nutrition</a>",
        "<a href='/food/labeling-nutrition'>Labeling & Nutrition</a>"
    ]
    

    Extract specific table data

    import requests
    from bs4 import BeautifulSoup
    from langchain.text_splitter import HTMLHeaderTextSplitter
    
    # Step 1: Fetch the HTML content of the webpage
    url = "https://example.com"  # Replace with the actual URL containing multiple tables
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Step 2: Locate and isolate the specific table by unique identifier
    # Example: finding by class or id, or selecting by position in the list of tables
    # Adjust "specific-table-class" to the unique class or id of the desired table
    target_table = soup.find("table", {"class": "specific-table-class"})  # Or use {"id": "specific-table-id"}
    
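    # Step 3: Convert the table HTML to a string for further processing
    # (soup.find returns None when nothing matches, so fail early with a clear message)
    if target_table is None:
        raise ValueError("No table matched the given class or id")
    table_html = str(target_table)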
    
    # Step 4: Initialize HTMLHeaderTextSplitter with desired tags within the table (e.g., <tr> for rows, <td> for cells)
    headers_to_split_on = [("tr", "Table Row"), ("td", "Table Cell")]
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    # Step 5: Split the content of the specific table
    html_header_splits = html_splitter.split_text(table_html)
    
    # Step 6: Output the extracted chunks from the specific table
    for chunk in html_header_splits:
        print(chunk)
    

    Expected Output
    For example, if the table has this structure:

    <table class="specific-table-class">
        <tr><td>Country</td><td>GDP</td><td>Population</td></tr>
        <tr><td>USA</td><td>21.43 Trillion</td><td>331 Million</td></tr>
        <tr><td>China</td><td>14.34 Trillion</td><td>1.4 Billion</td></tr>
    </table>
    

    The output would look like this:

    [
        "<tr><td>Country</td><td>GDP</td><td>Population</td></tr>",
        "<td>Country</td>",
        "<td>GDP</td>",
        "<td>Population</td>",
        "<tr><td>USA</td><td>21.43 Trillion</td><td>331 Million</td></tr>",
        "<td>USA</td>",
        "<td>21.43 Trillion</td>",
        "<td>331 Million</td>",
        "<tr><td>China</td><td>14.34 Trillion</td><td>1.4 Billion</td></tr>",
        "<td>China</td>",
        "<td>14.34 Trillion</td>",
        "<td>1.4 Billion</td>"
    ]
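When the goal is usable data rather than markup, the selected table can be turned into one record per row. The sketch below does this with the standard library only, as an alternative illustration: it picks the first table whose class list contains the target class, treats the first row as column names, and assumes tables are not nested. ClassTableParser and table_to_records are hypothetical names.

```python
from html.parser import HTMLParser


class ClassTableParser(HTMLParser):
    """Collects rows from the first <table> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.rows = []
        self._inside = False  # True while inside the target table
        self._done = False    # True once the target table has been closed
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and not self._done:
            classes = (dict(attrs).get("class") or "").split()
            self._inside = self.target_class in classes
        elif self._inside and tag == "tr":
            self._row = []
        elif self._inside and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self._inside:
            return
        if tag == "table":
            self._inside = False
            self._done = True
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html_text, target_class):
    parser = ClassTableParser(target_class)
    parser.feed(html_text)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


page = (
    '<table class="other"><tr><td>ignored</td></tr></table>'
    '<table class="specific-table-class">'
    "<tr><td>Country</td><td>GDP</td><td>Population</td></tr>"
    "<tr><td>USA</td><td>21.43 Trillion</td><td>331 Million</td></tr>"
    "<tr><td>China</td><td>14.34 Trillion</td><td>1.4 Billion</td></tr>"
    "</table>"
)
records = table_to_records(page, "specific-table-class")
print(records)
```

Keyed records like these are easier to filter or load into a data frame than a flat list of row and cell fragments.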
    

    How to extract metadata from a web URL

    import requests
    from datetime import datetime
    from langchain.text_splitter import CharacterTextSplitter
    from bs4 import BeautifulSoup
    
    # Define a function to fetch and parse the URL to retrieve dynamic metadata
    def get_dynamic_metadata(url):
        try:
            # Fetch the HTML content of the URL
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
    
            # Extract title and description from the HTML
            title = soup.title.string if soup.title else "No Title"
            description = soup.find('meta', attrs={'name': 'description'}) or soup.find('meta', attrs={'property': 'og:description'})
            description_content = description['content'] if description else "No Description"
    
        except Exception as e:
            title = "Error fetching title"
            description_content = "Error fetching description"
            print(f"Error fetching data from URL: {e}")
    
        return {
            "timestamp": datetime.now().isoformat(),
            "source_url": url,
            "page_title": title,
            "page_description": description_content
        }
    
    # Sample text (normally fetched from a webpage URL)
    sample_text = """
    LangChain is a powerful library for building applications with language models. It supports text splitting, tokenization, metadata handling, and more. With LangChain, you can process large documents and work with structured text effectively. Visit our website for more detailed information.
    """
    
    # Define the URL from which you want to fetch dynamic metadata
    url = "https://example.com"  # Replace with your URL
    
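    # Initialize the text splitter with the desired chunk size.
    # chunk_overlap is set explicitly because the default (200) is larger than chunk_size=50,
    # which raises a ValueError; separator=" " lets the splitter break this single paragraph on spaces.
    splitter = CharacterTextSplitter(separator=" ", chunk_size=50, chunk_overlap=0)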
    
    # Split the sample text into chunks
    texts = splitter.split_text(sample_text)
    
    # Fetch dynamic metadata from the URL
    dynamic_metadata = get_dynamic_metadata(url)
    
    # Create documents with dynamic metadata and chunked text
    docs = [{"text": chunk, "metadata": dynamic_metadata} for chunk in texts]
    
    # Separate out the metadata from the documents into a collection
    metadata_collection = [doc["metadata"] for doc in docs]
    
    # Print the documents and the collected metadata separately
    print("Documents with Text and Metadata:")
    for doc in docs:
        print(doc)
    
    print("\nMetadata Collection (separate list):")
    for meta in metadata_collection:
        print(meta)
    

    Output (illustrative; exact chunk boundaries depend on the splitter settings):

    {
        "text": "LangChain is a powerful library for building applicatio",
        "metadata": {
            "timestamp": "2024-11-14T13:45:00",
            "source_url": "https://example.com",
            "page_title": "Example Domain",
            "page_description": "Example Domain is a web page for demonstration purposes."
        }
    },
    {
        "text": "ns with language models. It supports text splitting, tokeniz",
        "metadata": {
            "timestamp": "2024-11-14T13:45:00",
            "source_url": "https://example.com",
            "page_title": "Example Domain",
            "page_description": "Example Domain is a web page for demonstration purposes."
        }
    },
    

    ...
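The title and description lookup can also be done without third-party packages. The following is a minimal sketch using the standard-library HTML parser on an HTML string; PageMetadataParser and extract_page_metadata are hypothetical names, the fallback strings mirror the ones used above, and the timestamp field is left out so the result is deterministic.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Picks up the <title> text and the first description meta tag."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and self.description is None:
            # Accept both <meta name="description"> and <meta property="og:description">.
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key in ("description", "og:description"):
                self.description = attrs.get("content")

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            if self.title is None:
                self.title = "".join(self._title_parts).strip() or None


def extract_page_metadata(html_text, url):
    parser = PageMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {
        "source_url": url,
        "page_title": parser.title or "No Title",
        "page_description": parser.description or "No Description",
    }


html_text = (
    "<html><head><title> Example Domain </title>"
    '<meta name="description" content="Demo page."></head><body></body></html>'
)
metadata = extract_page_metadata(html_text, "https://example.com")
print(metadata)
```

Because the function works on a string, it can be unit-tested without network access and then combined with whatever fetching code the application already uses.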

    SUMMARY

    Extract data from HTML data
    headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3"), ("h4", "Header 4")]
    html_header_splits = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on).split_text(html_text)
    Extract data from a web URL
    url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
    headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
    html_header_splits = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on).split_text_from_url(url)
    Splitting by Anchor Tags (<a>)
    headers_to_split_on = [("a", "Anchor")]
    HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on).split_text_from_url(url)
    Splitting by Table Rows (<tr>) for Structured Data Extraction
    headers_to_split_on = [("tr", "Table Row")]
    HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on).split_text_from_url(url)
    Splitting by Paragraph Tags (<p>)
    headers_to_split_on = [("p", "Paragraph")]
    HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on).split_text_from_url(url)
    Extracting Hyperlinks from Specific Sections with Headers and Anchors (h2, a)
    headers_to_split_on = [("h2", "Header 2"), ("a", "Anchor")]
    

