
List of Data Ingestion Commands Using the LangChain Framework

Load Text Data

from langchain.document_loaders import TextLoader

# Example: Loading text data
loader = TextLoader('your_data_file.txt')
documents = loader.load()
raw_text = [doc.page_content for doc in documents]

Or, chaining the load call:

from langchain.document_loaders import TextLoader

# Example: loading text data in one step (load() returns documents, not a loader)
documents = TextLoader('your_data_file.txt').load()


Load CSV Data

from langchain.document_loaders import CSVLoader

# Load data from a CSV file (each row as a document)
loader = CSVLoader("path/to/your_file.csv")
documents = loader.load()
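If you need more control over parsing, CSVLoader also accepts a source_column (stored in each document's metadata) and csv_args, which are passed through to Python's csv.DictReader. A minimal sketch, assuming the file has an id column:

from langchain.document_loaders import CSVLoader

# source_column and csv_args are optional; 'id' is a placeholder column name
loader = CSVLoader(
    file_path="path/to/your_file.csv",
    source_column="id",
    csv_args={"delimiter": ","},
)
documents = loader.load()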

Load JSON File

[
    {"id": 1, "text": "This is the first document.", "author": "Alice"},
    {"id": 2, "text": "This is the second document.", "author": "Bob"},
    {"id": 3, "text": "Another document here.", "author": "Charlie"}
]
from langchain.document_loaders import JSONLoader

# Load JSON data and extract the `text` field from each array element using jq_schema
loader = JSONLoader("example.json", jq_schema=".[].text")
documents = loader.load()

# Print the documents
for doc in documents:
    print(doc.page_content)

Output:

This is the first document.
This is the second document.
Another document here.
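JSONLoader can also keep fields like author as document metadata. A minimal sketch against the same example.json, using jq_schema=".[]" to iterate the array, content_key to pick the text field, and a metadata_func to copy the author over:

from langchain.document_loaders import JSONLoader

def add_author(record: dict, metadata: dict) -> dict:
    # Copy the author field from each JSON record into the document metadata
    metadata["author"] = record.get("author")
    return metadata

loader = JSONLoader(
    "example.json",
    jq_schema=".[]",
    content_key="text",
    metadata_func=add_author,
)
documents = loader.load()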

Load PDF File

from langchain.document_loaders import PyMuPDFLoader

# Load data from a PDF file
loader = PyMuPDFLoader("path/to/your_file.pdf")
documents = loader.load()

Load Web Page by URL

from langchain.document_loaders import UnstructuredURLLoader

# Load data from a URL (web page)
urls = ["https://example.com/article"]
loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

Load HTML File

from langchain.document_loaders import UnstructuredHTMLLoader

# Load data from an HTML file
loader = UnstructuredHTMLLoader("path/to/your_file.html")
documents = loader.load()

Load Text from S3 Bucket

from langchain.document_loaders import S3Loader

# Replace with your actual bucket name and key
bucket_name = "my-langchain-documents"
key = "data/sample_file.txt"

# Initialize the loader
loader = S3Loader(bucket_name=bucket_name, key=key)

# Load the documents
documents = loader.load()

# Print the loaded documents
for doc in documents:
    print(doc.page_content)


Or, with explicit AWS credentials:

import boto3
from langchain.document_loaders import S3Loader

# Step 1: Set up AWS credentials
# Replace with your actual AWS Access Key, Secret Key, and Region
aws_access_key_id = "YOUR_AWS_ACCESS_KEY_ID"
aws_secret_access_key = "YOUR_AWS_SECRET_ACCESS_KEY"
aws_region = "YOUR_AWS_REGION"  # Example: 'us-east-1'

# Step 2: Initialize Boto3 Session (This ensures credentials are used)
session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=aws_region,
)

# Step 3: Define S3 bucket and file key
bucket_name = "my-langchain-documents"  # Replace with your bucket name
key = "data/sample_file.txt"  # Replace with your file path in the bucket

# Step 4: Initialize LangChain's S3Loader
loader = S3Loader(
    bucket_name=bucket_name,
    key=key,
    boto3_session=session  # Pass the session with credentials
)

# Step 5: Load the document
documents = loader.load()

# Step 6: Print the content of the loaded document
for doc in documents:
    print(doc.page_content)
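Note: released LangChain versions expose the S3 object loader as S3FileLoader (which parses the downloaded object with the unstructured package) rather than S3Loader, so if the import above fails, try this minimal sketch instead:

from langchain.document_loaders import S3FileLoader

# Downloads and parses s3://my-langchain-documents/data/sample_file.txt
loader = S3FileLoader(bucket="my-langchain-documents", key="data/sample_file.txt")
documents = loader.load()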

Load Data from Google Sheets

from langchain.document_loaders import GoogleSheetsLoader

# Load data from a Google Sheets document
loader = GoogleSheetsLoader(spreadsheet_id="your-spreadsheet-id", credentials_path="path/to/credentials.json")
documents = loader.load()


from langchain.document_loaders import GoogleSheetsLoader

# Step 1: Set up your spreadsheet ID and credentials path
spreadsheet_id = "your-spreadsheet-id"  # Replace with your Google Sheets ID
credentials_path = "path/to/credentials.json"  # Replace with the path to your credentials.json file

# Step 2: Initialize GoogleSheetsLoader
loader = GoogleSheetsLoader(spreadsheet_id=spreadsheet_id, credentials_path=credentials_path)

# Step 3: Load the data as documents
documents = loader.load()

# Step 4: Print the loaded documents
for doc in documents:
    print(doc.page_content)


Load Text from Word Document (DOCX)

from langchain.document_loaders import UnstructuredWordDocumentLoader

# Load data from a Word document
loader = UnstructuredWordDocumentLoader("path/to/your_file.docx")
documents = loader.load()

Load Text from Email (EML) File

from langchain.document_loaders import UnstructuredEmailLoader

# Load data from an email file (.eml)
loader = UnstructuredEmailLoader("path/to/your_file.eml")
documents = loader.load()

Images (OCR): Pytesseract with Pillow or UnstructuredImageLoader
Description: Requires OCR (Optical Character Recognition) tools to convert image text to strings. Pytesseract combined with Pillow or UnstructuredImageLoader can extract text from image files.
Example:

from PIL import Image
import pytesseract
image = Image.open("path/to/image.png")
text = pytesseract.image_to_string(image)
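LangChain also wraps this OCR step in UnstructuredImageLoader (requires the unstructured package with image support). A minimal sketch:

from langchain.document_loaders import UnstructuredImageLoader

# Runs OCR on the image and returns the extracted text as documents
loader = UnstructuredImageLoader("path/to/image.png")
documents = loader.load()
print(documents[0].page_content)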


CSVLoader: Loads structured data from CSV files, often used for tabular data.
PyMuPDFLoader: Extracts text from PDF files using the PyMuPDF library.
PDFMinerLoader: Extracts text from PDF files with detailed control over layout using PDFMiner (see the sketch after this list).
UnstructuredURLLoader: Loads and extracts content from web pages at specified URLs.
UnstructuredImageLoader: Converts text in images to strings using OCR (Optical Character Recognition).
UnstructuredHTMLLoader: Loads and parses HTML files, targeting specific tags if needed.
UnstructuredWordDocumentLoader: Extracts text from .docx Microsoft Word documents.
UnstructuredEmailLoader: Loads email content from .eml files, including body and metadata.
JSONLoader: Extracts data from JSON files, allowing filtering by specific keys.
YouTubeLoader: Retrieves video transcripts and metadata from YouTube videos.
GoogleSheetsLoader: Loads spreadsheet data from Google Sheets, requiring OAuth authentication.
NotionLoader: Loads records from a Notion database using the Notion API.
MongoDBLoader: Extracts documents from MongoDB collections.
SlackLoader: Loads messages from specified Slack channels, with keyword and date filtering.
GitHubRepoLoader: Loads files from GitHub repositories.
RSSFeedLoader: Retrieves articles from RSS feeds, useful for loading updates or news.
GoogleDriveLoader: Loads files from Google Drive, requiring access via OAuth.
SQLDatabaseLoader: Loads records from SQL databases by querying specified tables.
ElasticsearchLoader: Loads documents from an Elasticsearch index for search-based data retrieval.
AirtableLoader: Loads records from Airtable databases via the Airtable API.
EmailLoader: Connects to IMAP servers to load emails directly from an inbox.
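Of the loaders above, PDFMinerLoader is the only PDF loader in this list not demonstrated elsewhere in this post; a minimal sketch:

from langchain.document_loaders import PDFMinerLoader

# PDFMiner extracts text with finer-grained layout control than PyMuPDF
loader = PDFMinerLoader("path/to/your_file.pdf")
documents = loader.load()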

Load Data from SQL Database

from langchain.document_loaders import SQLDatabaseLoader
from sqlalchemy import create_engine

# Create an engine for a SQL database (e.g., PostgreSQL, MySQL)
engine = create_engine("postgresql://username:password@localhost/dbname")

# Load data from SQL table (specify table name or query)
loader = SQLDatabaseLoader(engine, table_name="your_table_name")
documents = loader.load()

Load Data from a REST API

from langchain.document_loaders import RESTAPILoader

# Load data from a REST API endpoint
url = "https://api.example.com/data"
headers = {"Authorization": "Bearer your_token"}
loader = RESTAPILoader(url=url, headers=headers)
documents = loader.load()

Load Text Data from Google Drive

from langchain.document_loaders import GoogleDriveLoader

# Load data from a Google Drive file (e.g., Google Docs or Google Sheets)
loader = GoogleDriveLoader(file_id="your_file_id", credentials_path="path/to/credentials.json")
documents = loader.load()

Load Data from MongoDB

from langchain.document_loaders import MongoDBLoader

# Load data from a MongoDB collection
loader = MongoDBLoader(
    connection_string="mongodb://localhost:27017",
    database="your_database",
    collection="your_collection"
)
documents = loader.load()

Load Data from AWS DynamoDB

from langchain.document_loaders import DynamoDBLoader

# Load data from an AWS DynamoDB table
loader = DynamoDBLoader(table_name="your_table_name", aws_access_key="your_key", aws_secret_key="your_secret")
documents = loader.load()

Load Text from Google Cloud Storage (GCS)

from langchain.document_loaders import GCSLoader

# Load text data from a Google Cloud Storage bucket
loader = GCSLoader(bucket_name="your_bucket_name", key="path/to/your_file.txt")
documents = loader.load()
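Note: in released LangChain versions this loader is named GCSFileLoader and also takes a GCP project name. A minimal sketch with placeholder values:

from langchain.document_loaders import GCSFileLoader

loader = GCSFileLoader(
    project_name="your-gcp-project",
    bucket="your_bucket_name",
    blob="path/to/your_file.txt",
)
documents = loader.load()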

Load Data from Snowflake Database

from langchain.document_loaders import SnowflakeLoader

# Load data from a Snowflake database
loader = SnowflakeLoader(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
    database="your_database",
    schema="your_schema",
    table="your_table"
)
documents = loader.load()

Load Data from SharePoint Document

from langchain.document_loaders import SharePointLoader

# Load data from a SharePoint document or library
loader = SharePointLoader(
    site_url="https://yourcompany.sharepoint.com/sites/your_site",
    client_id="your_client_id",
    client_secret="your_client_secret",
    file_path="path/to/your_file.docx"
)
documents = loader.load()

Load Data from Elasticsearch Index

from langchain.document_loaders import ElasticsearchLoader

# Load data from an Elasticsearch index
loader = ElasticsearchLoader(
    hosts=["http://localhost:9200"],
    index="your_index",
    query={"match_all": {}}
)
documents = loader.load()

Load Data from an RSS Feed

from langchain.document_loaders import RSSFeedLoader

# Load data from an RSS feed
feed_url = "https://example.com/feed.rss"
loader = RSSFeedLoader(feed_url)
documents = loader.load()

Summary
These commands cover loading text data from a diverse range of sources:

Databases (SQL, MongoDB, Snowflake, Elasticsearch)
Cloud Storage (Google Cloud Storage, SharePoint, AWS DynamoDB)
Web-Based Sources (REST API, RSS Feed, Google Drive)

Loading Research Papers from Arxiv
The ArxivLoader lets you load research papers by specifying either the query or the paper ID.

from langchain.document_loaders import ArxivLoader

# Load up to 2 documents from Arxiv for the paper ID "1706.03762" (the "Attention Is All You Need" paper)
docs_arxiv = ArxivLoader(query="1706.03762", load_max_docs=2).load()

# Displaying the content
for doc in docs_arxiv:
    print("Arxiv Content:", doc.page_content)

Loading Content from a Web Page (using WebBaseLoader)
The WebBaseLoader can load specific parts of a webpage by targeting HTML elements with certain classes using bs4.SoupStrainer. Here’s how to use it for loading sections from a webpage.


from langchain.document_loaders import WebBaseLoader
import bs4

# Load specific parts of the web page by class name
loader_web = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header")))
)
docs_web = loader_web.load()

# Displaying the content
for doc in docs_web:
    print("Web Page Content:", doc.page_content)

Loading Content from Wikipedia
The WikipediaLoader loads summaries or sections from a Wikipedia article based on the query term.

from langchain.document_loaders import WikipediaLoader

# Load up to 2 documents from Wikipedia for the topic "Generative AI"
docs_wikipedia = WikipediaLoader(query="Generative AI", load_max_docs=2).load()

# Displaying the content
for doc in docs_wikipedia:
    print("Wikipedia Content:", doc.page_content)

Combining All Loaders Together
If you want to load data from all these sources at once and combine them, you can load each into a list and then combine all the documents into one list.

# Combine all documents from different sources
all_docs = docs_arxiv + docs_web + docs_wikipedia

# Displaying the combined content
for i, doc in enumerate(all_docs):
    print(f"Document {i+1} Content:", doc.page_content)

Summary of Each Loader Used

ArxivLoader: Loads scientific papers by query or paper ID from Arxiv.
WebBaseLoader: Loads specific content from a webpage using HTML filters.
WikipediaLoader: Loads summaries or sections of a Wikipedia article.

Load from Notion Database

from langchain.document_loaders import NotionLoader

# Load content from a Notion database (requires integration token)
loader_notion = NotionLoader(integration_token="your_token", database_id="your_database_id")
docs_notion = loader_notion.load()
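The Notion database loader actually shipped with LangChain is NotionDBLoader; a minimal sketch with placeholder credentials:

from langchain.document_loaders import NotionDBLoader

loader = NotionDBLoader(
    integration_token="your_token",
    database_id="your_database_id",
)
docs_notion = loader.load()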

Load from Google Docs

from langchain.document_loaders import GoogleDriveLoader

# Load a Google Docs file by its ID (requires Google OAuth credentials)
loader_gdoc = GoogleDriveLoader(file_id="your_file_id", credentials_path="path/to/credentials.json")
docs_gdoc = loader_gdoc.load()

Load from YouTube Transcript

from langchain.document_loaders import YouTubeLoader

# Load transcript from a YouTube video (requires video ID)
docs_youtube = YouTubeLoader(video_id="your_video_id").load()
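In released LangChain versions the class is spelled YoutubeLoader, and it can be constructed directly from a watch URL; a minimal sketch (the URL is a placeholder):

from langchain.document_loaders import YoutubeLoader

# Fetches the transcript for the given video
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=your_video_id",
    add_video_info=False,
)
docs_youtube = loader.load()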

Load from a Slack Channel

from langchain.document_loaders import SlackLoader

# Load messages from a Slack channel (requires API token)
loader_slack = SlackLoader(api_token="your_slack_token", channel_id="your_channel_id")
docs_slack = loader_slack.load()
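LangChain's shipped Slack loader, SlackDirectoryLoader, works from a workspace export zip rather than the live API; a minimal sketch with placeholder paths:

from langchain.document_loaders import SlackDirectoryLoader

# workspace_url is optional and only used to build message permalinks
loader = SlackDirectoryLoader(
    "slack_export.zip",
    workspace_url="https://yourworkspace.slack.com",
)
docs_slack = loader.load()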

Load from GitHub Repository Files

from langchain.document_loaders import GitHubRepoLoader

# Load files from a GitHub repository (requires access token)
loader_github = GitHubRepoLoader(repo_url="https://github.com/username/repo", access_token="your_access_token")
docs_github = loader_github.load()
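A shipped alternative for repositories is GitLoader, which clones the repo locally and loads its files; a minimal sketch with a placeholder repo:

from langchain.document_loaders import GitLoader

loader = GitLoader(
    clone_url="https://github.com/username/repo",
    repo_path="./example_repo",  # local checkout directory
    branch="main",
)
docs_github = loader.load()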

Load from an RSS Feed

from langchain.document_loaders import RSSFeedLoader

# Load recent articles from an RSS feed
feed_url = "https://example.com/rss"
loader_rss = RSSFeedLoader(feed_url)
docs_rss = loader_rss.load()

Load from Airtable Database

from langchain.document_loaders import AirtableLoader

# Load records from an Airtable table (requires API key and table ID)
loader_airtable = AirtableLoader(api_key="your_api_key", base_id="your_base_id", table_name="your_table_name")
docs_airtable = loader_airtable.load()

Example: Combining All Loaded Documents
After loading documents from each source, you may want to combine them into a single list for further processing.

# Combine all documents into one list
all_docs =  docs_notion + docs_gdoc + docs_youtube + docs_slack + docs_github + docs_rss + docs_airtable

# Displaying content from each document
for i, doc in enumerate(all_docs):
    print(f"Document {i+1} Content:", doc.page_content)

Summary of Sources

Arxiv - Research papers by query or ID.
Web Page - Specific parts of a web page.
Wikipedia - Wikipedia article sections.
Notion - Notion database entries.
Google Docs - Google Docs file content.
YouTube - Transcript from YouTube videos.
Slack - Messages from a Slack channel.
GitHub - Files from a GitHub repository.
RSS Feed - Articles from an RSS feed.
Airtable - Records from an Airtable table.

Extract specific parts of data from data sources

data = [
    {
        'title': 'Title of the paper',
        'authors': ['Author 1', 'Author 2'],
        'abstract': 'This is the abstract of the paper.',
        'published': '2017-06-01',
        'url': 'https://arxiv.org/abs/1706.03762'
    },
    {
        'title': 'Title of related paper 1',
        'authors': ['Author 1', 'Author 2'],
        'abstract': 'Abstract for related paper 1.',
        'published': '2017-06-02',
        'url': 'https://arxiv.org/abs/1706.03763'
    },
    {
        'title': 'Title of related paper 2',
        'authors': ['Author 3', 'Author 4'],
        'abstract': 'Abstract for related paper 2.',
        'published': '2017-06-03',
        'url': 'https://arxiv.org/abs/1706.03764'
    }
]

All the documents published in 2017:

from datetime import datetime

year = 2017
docs_2017 = [doc for doc in data if datetime.strptime(doc['published'], '%Y-%m-%d').year == year]

Documents where "ram" is listed as one of the authors:

author_name = 'ram'
ram_docs = [doc for doc in data if author_name in doc['authors']]

Load Only Titles from Arxiv Papers

from langchain.document_loaders import ArxivLoader

# Load Arxiv documents and extract only titles
docs_arxiv = ArxivLoader(query="1706.03762", load_max_docs=3).load()
titles = [doc.metadata['title'] for doc in docs_arxiv]
print("Arxiv Titles:", titles)

Load Only Content Body from Specific Web Page Sections

from langchain.document_loaders import WebBaseLoader
import bs4

# Load specific parts of a web page (e.g., only main content sections)
loader_web = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="post-content"))
)
docs_web = loader_web.load()
content_sections = [doc.page_content for doc in docs_web]
print("Web Content Sections:", content_sections)

Load Only Summaries from Wikipedia Articles

docs_wikipedia = [
    {
        "page_content": "This is the detailed content of the Wikipedia page on the subject.",
        "metadata": {"section": "Summary"}
    },
    {
        "page_content": "This content covers another aspect of the topic in more detail.",
        "metadata": {"section": "Introduction"}
    },
    {
        "page_content": "This is the section providing a summary of the article, focusing on key facts.",
        "metadata": {"section": "Summary"}
    }
]
from langchain.document_loaders import WikipediaLoader

# Load Wikipedia articles and extract summaries
docs_wikipedia = WikipediaLoader(query="Generative AI", load_max_docs=2).load()
summaries = [doc.page_content for doc in docs_wikipedia if 'summary' in doc.metadata.get('section', '').lower()]
print("Wikipedia Summaries:", summaries)

Load Only Specific Column from Google Sheets

docs_gsheet = [
    {
        "page_content": {"A": "This is the content for A", "B": "Content for B"}
    },
    {
        "page_content": {"A": "Another content for A", "C": "Content for C"}
    },
    {
        "page_content": {"B": "Only content for B"}
    }
]
from langchain.document_loaders import GoogleSheetsLoader

# Load specific column (e.g., 'A' column) from a Google Sheets document
loader_gsheet = GoogleSheetsLoader(spreadsheet_id="your_spreadsheet_id", credentials_path="path/to/credentials.json")
docs_gsheet = loader_gsheet.load()
column_a = [doc.page_content['A'] for doc in docs_gsheet if 'A' in doc.page_content]
print("Column A Values:", column_a)

Extract Transcripts from YouTube by Section

docs_youtube = [
    {
        "page_content": "Introduction to Artificial Intelligence.\nAI is revolutionizing industries.\nMachine Learning is a part of AI."
    },
    {
        "page_content": "Deep Learning is a subset of AI.\nAI in healthcare.\nAI ethics and privacy concerns."
    },
    {
        "page_content": "This document does not mention the keyword."
    }
]
from langchain.document_loaders import YouTubeLoader

# Load transcripts and extract specific sections by keywords
docs_youtube = YouTubeLoader(video_id="your_video_id").load()
important_sections = [line for doc in docs_youtube for line in doc.page_content.split('\n') if "AI" in line]
print("Important Transcript Sections:", important_sections)

Extract Messages with Specific Keywords from Slack

from langchain.document_loaders import SlackLoader

# Load Slack messages containing specific keywords
loader_slack = SlackLoader(api_token="your_slack_token", channel_id="your_channel_id")
docs_slack = loader_slack.load()
keyword_messages = [doc.page_content for doc in docs_slack if "update" in doc.page_content.lower()]
print("Keyword Messages:", keyword_messages)

Output (from the YouTube transcript filter above):
[
"Introduction to Artificial Intelligence.",
"AI is revolutionizing industries.",
"Machine Learning is a part of AI.",
"Deep Learning is a subset of AI.",
"AI in healthcare.",
"AI ethics and privacy concerns."
]

Load Files from GitHub Repo by File Type (e.g., .md files)

from langchain.document_loaders import GitHubRepoLoader

# Load files from GitHub and filter for only Markdown files
loader_github = GitHubRepoLoader(repo_url="https://github.com/username/repo", access_token="your_access_token")
docs_github = loader_github.load()
markdown_files = [doc.page_content for doc in docs_github if doc.metadata['file_path'].endswith('.md')]
print("Markdown Files Content:", markdown_files)

Load RSS Feed Titles Only

from langchain.document_loaders import RSSFeedLoader

# Load only the titles of articles from an RSS feed
feed_url = "https://example.com/rss"
loader_rss = RSSFeedLoader(feed_url)
docs_rss = loader_rss.load()
titles = [doc.metadata['title'] for doc in docs_rss]
print("RSS Feed Titles:", titles)

Load Specific Fields from Airtable Records (e.g., "Name" field)

from langchain.document_loaders import AirtableLoader

# Load only specific fields from Airtable records
loader_airtable = AirtableLoader(api_key="your_api_key", base_id="your_base_id", table_name="your_table_name")
docs_airtable = loader_airtable.load()
names = [doc.page_content['Name'] for doc in docs_airtable if 'Name' in doc.page_content]
print("Names from Airtable:", names)

Load Specific Headings from PDF Content

from langchain.document_loaders import PyMuPDFLoader

# Extract heading-like lines (those ending with ':') from a PDF file
loader_pdf = PyMuPDFLoader("path/to/your_file.pdf")
docs_pdf = loader_pdf.load()
headings = [line for doc in docs_pdf for line in doc.page_content.split('\n') if line.strip().endswith(':')]
print("PDF Headings:", headings)

Explanation and Use Cases
Filtering by Field or Metadata: Examples 1, 3, 4, 7, and 9 filter specific fields like titles, summaries, or names based on metadata.
Keyword Extraction: Examples 5, 6, and 10 focus on content sections with certain keywords or formats.
Specific Section Loading: Examples 2 and 8 load only the desired section, such as titles or main content, by targeting specific HTML elements or XML tags.

Extract website parts using the BeautifulSoup library

Load Only Main Content by Class Name

from langchain.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_paths=["https://example.com"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="main-content"))
)
docs_main_content = loader.load()
for doc in docs_main_content:
    print("Main Content:", doc.page_content)

Load Only Article Titles and Content by Multiple Classes

loader = WebBaseLoader(
    web_paths=["https://example.com/articles"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("article-title", "article-content")))
)
docs_articles = loader.load()
for doc in docs_articles:
    print("Titles and Content:", doc.page_content)

Load Only Header and Footer Sections

loader = WebBaseLoader(
    web_paths=["https://example.com"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(["header", "footer"]))
)
docs_header_footer = loader.load()
for doc in docs_header_footer:
    print("Header and Footer:", doc.page_content)

Load Only Links in Navigation Bar

loader = WebBaseLoader(web_paths=["https://example.com"])
# page_content holds extracted text, not markup, so use the loader's
# scrape() helper to get the parsed page and pull hrefs from navbar links
soup = loader.scrape()
nav_links = [a["href"] for a in soup.find_all("a", class_="navbar", href=True)]
print("Navbar Links:", nav_links)

Load Only Text from a JavaScript-Rendered Table

from selenium import webdriver
import bs4

driver = webdriver.Chrome()
driver.get("https://example.com/table-page")

# WebBaseLoader expects URLs rather than raw HTML, so parse the
# JavaScript-rendered page source directly and keep only <table> elements
soup = bs4.BeautifulSoup(
    driver.page_source, "html.parser",
    parse_only=bs4.SoupStrainer("table")
)
print("Table Data:", soup.get_text())
driver.quit()

Load Only Image URLs

loader = WebBaseLoader(web_paths=["https://example.com"])
# As above, use scrape() to get the parsed page, since page_content is plain text
soup = loader.scrape()
image_urls = [img["src"] for img in soup.find_all("img", src=True)]
print("Image URLs:", image_urls)

Load Only Text from Multiple Pages in a Loop

urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
loader = WebBaseLoader(
    web_paths=urls,
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(["p", "h1", "h2"]))
)
docs_multiple_pages = loader.load()
for i, doc in enumerate(docs_multiple_pages):
    print(f"Content of Page {i+1}:", doc.page_content)

Load Only Elements with Specific Attributes

loader = WebBaseLoader(
    web_paths=["https://example.com"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(attrs={"id": "specific-id"}))
)
docs_specific_id = loader.load()
for doc in docs_specific_id:
    print("Elements with Specific ID:", doc.page_content)

Click Button and Extract Updated Content (Using Selenium)

from selenium import webdriver
from selenium.webdriver.common.by import By
import bs4
import time

driver = webdriver.Chrome()
driver.get("https://example.com")

# Click a button to load new content
button = driver.find_element(By.ID, "load-more-button")
button.click()
time.sleep(2)  # Wait for content to load

# Parse the updated page source directly, keeping only the dynamic section
soup = bs4.BeautifulSoup(
    driver.page_source, "html.parser",
    parse_only=bs4.SoupStrainer(class_="dynamic-content")
)
print("Updated Content:", soup.get_text())

driver.quit()

Load Only Dropdown Options

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import bs4

driver = webdriver.Chrome()
driver.get("https://example.com")

dropdown = Select(driver.find_element(By.ID, "dropdown-menu"))
options = [option.text for option in dropdown.options]
print("Dropdown Options:", options)

# Parse the page source directly to inspect the raw <select> markup
soup = bs4.BeautifulSoup(
    driver.page_source, "html.parser",
    parse_only=bs4.SoupStrainer("select")
)
print("Dropdown Options HTML:", soup)

driver.quit()

Summary
Each example here combines LangChain's WebBaseLoader (or, for JavaScript-rendered pages, Selenium with BeautifulSoup) and SoupStrainer filters to extract specific parts of a web page:
Specific classes, tags, or IDs
Links, images, or table data
Content from dynamic sections loaded with JavaScript (using Selenium)

Extract Wikipedia sections

WikipediaLoader with Higher Document Limit

from langchain.document_loaders import WikipediaLoader

# Load up to 5 documents about "Machine Learning"
docs = WikipediaLoader(query="Machine Learning", load_max_docs=5).load()

WikipediaLoader with Different Query Term

# Load up to 2 documents on "Artificial Intelligence"
docs = WikipediaLoader(query="Artificial Intelligence", load_max_docs=2).load()

WikipediaLoader with Summary Sections Only

# Load only summary sections of "Deep Learning" with 3 documents max
docs = WikipediaLoader(query="Deep Learning", load_max_docs=3, include_summaries=True).load()

WikipediaLoader with Subsection Inclusion

# Load up to 4 documents, including subsections of "Neural Networks"
docs = WikipediaLoader(query="Neural Networks", load_max_docs=4, include_subsections=True).load()

WebBaseLoader with Different Web Page and Class Filter

from langchain.document_loaders import WebBaseLoader
import bs4

# Load only the title and content from another blog page
loader = WebBaseLoader(
    web_paths=["https://example.com/another-post"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content")))
)
docs = loader.load()

WebBaseLoader with Specific Tags (Header and Footer Only)

# Load only header and footer sections
loader = WebBaseLoader(
    web_paths=["https://example.com/page"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(["header", "footer"]))
)
docs = loader.load()

WebBaseLoader with Multiple Pages

# Load from multiple web pages, extracting only article content
loader = WebBaseLoader(
    web_paths=["https://example.com/article1", "https://example.com/article2"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="article-content"))
)
docs = loader.load()

WebBaseLoader with ID Filter

# Load a specific section identified by an HTML ID
loader = WebBaseLoader(
    web_paths=["https://example.com/special-article"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(id="important-section"))
)
docs = loader.load()

WebBaseLoader with Multiple Classes and Custom Tags

# Load post title, date, and content from a blog post
loader = WebBaseLoader(
    web_paths=["https://blog.example.com/post"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-date", "post-content")))
)
docs = loader.load()

WikipediaLoader with Content Sections Only

# Load content sections only from "Large Language Models" page, max 3 documents
docs = WikipediaLoader(query="Large Language Models", load_max_docs=3, include_summaries=False
Enter fullscreen mode Exit fullscreen mode

Common Parameters Across Different Data Loaders

query (WikipediaLoader, ArxivLoader, etc.)
Description: Specifies the search term or query to retrieve specific documents.
Example: query="Machine Learning"

load_max_docs (WikipediaLoader, ArxivLoader, GoogleSheetsLoader, etc.)
Description: Limits the maximum number of documents to load.
Example: load_max_docs=5

include_summaries (WikipediaLoader)
Description: If True, only summary sections are loaded.
Example: include_summaries=True

include_subsections (WikipediaLoader, ArxivLoader)
Description: Loads subsections in addition to main sections.
Example: include_subsections=True

web_paths (WebBaseLoader)
Description: Specifies a list of URLs to load content from.
Example: web_paths=["https://example.com/page1", "https://example.com/page2"]

bs_kwargs (WebBaseLoader)
Description: Dictionary of BeautifulSoup options for filtering specific sections of HTML.
Example: bs_kwargs=dict(parse_only=bs4.SoupStrainer("div", class_="content"))

parse_only (used within bs_kwargs in WebBaseLoader)
Description: Filters HTML to parse only specific tags, classes, or IDs.
Example: parse_only=SoupStrainer(["h1", "h2", "p"])

output_format (some loaders, like JSONLoader or CSVLoader)
Description: Specifies the format of the output (e.g., dict, json).
Example: output_format="json"

table_name (SQLDatabaseLoader, AirtableLoader)
Description: The name of the table to query or load data from.
Example: table_name="users"

credentials_path (GoogleSheetsLoader, GoogleDriveLoader)
Description: Path to the credentials file for Google-based loaders.
Example: credentials_path="path/to/credentials.json"

Advanced Parameters for Specific Data Loaders

column_names (CSVLoader, GoogleSheetsLoader)
Description: Specifies which columns to load from the CSV or Google Sheets document.
Example: column_names=["Title", "Summary"]

api_key (loaders like AirtableLoader, SlackLoader, and OpenAI-based loaders)
Description: API key for authentication with third-party services.
Example: api_key="your_api_key_here"

file_id (GoogleDriveLoader, YouTubeLoader)
Description: The ID of the file to load from Google Drive or YouTube.
Example: file_id="your_file_id"

database_id (NotionLoader, AirtableLoader)
Description: ID of the database to load content from.
Example: database_id="notion_database_id"

section_filter (WikipediaLoader, ArxivLoader)
Description: Filters specific sections within a document (e.g., "Abstract," "Introduction").
Example: section_filter=["Introduction", "Conclusion"]

keywords (SlackLoader, RSSFeedLoader)
Description: Filters messages or articles based on specific keywords.
Example: keywords=["update", "meeting"]

chunk_size (text splitters such as CharacterTextSplitter, used after loading)
Description: Specifies the number of characters or tokens per chunk when splitting text.
Example: chunk_size=500

content_filter (WikipediaLoader, WebBaseLoader)
Description: Filters content based on keywords or topics.
Example: content_filter="AI"

load_comments (RedditLoader, DiscourseLoader)
Description: If True, loads comments or responses in addition to main content.
Example: load_comments=True

time_range (RSSFeedLoader, TwitterLoader)
Description: Specifies a time range to filter content by date.
Example: time_range=("2023-01-01", "2023-12-31")

query (WikipediaLoader, ArxivLoader, etc.)
Description: Specifies the topic or search term for documents.
Example: Loading documents on "Machine Learning."

from langchain.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="Machine Learning", load_max_docs=1).load()
print(docs[0].page_content)

Output:

"Machine learning is a field of computer science focused on enabling computers to learn from data..."

load_max_docs (WikipediaLoader, ArxivLoader, etc.)
Description: Limits the number of documents loaded.
Example: Loading 2 documents on "Artificial Intelligence."

docs = WikipediaLoader(query="Artificial Intelligence", load_max_docs=2).load()
print(len(docs))

Output:

2  # Two documents loaded.

include_summaries (WikipediaLoader)

Description: Loads only summary sections if True.
Example: Loading summaries of "Deep Learning."

docs = WikipediaLoader(query="Deep Learning", load_max_docs=1, include_summaries=True).load()
print(docs[0].page_content)

Output:

"Deep learning is a subset of machine learning that uses neural networks with many layers..."

include_subsections (WikipediaLoader)
Description: Loads subsections within the main document.
Example: Loading document on "Neural Networks" with subsections.

docs = WikipediaLoader(query="Neural Networks", load_max_docs=1, include_subsections=True).load()
print(docs[0].page_content)

Output:

"Neural Networks: ...\nApplications: ...\nTypes of Networks: ..."

web_paths (WebBaseLoader)
Description: Specifies URLs to load data from.
Example: Loading content from a webpage.

from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_paths=["https://example.com"])
docs = loader.load()
print(docs[0].page_content)

Output:

"Welcome to Example.com! Here you'll find articles and resources on various topics..."

bs_kwargs (WebBaseLoader)
Description: BeautifulSoup options for filtering HTML content.
Example: Extracting only paragraph (p) tags.

import bs4

loader = WebBaseLoader(
    web_paths=["https://example.com"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer("p"))
)
docs = loader.load()
print(docs[0].page_content)

Output:

"This is the main content of the webpage in paragraph form."

parse_only (within bs_kwargs)
Description: Filters HTML by specific tags or classes.
Example: Loading content from headers and paragraphs only.

loader = WebBaseLoader(
    web_paths=["https://example.com"],
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(["h1", "p"]))
)
docs = loader.load()
print(docs[0].page_content)

Output:

"Example.com - Your source for news.\nToday's main article discusses..."

output_format (JSONLoader, CSVLoader, etc.)
Description: Specifies output format, like dict or json.
Example: Loading a JSON document.

from langchain.document_loaders import JSONLoader

loader = JSONLoader("path/to/data.json", output_format="dict")
docs = loader.load()

print(docs)

Output:


{'title': 'Introduction to JSON', 'content': 'JSON is a lightweight data format...'}

table_name (SQLDatabaseLoader, AirtableLoader)
Description: The name of the table to load data from.
Example: Loading data from a "users" table in SQL.

from langchain.document_loaders import SQLDatabaseLoader
from sqlalchemy import create_engine

engine = create_engine("sqlite:///mydatabase.db")
loader = SQLDatabaseLoader(engine, table_name="users")
docs = loader.load()
print(docs[0].page_content)

Output:

{'id': 1, 'name': 'John Doe', 'email': 'john@example.com'}

credentials_path (GoogleSheetsLoader, GoogleDriveLoader)
Description: Path to the credentials file for Google services.
Example: Loading data from Google Sheets.

from langchain.document_loaders import GoogleSheetsLoader

loader = GoogleSheetsLoader(spreadsheet_id="your_spreadsheet_id", credentials_path="path/to/credentials.json")
docs = loader.load()
print(docs[0].page_content)

Output:

{'Name': 'Alice', 'Age': '30', 'Occupation': 'Engineer'}

column_names (CSVLoader, GoogleSheetsLoader)
Description: Specifies which columns to load.
Example: Loading only "Title" and "Summary" columns.

from langchain.document_loaders import CSVLoader

loader = CSVLoader("data.csv", column_names=["Title", "Summary"])
docs = loader.load()
print(docs[0].page_content)

Output:

{'Title': 'Introduction to AI', 'Summary': 'This article covers basics of AI...'}

api_key (AirtableLoader, SlackLoader)
Description: API key for authentication with services.
Example: Loading records from Airtable.

from langchain.document_loaders import AirtableLoader

loader = AirtableLoader(api_key="your_api_key", base_id="app123456", table_name="Projects")
docs = loader.load()
print(docs[0].page_content)

Output:

{'Project Name': 'AI Research', 'Status': 'In Progress'}

file_id (GoogleDriveLoader, YouTubeLoader)
Description: The ID of the file to load (e.g., Google Drive).
Example: Loading a file from Google Drive.

from langchain.document_loaders import GoogleDriveLoader

loader = GoogleDriveLoader(file_id="your_file_id", credentials_path="path/to/credentials.json")
docs = loader.load()
print(docs[0].page_content)

Output:

"This is the document content from Google Drive."
database_id (NotionLoader)
Description: ID of the database to load from (Notion).
Example: Loading records from a Notion database.

from langchain.document_loaders import NotionLoader

loader = NotionLoader(database_id="your_notion_database_id", integration_token="your_integration_token")
docs = loader.load()
print(docs[0].page_content)

Output:

{'Task': 'Complete Project', 'Status': 'In Progress'}

section_filter (WikipediaLoader, ArxivLoader)
Description: Filters specific sections like "Abstract."
Example: Loading only "Introduction" and "Conclusion."

docs = WikipediaLoader(query="Machine Learning", load_max_docs=1, section_filter=["Introduction", "Conclusion"]).load()
print(docs[0].page_content)

Output:

"Introduction: Machine learning is... Conclusion: In summary, machine learning..."

keywords (SlackLoader)
Description: Filters messages based on keywords.
Example: Loading messages with "meeting."

from langchain.document_loaders import SlackLoader

loader = SlackLoader(api_token="your_slack_token", channel_id="your_channel_id", keywords=["meeting"])
docs = loader.load()
print(docs[0].page_content)

Output:

"We have a team meeting scheduled for tomorrow."

chunk_size (TextLoader)
Description: Specifies chunk size for splitting text.
Example: Splitting text into 300-character chunks.

from langchain.text_splitter import CharacterTextSplitter

text = "This is a long document that needs to be split into smaller chunks..."
splitter = CharacterTextSplitter(chunk_size=300)
chunks = splitter.split_text(text)
print(chunks)

Output:

["This is a long document that needs...", "Here is another chunk of text..."]
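A common variant is RecursiveCharacterTextSplitter, which falls back through successively smaller separators and supports overlapping chunks; a minimal sketch:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_overlap repeats a little text between chunks to preserve context
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_text("This is a long document that needs to be split into smaller chunks...")
print(chunks)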

Extract YouTube data

Load Only the Video Transcript
Description: Loads only the video transcript.
Example: Loading the transcript of a YouTube video.

from langchain.document_loaders import YouTubeLoader

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True)
docs = loader.load()
for doc in docs:
    print("Transcript Content:", doc.page_content)

Output:

"In this video, we’ll explore the basics of machine learning, starting with supervised learning..."

Load Video Transcript and Comments
Description: Loads both transcript and comments if load_comments=True.
Example: Loading transcript and top comments.

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True, load_comments=True)
docs = loader.load()
for doc in docs:
    print("Transcript:", doc.page_content)
    for comment in doc.metadata.get("comments", []):
        print("Comment:", comment)

Output:

Transcript: "In this video, we’ll explore the basics of machine learning..."
Comment: "Great explanation on supervised learning!"
Comment: "Could you make a video on reinforcement learning next?"

Load Transcript with Specific Keyword Filter
Description: Loads only parts of the transcript containing certain keywords.
Example: Filtering transcript for sections mentioning "AI."

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True, content_filter="AI")
docs = loader.load()
for doc in docs:
    print("AI Mentions in Transcript:", doc.page_content)

Output:

"AI is transforming industries by enabling machines to make decisions autonomously..."

Load Video Metadata Only
Description: Loads metadata such as title, description, views, and likes.
Example: Loading only metadata without transcript or comments.

loader = YouTubeLoader(video_id="your_video_id", load_transcript=False, load_comments=False, load_metadata=True)
docs = loader.load()
for doc in docs:
    print("Video Title:", doc.metadata["title"])
    print("Description:", doc.metadata["description"])
    print("Views:", doc.metadata["views"])
    print("Likes:", doc.metadata["likes"])

Output:

Video Title: "Introduction to Machine Learning"
Description: "This video covers the basics of machine learning."
Views: 15230
Likes: 1345

Load Comments within a Specific Time Range
Description: Filters comments by a specific time range.
Example: Loading comments posted between June and August 2023.

loader = YouTubeLoader(video_id="your_video_id", load_comments=True, time_range=("2023-06-01", "2023-08-31"))
docs = loader.load()
for doc in docs:
    print("Transcript:", doc.page_content)
    for comment in doc.metadata.get("comments", []):
        print("Comment:", comment)

Output:

Comment: "Fantastic explanation, learned a lot from this video!"  # Comment posted in the specified time range.

Load Transcript in a Specific Language
Description: Loads the transcript in a specified language.
Example: Loading the transcript in French (if available).

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True, language="fr")
docs = loader.load()
for doc in docs:
    print("French Transcript:", doc.page_content)

Output:

"Dans cette vidéo, nous explorons les bases de l'apprentissage automatique..."

Load Only Transcript Sections with Specific Phrases
Description: Filters transcript to include only specific phrases or sections.
Example: Extracting sections mentioning "neural networks."

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True, content_filter="neural networks")
docs = loader.load()
for doc in docs:
    print("Neural Network Mentions:", doc.page_content)

Output:

"Neural networks are a type of machine learning model that mimics the human brain..."

Load Transcript and Like/Dislike Counts in Metadata
Description: Loads transcript along with like and dislike counts from metadata.
Example: Loading transcript with metadata fields.

loader = YouTubeLoader(video_id="your_video_id", load_transcript=True, load_metadata=True)
docs = loader.load()
for doc in docs:
    print("Transcript:", doc.page_content)
    print("Likes:", doc.metadata["likes"])
    print("Dislikes:", doc.metadata["dislikes"])

Output:

Transcript: "In this video, we explore various types of AI models..."
Likes: 1500
Dislikes: 30

Load Top-Level Comments Only
Description: Loads only top-level comments (no replies).
Example: Loading only top-level comments without transcript.

loader = YouTubeLoader(video_id="your_video_id", load_transcript=False, load_comments=True)
docs = loader.load()
for doc in docs:
    for comment in doc.metadata.get("comments", []):
        print("Top-Level Comment:", comment)

Output:

Top-Level Comment: "Great video, very informative!"

Load Only Replies to a Specific Comment
Description: Loads replies to a specific comment if load_comments=True.
Example: Loading replies to a specific comment on the video.

loader = YouTubeLoader(video_id="your_video_id", load_transcript=False, load_comments=True)
docs = loader.load()
for doc in docs:
    for comment in doc.metadata.get("comments", []):
        print("Comment:", comment["text"])
        for reply in comment.get("replies", []):
            print("Reply:", reply)

Output:

Comment: "Could you explain reinforcement learning in detail?"
Reply: "Sure! Reinforcement learning is a type of machine learning where agents learn by interacting with an environment."
Summary
video_id: Unique ID of the YouTube video.
load_transcript: Loads the video transcript.
load_comments: Loads comments, including replies.
time_range: Filters comments by date.
language: Specifies the transcript language.
content_filter: Filters for specific keywords in the transcript.
load_metadata: Loads metadata fields (e.g., title, views, likes).

SUMMARY

TextLoader('speech.txt').load() || CSVLoader || PyMuPDFLoader
JSONLoader("example.json", jq_schema=".[].text").load()
for doc in documents: ==> print(doc.page_content)
UnstructuredURLLoader(urls=urls).load()
UnstructuredHTMLLoader("path/to/your_file.html")
UnstructuredWordDocumentLoader("path/to/your_file.docx")
UnstructuredEmailLoader("path/to/your_file.eml")
S3Loader(bucket_name="your-bucket-name", key="path/to/your_file.txt")
GoogleSheetsLoader(spreadsheet_id="your-spreadsheet-id", credentials_path="path/to/credentials.json")
Image.open("path/to/image.png")
SQLDatabaseLoader(engine, table_name="your_table_name")
RESTAPILoader(url=url, headers=headers)
GoogleDriveLoader(file_id="your_file_id", credentials_path="path/to/credentials.json")
MongoDBLoader(connection_string="mongodb://localhost:27017", database="your_database", collection="your_collection")
DynamoDBLoader(table_name="your_table_name", aws_access_key="your_key", aws_secret_key="your_secret") || GCSLoader
SnowflakeLoader || ElasticsearchLoader || RSSFeedLoader || WebBaseLoader || WikipediaLoader || YouTubeLoader || SlackLoader
docs_arxiv = ArxivLoader(query="1706.03762", load_max_docs=2).load()
alldoc = docs_arxiv + docs_web + docs_wikipedia
for i, doc in enumerate(alldoc): ==> print(f"Document {i+1} Content:", doc.page_content)

Extract specific parts of data from data sources
docs_arxiv = ArxivLoader(query="1706.03762", load_max_docs=2).load()

docs_arxiv = [
    {
        "metadata": {
            "title": "A Comprehensive Study on AI",
            "authors": ["Author 1", "Author 2"],
            "published": "2024-11-01"
        },
        "content": "Detailed content of the paper about AI..."
    },
    {
        "metadata": {
            "title": "Introduction to Machine Learning",
            "authors": ["Author 3", "Author 4"],
            "published": "2023-06-15"
        },
        "content": "Detailed content about machine learning concepts..."
    }
]
titles = [doc.metadata['title'] for doc in docs_arxiv]
[doc.page_content for doc in docs_wikipedia if 'summary' in doc.metadata.get('section', '').lower()]
docs_gsheet = [
    {
        "page_content": {"A": "This is the content for A", "B": "Content for B"}
    },
    {
        "page_content": {"A": "Another content for A", "C": "Content for C"}
    },
    {
        "page_content": {"B": "Only content for B"}
    }
]
[doc.page_content['A'] for doc in docs_gsheet if 'A' in doc.page_content]
docs_youtube = [
    {
        "page_content": "Introduction to Artificial Intelligence.\nAI is revolutionizing industries.\nMachine Learning is a part of AI."
    },
    {
        "page_content": "Deep Learning is a subset of AI.\nAI in healthcare.\nAI ethics and privacy concerns."
    },
    {
        "page_content": "This document does not mention the keyword."
    }
]
important_sections = [line for doc in docs_youtube for line in doc.page_content.split('\n') if "AI" in line]

output

[
    "Introduction to Artificial Intelligence.",
    "AI is revolutionizing industries.",
    "Machine Learning is a part of AI.",
    "Deep Learning is a subset of AI.",
    "AI in healthcare.",
    "AI ethics and privacy concerns."
]
docs_github = [
    {
        'metadata': {'file_path': 'docs/intro.md'},
        'page_content': 'This is the content of the markdown file.'
    },
    {
        'metadata': {'file_path': 'docs/guide.txt'},
        'page_content': 'This is a text file.'
    }
]
markdown_files = [doc.page_content for doc in docs_github if doc.metadata['file_path'].endswith('.md')]
docs_pdf = [
    {
        'page_content': 'This is a title:\nThis is a description.'
    },
    {
        'page_content': 'Another heading:\nThis is more content.'
    }
]
headings = [line for doc in docs_pdf for line in doc.page_content.split('\n') if line.strip().endswith(':')]

output

[
    'This is a title:',
    'Another heading:'
]

Extract website parts using the BeautifulSoup library:
Main content by class name || article titles and content by multiple classes || header and footer sections
Links in the navigation bar || image URLs || text from multiple pages in a loop (h1, p) || specific attributes || dropdowns
