Debug School

rakesh kumar

Automating tasks by processing text using NLP pipeline

How NLP automates tasks
Applications of NLP Pipelines in Processing
NLP pipeline commands
Programming features of NLP
Using LangChain with an NLP pipeline to build dynamic and context-aware text processing pipelines
How to implement NLP using Flask and test with Postman
Programming questions

How NLP automates tasks

An NLP (Natural Language Processing) pipeline is a sequence of steps or components used to process, analyze, and extract insights from textual data. It is highly beneficial for processing because it automates the complex task of understanding and deriving meaning from text. Here’s how an NLP pipeline helps:

  1. Automated Text Processing
  - An NLP pipeline automates repetitive and time-consuming tasks such as tokenization, lemmatization, and parsing.
  - It enables machines to process massive amounts of textual data quickly and consistently.
  - Example: Splitting a legal document into sentences and extracting key clauses.

  2. Structured Data from Unstructured Text
  - Text data is inherently unstructured. NLP pipelines convert this unstructured text into structured representations like tokens, named entities, and syntax trees.
  - Example: Extracting entities like people, locations, and monetary values from news articles to create a structured database.

  3. Scalability
  - NLP pipelines allow the processing of large-scale data with minimal human intervention.
  - They are especially useful in industries like finance and healthcare, where large volumes of data must be processed daily.
  - Example: Analyzing social media data to track sentiment around a product launch in real time.

  4. Accuracy and Consistency
  - NLP pipelines ensure consistent application of linguistic rules and machine learning models.
  - They reduce human error in tasks like text classification and sentiment analysis.
  - Example: Using a pipeline for resume screening ensures fair and consistent evaluation of candidates.

  5. Customization for Specific Tasks
  - Components of NLP pipelines can be customized for domain-specific requirements.
  - Example: In healthcare, pipelines can be trained to recognize medical terms and diseases.

  6. Real-Time Processing
  - NLP pipelines enable real-time text processing for applications like chatbots and recommendation systems.
  - Example: Detecting intent and sentiment in customer queries to provide immediate support in a chatbot.

  7. Integration with Machine Learning
  - NLP pipelines preprocess data for machine learning models by cleaning, tokenizing, and vectorizing text.
  - This preprocessing ensures better model performance and interpretability.
  - Example: Preparing textual data for a predictive model in fraud detection.

  8. Language Understanding
  - Pipelines leverage tools like Named Entity Recognition (NER), dependency parsing, and sentiment analysis to extract meaning from text.
  - Example: Understanding customer reviews by extracting entities like product names and analyzing associated sentiments.

  9. Multi-Language Support
  - NLP pipelines can be adapted for multiple languages using language-specific models.
  - This is particularly useful for global businesses handling multilingual customer interactions.
  - Example: Translating and analyzing tweets in various languages for market research.

  10. Workflow Modularity
  - NLP pipelines allow modular workflows, where each step (tokenization, NER, sentiment analysis) can be independently replaced or improved.
  - Example: Replacing a sentiment analysis model with a more advanced transformer-based model in an existing pipeline. A minimal end-to-end sketch follows this list.
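To make these steps concrete, here is a minimal sketch of an end-to-end spaCy pipeline run (assuming the en_core_web_sm model is installed); it tokenizes, lemmatizes, and extracts named entities from a short text:

import spacy

# Load a pretrained pipeline (tokenizer, tagger, parser, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple paid $315 million to acquire a startup in San Francisco.")

# Tokenization, lemmatization and POS tags
print([(token.text, token.lemma_, token.pos_) for token in doc])

# Structured output: named entities extracted from unstructured text
print([(ent.text, ent.label_) for ent in doc.ents])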

Applications of NLP Pipelines in Processing

Healthcare:
- Automating medical record analysis to identify patient conditions.
- Extracting insights from clinical trial reports.

Finance:
- Processing financial news to track trends and market sentiment.
- Analyzing earnings call transcripts for decision-making.

Customer Support:
- Categorizing support tickets by topic (a small sketch of this use case follows this list).
- Detecting customer sentiment to prioritize urgent issues.

Social Media Analysis:
- Identifying trends through hashtags and mentions.
- Analyzing brand sentiment from tweets and posts.

Recruitment:
- Parsing resumes to extract skills and qualifications.
- Matching job descriptions with candidate profiles.

E-commerce:
- Analyzing customer reviews for product insights.
- Categorizing products using keywords and descriptions.
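As a small illustration of the customer-support use case above, here is a hedged sketch that routes tickets to topics with simple keyword rules (the topic lists are made up for the example):

import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical topic keywords for routing support tickets
TOPICS = {
    "billing": {"invoice", "refund", "charge"},
    "shipping": {"delivery", "shipping", "tracking"},
}

def categorize_ticket(text: str) -> str:
    doc = nlp(text.lower())
    lemmas = {token.lemma_ for token in doc}
    for topic, keywords in TOPICS.items():
        if lemmas & keywords:
            return topic
    return "general"

print(categorize_ticket("My invoice shows a double charge."))              # billing
print(categorize_ticket("Where is the tracking number for my delivery?"))  # shipping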


NLP pipeline commands

tokenizer
Purpose: Tokenizes text into individual tokens (words, punctuation, etc.).
Code:

from spacy.tokenizer import Tokenizer
# Replace the built-in tokenizer with a custom Tokenizer (add your own rules and prefix/suffix patterns as needed)
nlp.tokenizer = Tokenizer(nlp.vocab)

This is typically built-in but can be customized as needed.
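For instance, here is a small sketch of one common customization: adding a special-case rule so a chosen token (the word "gimme" here is an arbitrary example) is split the way you want:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Special-case rule: always split "gimme" into "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

print([t.text for t in nlp("gimme that report")])
# ['gim', 'me', 'that', 'report']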
lemmatizer
Purpose: Converts words to their base or dictionary forms (e.g., "running" → "run").
Code:

nlp.add_pipe("lemmatizer", config={"mode": "rule"})

import spacy

# Load the spaCy model (en_core_web_sm already includes a lemmatizer)
nlp = spacy.load("en_core_web_sm")

# Add a lemmatizer only if the pipeline does not have one yet
if "lemmatizer" not in nlp.pipe_names:
    nlp.add_pipe("lemmatizer", last=True)

# Create a document for lemmatization
doc = nlp("running runners better dogs")

for token in doc:
    print(f"{token.text} -> {token.lemma_}")

# Output:
# running -> run
# runners -> runner
# better -> good
# dogs -> dog

ner (Named Entity Recognizer)
Purpose: Detects named entities in text (e.g., people, organizations, locations).
Code:

nlp.add_pipe("ner", last=True)
Enter fullscreen mode Exit fullscreen mode
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add NER to the pipeline
nlp.add_pipe("ner", last=True)

# Create a document for entity recognition
doc = nlp("Barack Obama was born in Hawaii.")

for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Output:
# Barack Obama (PERSON)
# Hawaii (GPE)
Enter fullscreen mode Exit fullscreen mode

tagger
Purpose: Assigns Part-of-Speech (POS) tags to tokens (e.g., noun, verb).
Code:

nlp.add_pipe("tagger", last=True)
Enter fullscreen mode Exit fullscreen mode
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add POS tagger to the pipeline
nlp.add_pipe("tagger", last=True)

# Create a document for POS tagging
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    print(f"{token.text} -> {token.pos_}")

# Output:
# The -> DET
# quick -> ADJ
# brown -> ADJ
# fox -> NOUN
# jumps -> VERB
# over -> ADP
# the -> DET
# lazy -> ADJ
# dog -> NOUN
Enter fullscreen mode Exit fullscreen mode

parser
Purpose: Analyzes the syntactic structure of sentences (dependency parsing).
Code:

nlp.add_pipe("parser", last=True)
Enter fullscreen mode Exit fullscreen mode
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the parser to the pipeline
nlp.add_pipe("parser", last=True)

# Create a document for syntactic parsing
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    print(f"{token.text} -> {token.dep_} (Head: {token.head.text})")

# Output (showing some):
# The -> det (Head: fox)
# quick -> amod (Head: fox)
# brown -> amod (Head: fox)
# fox -> nsubj (Head: jumps)
# jumps -> ROOT (Head: jumps)
# over -> prep (Head: jumps)
# the -> det (Head: dog)
# lazy -> amod (Head: dog)
# dog -> pobj (Head: over)
Enter fullscreen mode Exit fullscreen mode

textcat (Text Categorizer)
Purpose: Classifies text into categories (e.g., spam detection, sentiment analysis).
Code:

textcat = nlp.add_pipe("textcat", last=True)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add text categorizer to the pipeline
textcat = nlp.add_pipe("textcat", last=True)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# The textcat component must be trained before it can score text,
# so here we just print its labels for demonstration purposes
print("Labels: ", textcat.labels)

# Output:
# Labels:  ('POSITIVE', 'NEGATIVE')
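Since the textcat component only produces useful scores after training, here is a minimal, hedged training sketch on a tiny made-up dataset (two example sentences), just to show the spaCy 3 training loop:

import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Tiny made-up training set, for illustration only
train_data = [
    ("I love this product", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This is terrible", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

optimizer = nlp.initialize(get_examples=lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

print(nlp("I love it").cats)  # e.g. {'POSITIVE': 0.9..., 'NEGATIVE': 0.0...}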

entity_ruler
Purpose: Adds custom patterns for entity recognition.
Code:

ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [{"label": "ORG", "pattern": "OpenAI"}]
ruler.add_patterns(patterns)

import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the entity ruler before the statistical NER so its patterns take precedence
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Add custom patterns to the entity ruler
patterns = [{"label": "ORG", "pattern": "OpenAI"}]
ruler.add_patterns(patterns)

# Test the entity recognition
doc = nlp("I work at OpenAI")

for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Output:
# OpenAI (ORG)

matcher
Purpose: Finds patterns in text using token-based matching rules.
Code:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

import spacy
from spacy.matcher import Matcher

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# The Matcher is used directly on a Doc; it is not added as a pipeline component
matcher = Matcher(nlp.vocab)

# Define a pattern for the matcher
pattern = [{"LOWER": "quick"}]
matcher.add("QuickWord", [pattern])

# Apply the matcher to a document
doc = nlp("The quick brown fox jumped.")

matches = matcher(doc)
for match_id, start, end in matches:
    print(f"Match: {doc[start:end]}")

# Output:
# Match: quick

similarity
Purpose: Computes semantic similarity between texts or tokens.
Code:

# spaCy has no built-in "similarity" pipe; Doc.similarity() is used directly
doc1 = nlp("I love programming")
doc2 = nlp("Programming is fun")
print(doc1.similarity(doc2))

import spacy

# Load a model that ships word vectors for meaningful similarity scores
# (en_core_web_sm has no vectors and only approximates similarity)
nlp = spacy.load("en_core_web_md")

text1 = nlp("I love programming")
text2 = nlp("Programming is fun")
print(f"Similarity: {text1.similarity(text2)}")

# Output (approximate, model-dependent):
# Similarity: 0.877

sentiment_analyzer
Purpose: Analyzes sentiment of sentences or tokens.
Code:

# spaCy has no built-in sentiment component; register a custom one and add it by name
nlp.add_pipe("sentiment_analyzer", last=True)

import spacy
from spacy.language import Language

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Custom sentiment analysis component (toy rule: looks for the word "love")
@Language.component("sentiment_analyzer")
def sentiment_analyzer(doc):
    sentiment = "Positive" if doc.text.lower().count("love") > 0 else "Negative"
    print(f"Sentiment: {sentiment}")
    return doc

# Add the custom sentiment analyzer to the pipeline
nlp.add_pipe("sentiment_analyzer", last=True)

# Apply the sentiment analyzer
doc = nlp("I love this movie!")
# Output:
# Sentiment: Positive

Combining Pipes
If you want to combine multiple features into the pipeline, ensure the components are added in the correct order. For instance:

nlp.add_pipe("sentencizer", first=True)
nlp.add_pipe("ner", last=True)
nlp.add_pipe("textcat", last=True)
Enter fullscreen mode Exit fullscreen mode
import spacy
from spacy.pipeline import Sentencizer
from spacy.pipeline import TextCategorizer
from spacy.pipeline import EntityRecognizer

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the 'sentencizer' first to segment text into sentences
nlp.add_pipe("sentencizer", first=True)

# Add the 'ner' for named entity recognition after the sentencizer
nlp.add_pipe("ner", last=True)

# Add 'textcat' for text categorization last in the pipeline
textcat = nlp.add_pipe("textcat", last=True)

# Adding labels to the text categorizer (for example, sentiment classification)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Test input text
doc = nlp("Barack Obama was born in Hawaii. I feel great about this!")

# Check the sentences in the text
for sent in doc.sents:
    print(f"Sentence: {sent.text}")

# Check the named entities detected
for ent in doc.ents:
    print(f"Entity: {ent.text} ({ent.label_})")

# For text categorization, simulate a dummy result (you'd typically use a trained model)
# This part simulates the classification process
for label in textcat.predict([doc]):
    print(f"Predicted Category: {label}")
Enter fullscreen mode Exit fullscreen mode
Sentence: Barack Obama was born in Hawaii.
Sentence: I feel great about this!
Entity: Barack Obama (PERSON)
Entity: Hawaii (GPE)
Predicted Category: POSITIVE
Enter fullscreen mode Exit fullscreen mode


This modular design allows flexible addition and removal of NLP components based on your use case.
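For example, you can inspect and modify the pipeline at runtime; a small sketch using the standard spaCy pipeline API:

import spacy

nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Temporarily disable components you do not need for a faster run
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp("Just tagging and lemmatizing here.")

# Permanently remove a component, or re-add one sourced from another pipeline
nlp.remove_pipe("ner")
nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"), last=True)
print(nlp.pipe_names)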

custom_component
Purpose: Add a user-defined component for custom NLP processing.
Code:

from spacy.language import Language

@Language.component("custom_component")
def custom_component(doc):
    # Example: Print all tokens in the document
    print([token.text for token in doc])
    return doc

nlp.add_pipe("custom_component", last=True)
  1. spancat (Span Categorizer) (introduced in spaCy 3.x) Purpose: Assigns labels to text spans. Code:
spancat = nlp.add_pipe("spancat", last=True)
spancat.add_label("PRODUCT")
spancat.add_label("EVENT")

  2. morphologizer Purpose: Adds morphological analysis to tokens, such as gender, case, tense, etc. Code:
nlp.add_pipe("morphologizer", last=True)

  3. dependency_parser Purpose: Performs dependency parsing to understand syntactic relationships between words in a sentence. Code:
nlp.add_pipe("parser", last=True)

  4. transformer (for transformer models; requires the spacy-transformers package) Purpose: Adds a transformer-based model (e.g., BERT, RoBERTa) to the pipeline. Code:
nlp.add_pipe("transformer", first=True)

  5. coreferee (Coreference Resolution) Purpose: Resolves references (e.g., linking "he" to "John"). Code:
import coreferee
nlp.add_pipe("coreferee", last=True)

  6. text_cleaner Purpose: Cleans text data by removing special characters, emojis, or unnecessary whitespace. Code:
from spacy.language import Language
from spacy.tokens import Doc

Doc.set_extension("cleaned_text", default=None)

@Language.component("text_cleaner")
def text_cleaner(doc):
    doc._.cleaned_text = " ".join(token.text for token in doc if token.is_alpha)
    return doc

nlp.add_pipe("text_cleaner", last=True)

  7. stopword_filter Purpose: Filters out stop words from the text. Code:
Doc.set_extension("filtered_tokens", default=[])

@Language.component("stopword_filter")
def stopword_filter(doc):
    # A component must return the Doc, so store the filtered tokens in an extension
    doc._.filtered_tokens = [token.text for token in doc if not token.is_stop]
    return doc

nlp.add_pipe("stopword_filter", last=True)

  8. text_rank (Keyphrase Extraction) (requires the pytextrank extension) Purpose: Extracts key phrases from the text using the TextRank algorithm. Code:
import pytextrank
nlp.add_pipe("textrank", last=True)

  9. word_embeddings Purpose: Generates word embeddings for tokens or documents. Code:
Doc.set_extension("embeddings", default=[])

@Language.component("word_embeddings")
def word_embeddings(doc):
    doc._.embeddings = [token.vector for token in doc]
    return doc

nlp.add_pipe("word_embeddings", last=True)

  10. matcher_ruler (PhraseMatcher) Purpose: Adds rule-based pattern matching for specific text sequences. The PhraseMatcher is used directly on a Doc rather than added as a pipe. Code:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
matcher.add("PRODUCT_NAMES", [nlp.make_doc("OpenAI GPT")])
matches = matcher(nlp("We evaluated OpenAI GPT last week."))

  11. entity_linker Purpose: Links recognized entities to external knowledge bases (e.g., Wikipedia, Wikidata); it needs a knowledge base and training before use. Code:
nlp.add_pipe("entity_linker", last=True)

  12. blacklist_filter Purpose: Removes specific unwanted tokens based on a predefined list. Code:
blacklist = {"badword1", "badword2"}

Doc.set_extension("filtered_text", default=None)

@Language.component("blacklist_filter")
def blacklist_filter(doc):
    tokens = [token.text for token in doc if token.text.lower() not in blacklist]
    doc._.filtered_text = " ".join(tokens)
    return doc

nlp.add_pipe("blacklist_filter", last=True)

  13. sentiment (Advanced Sentiment Analysis) (using third-party extensions) Purpose: Uses more advanced sentiment analysis models (e.g., VADER, BERT); for example, the spacytextblob extension registers a ready-made component. Code:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe("spacytextblob", last=True)

  14. doc2vec (Document Embeddings) Purpose: Creates embeddings for entire documents (using spaCy's built-in doc.vector here; libraries like Gensim offer true Doc2Vec). Code:
Doc.set_extension("embedding", default=None)

@Language.component("doc2vec")
def doc2vec(doc):
    doc._.embedding = doc.vector
    return doc

nlp.add_pipe("doc2vec", last=True)
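Several of the custom entries above could also take configuration (thresholds, word lists). For that, spaCy 3 offers @Language.factory; a minimal hedged sketch (the component name and settings are made up for illustration):

import spacy
from spacy.language import Language

@Language.factory("length_flagger", default_config={"max_length": 10})
def create_length_flagger(nlp, name, max_length):
    def length_flagger(doc):
        # Print a warning for documents longer than the configured token count
        if len(doc) > max_length:
            print(f"Document has {len(doc)} tokens (limit {max_length})")
        return doc
    return length_flagger

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("length_flagger", config={"max_length": 5})

nlp("This sentence is long enough to trigger the warning.")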

Programming features of NLP

Custom Sentence Tokenizer with Named Entity Highlights
Description: Tokenizes the text into sentences and highlights any named entities within each sentence.

import spacy
from spacy.language import Language

# Load a SpaCy model
nlp = spacy.load("en_core_web_sm")

Custom pipeline to tokenize sentences and highlight entities

@Language.component("sentence_entity_highlighter")
def sentence_entity_highlighter(doc):
    for sent in doc.sents:
        print(f"Sentence: {sent.text}")
        for ent in sent.ents:
            print(f" - Entity: {ent.text} ({ent.label_})")
    return doc

# Add the custom component to the pipeline
nlp.add_pipe("sentencizer", first=True)  # Ensure sentences are split first
nlp.add_pipe("sentence_entity_highlighter", last=True)

# Test the pipeline
text = "Barack Obama was the president of the United States. He lives in Washington, D.C."
doc = nlp(text)

output

Sentence: Barack Obama was the president of the United States.
 - Entity: Barack Obama (PERSON)
 - Entity: the United States (GPE)
Sentence: He lives in Washington, D.C.
 - Entity: Washington, D.C. (GPE)

Detecting Long Sentences
Description: Flags sentences that exceed a certain word count for readability analysis.

@Language.component("long_sentence_detector")
def long_sentence_detector(doc):
    max_length = 10  # Customize the length threshold
    for sent in doc.sents:
        if len(sent) > max_length:
            print(f"Long sentence: {sent.text} ({len(sent)} words)")
    return doc

# Add the component to the pipeline (the sentencizer was already added above)
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer", first=True)
nlp.add_pipe("long_sentence_detector", last=True)

# Test the pipeline
text = "This is a short sentence. However, this sentence is much longer and may be flagged as too long."
doc = nlp(text)

output

Long sentence: However, this sentence is much longer and may be flagged as too long. (14 words)

Custom Lemmatizer for Specific Words
Description: Overrides the default lemmatizer for specific words (e.g., for domain-specific jargon).

@Language.component("custom_lemmatizer")
def custom_lemmatizer(doc):
    custom_lemmas = {"better": "good", "worse": "bad"}  # Define custom mappings
    for token in doc:
        if token.text in custom_lemmas:
            token.lemma_ = custom_lemmas[token.text]
            print(f"Custom lemma for '{token.text}': {token.lemma_}")
    return doc

# Add the custom lemmatizer to the pipeline
nlp.add_pipe("custom_lemmatizer", last=True)

# Test the pipeline
text = "This product is better than the other, but that one is worse."
doc = nlp(text)
for token in doc:
    print(f"{token.text} -> {token.lemma_}")

output

Custom lemma for 'better': good
Custom lemma for 'worse': bad
This -> this
product -> product
is -> be
better -> good
than -> than
the -> the
other -> other
, -> ,
but -> but
that -> that
one -> one
is -> be
worse -> bad
. -> .

Keyword Extractor
Description: Extracts predefined keywords from the text and counts their occurrences.

@Language.component("keyword_extractor")
def keyword_extractor(doc):
    keywords = {"AI", "machine learning", "data", "neural networks"}  # Define keywords
    # Case-insensitive substring counts so multi-word keywords are counted too
    text_lower = doc.text.lower()
    keyword_count = {keyword: text_lower.count(keyword.lower()) for keyword in keywords}
    print("Keyword counts:", keyword_count)
    return doc

# Add the keyword extractor to the pipeline
nlp.add_pipe("keyword_extractor", last=True)

# Test the pipeline
text = "AI and machine learning are subsets of data science. Neural networks are used in AI."
doc = nlp(text)

output

Keyword counts: {'AI': 2, 'machine learning': 1, 'data': 1, 'neural networks': 1}

Blacklist Detector
Description: Detects and flags blacklisted words or phrases in the text.

@Language.component("blacklist_detector")
def blacklist_detector(doc):
    blacklist = {"spam", "fake", "scam"}  # Define blacklist
    flagged = [token.text for token in doc if token.text.lower() in blacklist]
    if flagged:
        print(f"Blacklisted words detected: {', '.join(flagged)}")
    return doc

# Add the blacklist detector to the pipeline
nlp.add_pipe("blacklist_detector", last=True)

# Test the pipeline
text = "This email contains spam and might be a scam."
doc = nlp(text)

output

Blacklisted words detected: spam, scam

Summary of Unique Features:
Sentence Tokenizer with Entity Highlights: Combines sentence tokenization and NER analysis.
Long Sentence Detector: Flags sentences exceeding a specific length for readability.
Custom Lemmatizer: Replaces default lemmatization for domain-specific needs.
Keyword Extractor: Identifies and counts occurrences of predefined keywords.
Blacklist Detector: Flags blacklisted words or phrases in the text.

Language Detector
Description: Detects the language of the input text and adds it as an attribute to the Doc object.

import spacy
from spacy.language import Language
from spacy_language_detection import LanguageDetector

# Load a SpaCy model
nlp = spacy.load("en_core_web_sm")

# Factory function that creates the language detector
def get_lang_detector(nlp, name):
    return LanguageDetector()

# Register the factory and add the language detector to the pipeline
Language.factory("language_detector", func=get_lang_detector)
nlp.add_pipe('language_detector', last=True)

# Test the pipeline
text = "Dies ist ein Text in deutscher Sprache."
doc = nlp(text)
print(f"Language detected: {doc._.language}")

Output:

Language detected: {'language': 'de', 'score': 0.9999958849532272}

Profanity Filter
Description: Identifies and censors profane words in the text.
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("profanity_filter")
def profanity_filter(doc):
    profane_words = {"badword1", "badword2"}
    tokens = []
    for token in doc:
        if token.text.lower() in profane_words:
            tokens.append("***")
        else:
            tokens.append(token.text)
    doc._.censored_text = " ".join(tokens)
    return doc

# Register the custom extension
from spacy.tokens import Doc
Doc.set_extension("censored_text", default=None)

# Add the profanity filter to the pipeline
nlp.add_pipe("profanity_filter", last=True)

# Test the pipeline
text = "This is a badword1 in the text."
doc = nlp(text)
print(f"Censored Text: {doc._.censored_text}")

Output:

Censored Text: This is a *** in the text.

URL and Email Extractor
Description: Extracts URLs and email addresses from the text.

import spacy
from spacy.language import Language
import re

nlp = spacy.load("en_core_web_sm")

@Language.component("url_email_extractor")
def url_email_extractor(doc):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    email_pattern = re.compile(r'\S+@\S+\.\S+')
    urls = re.findall(url_pattern, doc.text)
    emails = re.findall(email_pattern, doc.text)
    doc._.urls = urls
    doc._.emails = emails
    return doc

# Register custom extensions
from spacy.tokens import Doc
Doc.set_extension("urls", default=[])
Doc.set_extension("emails", default=[])

# Add the extractor to the pipeline
nlp.add_pipe("url_email_extractor", last=True)

# Test the pipeline
text = "Contact us at support@example.com or visit https://www.example.com for more info."
doc = nlp(text)
print(f"URLs: {doc._.urls}")
print(f"Emails: {doc._.emails}")

Output:

URLs: ['https://www.example.com']
Emails: ['support@example.com']

Part-of-Speech Statistics
Description: Calculates and prints statistics of POS tags in the text.

import spacy
from spacy.language import Language
from collections import Counter

nlp = spacy.load("en_core_web_sm")

@Language.component("pos_statistics")
def pos_statistics(doc):
    pos_counts = Counter(token.pos_ for token in doc)
    doc._.pos_counts = dict(pos_counts)
    print("POS Tag Counts:", doc._.pos_counts)
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("pos_counts", default={})

# Add the component to the pipeline
nlp.add_pipe("pos_statistics", last=True)

# Test the pipeline
text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

Output:

POS Tag Counts: {'DET': 2, 'ADJ': 2, 'NOUN': 2, 'VERB': 1, 'ADP': 1, 'PROPN': 1}


Noun Phrase Extractor
Description: Extracts noun phrases from the text and adds them as an attribute.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("noun_phrase_extractor")
def noun_phrase_extractor(doc):
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    doc._.noun_phrases = noun_phrases
    print("Noun Phrases:", doc._.noun_phrases)
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("noun_phrases", default=[])

# Add the component to the pipeline
nlp.add_pipe("noun_phrase_extractor", last=True)

# Test the pipeline
text = "Autonomous cars shift insurance liability toward manufacturers."
doc = nlp(text)

Output:

Noun Phrases: ['Autonomous cars', 'insurance liability', 'manufacturers']

Summary of Additional Features:
Language Detector: Detects the language of the text using a language detection component.
Profanity Filter: Identifies and censors profane words in the text.
URL and Email Extractor: Extracts URLs and email addresses from the text and stores them in custom attributes.
Part-of-Speech Statistics: Calculates statistics of POS tags and prints them.
Noun Phrase Extractor: Extracts noun phrases (noun chunks) from the text.

Resume Parser
Description: Extracts key sections (e.g., Name, Contact Information, Skills, and Education) from resumes.

import spacy
from spacy.language import Language
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

@Language.component("resume_parser")
def resume_parser(doc):
    # Extract email
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
    email = re.search(email_pattern, doc.text)

    # Extract phone number
    phone_pattern = re.compile(r'\b\d{10}\b')
    phone = re.search(phone_pattern, doc.text)

    # Extract skills based on a predefined list
    predefined_skills = {"Python", "Java", "SQL", "Machine Learning", "NLP", "Excel"}
    found_skills = [skill for skill in predefined_skills if skill.lower() in doc.text.lower()]

    # Extract name (first occurrence of PERSON entity)
    name = None
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text
            break

    # Add custom extensions
    doc._.email = email.group() if email else None
    doc._.phone = phone.group() if phone else None
    doc._.skills = found_skills
    doc._.name = name

    return doc

# Register custom extensions
from spacy.tokens import Doc
Doc.set_extension("email", default=None)
Doc.set_extension("phone", default=None)
Doc.set_extension("skills", default=[])
Doc.set_extension("name", default=None)

# Add the component to the pipeline
nlp.add_pipe("resume_parser", last=True)

# Test the pipeline
text = """
John Doe
Email: john.doe@example.com
Phone: 9876543210
Skills: Python, Machine Learning, Data Analysis
Education: B.Tech in Computer Science
"""
doc = nlp(text)
print(f"Name: {doc._.name}")
print(f"Email: {doc._.email}")
print(f"Phone: {doc._.phone}")
print(f"Skills: {doc._.skills}")

output

Name: John Doe
Email: john.doe@example.com
Phone: 9876543210
Skills: ['Python', 'Machine Learning']

For Multiple Resumes
how-to-process-unstructured-data-into-list-of-element

import spacy
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def resume_parser(doc):
    # Extract email
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
    email = re.search(email_pattern, doc.text)

    # Extract phone number
    phone_pattern = re.compile(r'\b\d{10}\b')
    phone = re.search(phone_pattern, doc.text)

    # Extract skills based on a predefined list
    predefined_skills = {"Python", "Java", "SQL", "Machine Learning", "NLP", "Excel"}
    found_skills = [skill for skill in predefined_skills if skill.lower() in doc.text.lower()]

    # Extract name using a for loop with break
    name = None
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text
            break  # Stop after finding the first PERSON entity

    # Return the extracted information
    return {
        "name": name,
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
        "skills": found_skills,
    }

# Function to process multiple resumes
def process_resumes(resume_texts):
    results = []
    for text in resume_texts:
        doc = nlp(text)
        results.append(resume_parser(doc))
    return results

# Example: List of resume texts
resume_texts = [
    """
    John Doe
    Email: john.doe@example.com
    Phone: 9876543210
    Skills: Python, Machine Learning, Data Analysis
    Education: B.Tech in Computer Science
    """,
    """
    Jane Smith
    Email: jane.smith@sample.com
    Phone: 1234567890
    Skills: Java, NLP, SQL
    Education: M.Sc in Data Science
    """,
    """
    Mark Johnson
    Email: mark.j@example.org
    Phone: 1122334455
    Skills: Excel, Python, Data Visualization
    Education: MBA in Analytics
    """
]

# Process the resumes
parsed_resumes = process_resumes(resume_texts)

# Print the results
for i, resume in enumerate(parsed_resumes):
    print(f"Resume {i + 1}:")
    print(f"Name: {resume['name']}")
    print(f"Email: {resume['email']}")
    print(f"Phone: {resume['phone']}")
    print(f"Skills: {', '.join(resume['skills'])}")
    print("-" * 40)
Product Review Sentiment Analysis
Description: Identifies the sentiment of product reviews (Positive, Negative, or Neutral).

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("sentiment_analysis")
def sentiment_analysis(doc):
    positive_words = {"good", "great", "excellent", "amazing", "positive"}
    negative_words = {"bad", "terrible", "poor", "negative", "horrible"}

    score = 0
    for token in doc:
        if token.text.lower() in positive_words:
            score += 1
        elif token.text.lower() in negative_words:
            score -= 1

    doc._.sentiment = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("sentiment", default="Neutral")

# Add the component to the pipeline
nlp.add_pipe("sentiment_analysis", last=True)

# Test the pipeline
text = "The product quality is amazing, but the delivery was terrible."
doc = nlp(text)
print(f"Sentiment: {doc._.sentiment}")
Social Media Hashtag and Mention Extractor
Description: Extracts hashtags (#example) and mentions (@username) from social media posts.

import spacy
import re
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("hashtag_mention_extractor")
def hashtag_mention_extractor(doc):
    # Use a regex on the raw text so results do not depend on how the tokenizer splits "#" and "@"
    doc._.hashtags = re.findall(r"#\w+", doc.text)
    doc._.mentions = re.findall(r"@\w+", doc.text)
    return doc

# Register custom extensions
from spacy.tokens import Doc
Doc.set_extension("hashtags", default=[])
Doc.set_extension("mentions", default=[])

# Add the component to the pipeline
nlp.add_pipe("hashtag_mention_extractor", last=True)

# Test the pipeline
text = "Loving the new features in #Python3! Thanks, @OpenAI for the amazing tools."
doc = nlp(text)
print(f"Hashtags: {doc._.hashtags}")
print(f"Mentions: {doc._.mentions}")
FAQ Finder
Description: Identifies potential question-answer pairs in customer support chat logs.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("faq_finder")
def faq_finder(doc):
    questions = [sent.text for sent in doc.sents if sent.text.endswith("?")]
    answers = [sent.text for sent in doc.sents if not sent.text.endswith("?")]

    faq = [{"question": q, "answer": a} for q, a in zip(questions, answers)]
    doc._.faq = faq
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("faq", default=[])

# Add the component to the pipeline
nlp.add_pipe("faq_finder", last=True)

# Test the pipeline
text = """
What is your return policy?
We offer a 30-day return policy with no questions asked.
How long does shipping take?
Shipping usually takes 3-5 business days.
"""
doc = nlp(text)
print("FAQs:")
for pair in doc._.faq:
    print(f"Q: {pair['question']}\nA: {pair['answer']}")

Legal Clause Extractor
Description: Extracts key legal clauses (e.g., confidentiality, termination) from contracts.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("legal_clause_extractor")
def legal_clause_extractor(doc):
    key_clauses = {"confidentiality", "termination", "liability", "dispute resolution"}
    clauses = [sent.text for sent in doc.sents if any(clause in sent.text.lower() for clause in key_clauses)]
    doc._.clauses = clauses
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("clauses", default=[])

# Add the component to the pipeline
nlp.add_pipe("legal_clause_extractor", last=True)

# Test the pipeline
text = """
This agreement includes a confidentiality clause that protects both parties.
Termination of the agreement may occur if either party breaches the terms.
Dispute resolution will be handled through arbitration.
"""
doc = nlp(text)
print(f"Extracted Clauses: {doc._.clauses}")

Plagiarism Detector
Description: Compares the input text with a database of documents and flags similar sentences.

import spacy
from spacy.language import Language
from difflib import SequenceMatcher

nlp = spacy.load("en_core_web_sm")

# Pre-existing document database
document_db = [
    "Artificial Intelligence is the future of technology.",
    "Machine Learning is a subset of Artificial Intelligence.",
    "Data Science combines statistics and programming."
]

@Language.component("plagiarism_detector")
def plagiarism_detector(doc):
    flagged_sentences = []
    for sent in doc.sents:
        for db_doc in document_db:
            similarity = SequenceMatcher(None, sent.text, db_doc).ratio()
            if similarity > 0.8:  # Flag if similarity > 80%
                flagged_sentences.append(sent.text)
                break
    doc._.plagiarized = flagged_sentences
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("plagiarized", default=[])

# Add the component to the pipeline
nlp.add_pipe("plagiarism_detector", last=True)

# Test the pipeline
text = "Artificial Intelligence is transforming industries. Data Science is also evolving."
doc = nlp(text)
print(f"Plagiarized Sentences: {doc._.plagiarized}")

Geographical Entity Extractor
Description: Extracts country, state, and city names from the text using a predefined list.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

# Predefined geographical locations
locations = {"India", "United States", "California", "New York", "Delhi", "Mumbai"}

@Language.component("geo_entity_extractor")
def geo_entity_extractor(doc):
    found_locations = [token.text for token in doc if token.text in locations]
    doc._.locations = found_locations
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("locations", default=[])

# Add the component to the pipeline
nlp.add_pipe("geo_entity_extractor", last=True)

# Test the pipeline
text = "John lives in California, and he recently visited Mumbai and Delhi in India."
doc = nlp(text)
print(f"Locations Found: {doc._.locations}")

Gender Detector
Description: Predicts the gender of a person based on their name using a predefined dataset.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

# Predefined name-gender mapping
name_gender_map = {
    "John": "Male",
    "Alice": "Female",
    "Michael": "Male",
    "Sarah": "Female",
    "Emily": "Female",
    "Robert": "Male"
}

@Language.component("gender_detector")
def gender_detector(doc):
    detected_gender = None
    for token in doc:
        if token.text in name_gender_map:
            detected_gender = name_gender_map[token.text]
            break
    doc._.gender = detected_gender
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("gender", default=None)

# Add the component to the pipeline
nlp.add_pipe("gender_detector", last=True)

# Test the pipeline
text = "John went to the park with Sarah."
doc = nlp(text)
print(f"Detected Gender: {doc._.gender}")

E-commerce Product Tagger
Description: Tags product categories and brands in an e-commerce product description.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

# Predefined product categories and brands
categories = {"laptop", "phone", "headphones", "camera"}
brands = {"Apple", "Samsung", "Sony", "Canon"}

@Language.component("product_tagger")
def product_tagger(doc):
    found_categories = [token.text for token in doc if token.text.lower() in categories]
    found_brands = [token.text for token in doc if token.text in brands]
    doc._.categories = found_categories
    doc._.brands = found_brands
    return doc

# Register custom extensions
from spacy.tokens import Doc
Doc.set_extension("categories", default=[])
Doc.set_extension("brands", default=[])

# Add the component to the pipeline
nlp.add_pipe("product_tagger", last=True)

# Test the pipeline
text = "The new Apple laptop and Sony headphones are on sale."
doc = nlp(text)
print(f"Categories: {doc._.categories}")
print(f"Brands: {doc._.brands}")

Medical Term Extractor
Description: Extracts medical terms and conditions from text for healthcare applications.

import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

# Predefined list of medical terms
medical_terms = {"diabetes", "hypertension", "cancer", "fever", "allergy"}

@Language.component("medical_term_extractor")
def medical_term_extractor(doc):
    found_terms = [token.text for token in doc if token.text.lower() in medical_terms]
    doc._.medical_terms = found_terms
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("medical_terms", default=[])

# Add the component to the pipeline
nlp.add_pipe("medical_term_extractor", last=True)

# Test the pipeline
text = "The patient was diagnosed with diabetes and hypertension last year."
doc = nlp(text)
print(f"Medical Terms Found: {doc._.medical_terms}")

Using LangChain with an NLP pipeline to build dynamic and context-aware text processing pipelines

Combining LangChain and SpaCy for Dynamic Sentiment Analysis

Install Required Libraries:

pip install langchain spacy openai
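These examples call the OpenAI API through LangChain, so an API key must be available before the chains run; a minimal sketch (the key value is a placeholder):

import os

# LangChain's OpenAI wrapper reads the key from the environment
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"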

Code Implementation:

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Create a dynamic prompt for generating sentiment words
prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Given the following text: '{text}', "
        "generate two lists of words or phrases. "
        "First, list words with positive sentiment. Second, list words with negative sentiment."
    ),
)

# Use LangChain to dynamically extract sentiment words
def generate_sentiment_words(text):
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    response = llm_chain.run(text)

    # Parse the response
    positive_words, negative_words = response.split("\n\n")
    positive_list = [word.strip() for word in positive_words.split(",")]
    negative_list = [word.strip() for word in negative_words.split(",")]

    return set(positive_list), set(negative_list)

# Custom pipeline component for sentiment analysis
@Language.component("dynamic_sentiment_analysis")
def dynamic_sentiment_analysis(doc):
    positive_words, negative_words = generate_sentiment_words(doc.text)
    score = 0
    for token in doc:
        if token.text.lower() in positive_words:
            score += 1
        elif token.text.lower() in negative_words:
            score -= 1

    # Assign sentiment
    doc._.sentiment = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("sentiment", default="Neutral")

# Add the custom component to the SpaCy pipeline
nlp.add_pipe("dynamic_sentiment_analysis", last=True)

# Test the pipeline
text = "The product quality was superb, but the shipping was awful and delayed."
doc = nlp(text)
print(f"Sentiment: {doc._.sentiment}")


Dynamic Keyword Extraction with LangChain

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Create a dynamic prompt for keyword extraction
keyword_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following text and extract the most important keywords or phrases:"
        "\n\n{text}\n\n"
        "List the keywords separated by commas."
    ),
)

# Define a function to dynamically generate keywords using LangChain
def generate_keywords(text):
    llm_chain = LLMChain(llm=llm, prompt=keyword_prompt)
    response = llm_chain.run(text)
    keywords = [keyword.strip() for keyword in response.split(",")]
    return set(keywords)

# Custom pipeline component for dynamic keyword extraction
@Language.component("keyword_extractor")
def keyword_extractor(doc):
    keywords = generate_keywords(doc.text)  # Dynamically extract keywords
    # Case-insensitive substring counts so multi-word keywords are counted too
    text_lower = doc.text.lower()
    keyword_count = {keyword: text_lower.count(keyword.lower()) for keyword in keywords}
    print("Keyword counts:", keyword_count)
    doc._.keywords = keyword_count
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("keywords", default={})

# Add the dynamic keyword extractor to the SpaCy pipeline
nlp.add_pipe("keyword_extractor", last=True)

# Test the pipeline
text = "AI and machine learning are subsets of data science. Neural networks are used in AI."
doc = nlp(text)
print(f"Extracted Keywords and Counts: {doc._.keywords}")

Example Output:
Input Text:

"AI and machine learning are subsets of data science. Neural networks are used in AI."

Dynamic Keyword Generation:
The LangChain-powered LLM generates:

"AI, machine learning, data science, neural networks"

Output:

Extracted Keywords and Counts: {'AI': 2, 'machine learning': 1, 'data science': 1, 'neural networks': 1}

Dynamic Profanity Filter with LangChain
Code Implementation:

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Define a dynamic prompt for detecting profane words
profanity_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following text and identify any potentially offensive or profane words:"
        "\n\n{text}\n\n"
        "List all profane words separated by commas."
    ),
)

# Define a function to dynamically detect profane words using LangChain
def detect_profane_words(text):
    llm_chain = LLMChain(llm=llm, prompt=profanity_prompt)
    response = llm_chain.run(text)
    profane_words = [word.strip() for word in response.split(",") if word.strip()]
    return set(profane_words)

# Custom pipeline component for dynamic profanity filtering
@Language.component("profanity_filter")
def profanity_filter(doc):
    profane_words = detect_profane_words(doc.text)  # Dynamically detect profane words
    tokens = []
    for token in doc:
        if token.text.lower() in profane_words:
            tokens.append("***")
        else:
            tokens.append(token.text)
    doc._.censored_text = " ".join(tokens)
    return doc

# Register the custom extension
from spacy.tokens import Doc
Doc.set_extension("censored_text", default=None)

# Add the profanity filter to the pipeline
nlp.add_pipe("profanity_filter", last=True)

# Test the pipeline
text = "This is a badword1 and another offensive word in the text."
doc = nlp(text)
print(f"Censored Text: {doc._.censored_text}")

Example Output:
Input Text:
"This is a badword1 and another offensive word in the text."
Dynamic Detection:

LangChain dynamically identifies:

"badword1, offensive"

Output:

Censored Text: This is a *** and another *** word in the text.

Dynamic Resume Parser with LangChain
Code Implementation:

import spacy
from spacy.language import Language
import re
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Create a prompt for dynamically extracting skills
skills_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following resume text and extract a list of technical or professional skills:"
        "\n\n{text}\n\n"
        "List the skills separated by commas."
    ),
)

# Function to dynamically extract skills using LangChain
def extract_skills(text):
    llm_chain = LLMChain(llm=llm, prompt=skills_prompt)
    response = llm_chain.run(text)
    skills = [skill.strip() for skill in response.split(",") if skill.strip()]
    return skills

# Custom pipeline component for dynamic resume parsing
@Language.component("resume_parser")
def resume_parser(doc):
    # Extract email
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
    email = re.search(email_pattern, doc.text)

    # Extract phone number
    phone_pattern = re.compile(r'\b\d{10}\b')
    phone = re.search(phone_pattern, doc.text)

    # Extract skills dynamically using LangChain
    found_skills = extract_skills(doc.text)

    # Add extracted details as custom extensions
    doc._.email = email.group() if email else None
    doc._.phone = phone.group() if phone else None
    doc._.skills = found_skills

    return doc

# Register custom extensions
from spacy.tokens import Doc
Doc.set_extension("email", default=None)
Doc.set_extension("phone", default=None)
Doc.set_extension("skills", default=[])

# Add the resume parser to the pipeline
nlp.add_pipe("resume_parser", last=True)

# Test the pipeline
text = """
John Doe
Email: john.doe@example.com
Phone: 9876543210
Skills: Python, Machine Learning, Data Analysis, Cloud Computing, Leadership
Education: B.Tech in Computer Science
"""
doc = nlp(text)
print(f"Email: {doc._.email}")
print(f"Phone: {doc._.phone}")
print(f"Skills: {doc._.skills}")

Example Output:
Input Text:

"John Doe
Email: john.doe@example.com
Phone: 9876543210
Skills: Python, Machine Learning, Data Analysis, Cloud Computing, Leadership
Education: B.Tech in Computer Science"

LangChain-Generated Skills:

"Python, Machine Learning, Data Analysis, Cloud Computing, Leadership"

Output:

Email: john.doe@example.com
Phone: 9876543210
Skills: ['Python', 'Machine Learning', 'Data Analysis', 'Cloud Computing', 'Leadership']

Dynamic Plagiarism Detector with LangChain

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from difflib import SequenceMatcher

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Prompt for retrieving similar documents dynamically
plagiarism_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following text and find sentences that are similar or may have been taken from other sources:"
        "\n\n{text}\n\n"
        "List sentences from external sources that are highly similar."
    ),
)

# Function to query LangChain for similar content dynamically
def get_similar_sentences(text):
    llm_chain = LLMChain(llm=llm, prompt=plagiarism_prompt)
    response = llm_chain.run(text)
    similar_sentences = [sentence.strip() for sentence in response.split("\n") if sentence.strip()]
    return similar_sentences

# Custom pipeline component for dynamic plagiarism detection
@Language.component("plagiarism_detector")
def plagiarism_detector(doc):
    flagged_sentences = []
    dynamic_db = get_similar_sentences(doc.text)  # Dynamically query similar content
    for sent in doc.sents:
        for db_doc in dynamic_db:
            similarity = SequenceMatcher(None, sent.text, db_doc).ratio()
            if similarity > 0.8:  # Flag if similarity > 80%
                flagged_sentences.append(sent.text)
                break
    doc._.plagiarized = flagged_sentences
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("plagiarized", default=[])

# Add the plagiarism detector to the pipeline
nlp.add_pipe("plagiarism_detector", last=True)

# Test the pipeline
text = "Artificial Intelligence is transforming industries. Data Science combines statistics and programming."
doc = nlp(text)
print(f"Plagiarized Sentences: {doc._.plagiarized}")

Example Output:
Input Text:

"Artificial Intelligence is transforming industries. Data Science combines statistics and programming."
LangChain Output (Dynamically Retrieved Similar Sentences):
"Artificial Intelligence is the future of technology."
"Data Science combines statistics and programming."

Output:

Plagiarized Sentences: ['Artificial Intelligence is transforming industries.', 'Data Science combines statistics and programming.']


Dynamic Geographical Entity Extractor with LangChain
Code Implementation:

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Define a dynamic prompt for extracting locations
location_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following text and extract all geographical locations mentioned, "
        "such as countries, states, cities, or landmarks:"
        "\n\n{text}\n\n"
        "List the locations separated by commas."
    ),
)

# Function to dynamically extract locations using LangChain
def extract_locations(text):
    llm_chain = LLMChain(llm=llm, prompt=location_prompt)
    response = llm_chain.run(text)
    locations = [loc.strip() for loc in response.split(",") if loc.strip()]
    return locations

# Custom pipeline component for dynamic location extraction
@Language.component("geo_entity_extractor")
def geo_entity_extractor(doc):
    found_locations = extract_locations(doc.text)  # Dynamically extract locations
    doc._.locations = found_locations
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("locations", default=[])

# Add the dynamic geographical entity extractor to the pipeline
nlp.add_pipe("geo_entity_extractor", last=True)

# Test the pipeline
text = "John lives in California, and he recently visited Mumbai and Delhi in India."
doc = nlp(text)
print(f"Locations Found: {doc._.locations}")

Example Output:
Input Text:

"John lives in California, and he recently visited Mumbai and Delhi in India."
LangChain-Generated Locations:
"California, Mumbai, Delhi, India"

Output:

Locations Found: ['California', 'Mumbai', 'Delhi', 'India']

Dynamic Gender Detector with LangChain
Code Implementation:

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Prompt for dynamically detecting gender
gender_prompt = PromptTemplate(
    input_variables=["name"],
    template=(
        "Determine the most likely gender of the given name: '{name}'. "
        "If the gender is unclear, respond with 'Unknown'."
    ),
)

# Function to dynamically infer gender using LangChain
def infer_gender(name):
    llm_chain = LLMChain(llm=llm, prompt=gender_prompt)
    response = llm_chain.run(name).strip()
    return response if response in {"Male", "Female", "Unknown"} else "Unknown"

# Custom pipeline component for dynamic gender detection
@Language.component("gender_detector")
def gender_detector(doc):
    detected_gender = None
    for token in doc:
        if token.ent_type_ == "PERSON":  # Check if the token is a person entity
            detected_gender = infer_gender(token.text)
            break
    doc._.gender = detected_gender
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("gender", default=None)

# Add the gender detector to the pipeline
nlp.add_pipe("gender_detector", last=True)

# Test the pipeline
text = "John went to the park with Sarah."
doc = nlp(text)
print(f"Detected Gender: {doc._.gender}")

Example Output:
Input Text:

"John went to the park with Sarah."
LangChain-Inferred Genders:
For "John": Male
For "Sarah": Female

Output:

Detected Gender: Male

Dynamic Medical Term Extractor with LangChain

import spacy
from spacy.language import Language
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Initialize OpenAI LLM
llm = OpenAI(model_name="text-davinci-003", temperature=0.7)

# Prompt for dynamically extracting medical terms
medical_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Analyze the following text and extract all medical terms or conditions mentioned, "
        "such as diseases, symptoms, and treatments:"
        "\n\n{text}\n\n"
        "List the medical terms separated by commas."
    ),
)

# Function to dynamically extract medical terms using LangChain
def extract_medical_terms(text):
    llm_chain = LLMChain(llm=llm, prompt=medical_prompt)
    response = llm_chain.run(text)
    medical_terms = [term.strip() for term in response.split(",") if term.strip()]
    return medical_terms

# Custom pipeline component for dynamic medical term extraction
@Language.component("medical_term_extractor")
def medical_term_extractor(doc):
    found_terms = extract_medical_terms(doc.text)  # Dynamically extract medical terms
    doc._.medical_terms = found_terms
    return doc

# Register custom extension
from spacy.tokens import Doc
Doc.set_extension("medical_terms", default=[])

# Add the dynamic medical term extractor to the pipeline
nlp.add_pipe(medical_term_extractor, last=True)

# Test the pipeline
text = "The patient was diagnosed with diabetes and hypertension last year. They also complained of severe headache and chronic fatigue."
doc = nlp(text)
print(f"Medical Terms Found: {doc._.medical_terms}")

Example Output:
Input Text:

"The patient was diagnosed with diabetes and hypertension last year. They also complained of severe headache and chronic fatigue."

LangChain-Generated Medical Terms:

"diabetes, hypertension, headache, chronic fatigue"

Output:

Medical Terms Found: ['diabetes', 'hypertension', 'headache', 'chronic fatigue']

How to implement NLP using Flask and test using Postman

Install Dependencies
You need to install the required Python packages:

pip install flask spacy

If your code uses a specific language model (e.g., en_core_web_sm), you must also download it:

python -m spacy download en_core_web_sm

Create the route that processes text with SpaCy (for example, in a routes.py module). The imports and model load at the top are added here so the snippet runs on its own:

from flask import Blueprint, request, jsonify
import spacy

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

def sentence_entity_highlighter(doc):
    result = []
    for sent in doc.sents:
        sent_data = {
            "sentence": sent.text,
            "entities": [
                {"text": ent.text, "label": ent.label_} for ent in sent.ents
            ],
        }
        result.append(sent_data)
    return result

# Create a Blueprint for the main routes
main = Blueprint("main", __name__)

@main.route('/process', methods=['POST'])
def process_text():
    # Get the input text from the request
    data = request.json
    text = data.get("text", "")

    if not text:
        return jsonify({"error": "Text input is required"}), 400

    # Process the text with SpaCy
    doc = nlp(text)
    result = sentence_entity_highlighter(doc)

    return jsonify({"result": result})
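The /process route above lives on a Blueprint, so it must be registered on a Flask app before it can be reached. Below is a minimal app.py sketch (the module name routes and the factory name create_app are illustrative assumptions) that registers the blueprint under the /api prefix used in the Postman URL below:

# app.py: minimal application setup (illustrative sketch)
from flask import Flask
from routes import main  # the Blueprint defined above

def create_app():
    app = Flask(__name__)
    # Register the blueprint under /api so the endpoint becomes /api/process
    app.register_blueprint(main, url_prefix="/api")
    return app

if __name__ == "__main__":
    create_app().run(debug=True)  # serves on http://127.0.0.1:5000

Run it with python app.py before testing the endpoint.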

Test the Endpoint Using Postman

Set Up Postman:

Method: POST

URL: http://127.0.0.1:5000/api/process
Headers:
Content-Type: application/json
Body:
Choose raw and select JSON.

Example JSON:

{
    "text": "Barack Obama was the president of the United States. He lives in Washington, D.C."
}

Send the Request:

Click Send in Postman.
If everything is set up correctly, you will receive a JSON response like:

{
  "result": [
    {
      "sentence": "Barack Obama was the president of the United States.",
      "entities": [
        {"text": "Barack Obama", "label": "PERSON"},
        {"text": "United States", "label": "GPE"}
      ]
    },
    {
      "sentence": "He lives in Washington, D.C.",
      "entities": [
        {"text": "Washington, D.C.", "label": "GPE"}
      ]
    }
  ]
}
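You can also exercise the endpoint from Python with the requests library (an extra dependency: pip install requests), assuming the app is running locally on port 5000:

import requests

payload = {"text": "Barack Obama was the president of the United States. He lives in Washington, D.C."}
response = requests.post("http://127.0.0.1:5000/api/process", json=payload)
print(response.status_code)
print(response.json())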


PROGRAMMING QUESTIONS

How to iterate over multiple lists
How to construct a list of multiple dictionaries

Removing special characters, emojis, or unnecessary whitespace

Joining each element of a flagged list with a comma and a space

Joining each element of a flagged list with a space

Custom mapping to dictionary keys

Initializing a dictionary using a dictionary comprehension or a dynamic list of keywords

Searching for all occurrences of a pattern in a string, paragraph, or long text and returning a list of matches

Searching for all occurrences of a pattern, replacing them with a given string, and returning a string

re.findall returns a list, while re.sub returns a string

Searching for all occurrences of a pattern, replacing them with a given string, and returning a list of matches

Counting occurrences in a list, string, or sentence

Counting occurrences in a string

Counting occurrences in a string after filtering

Most common word

Counting occurrences across multiple lists by adding lists

Counting occurrences across multiple strings by splitting and then adding

Accessing particular counts

Getting the 2 most common elements

Combining Counters with .update()

Finding elements that occur only once

Converting a Counter object to a list or dictionary

Converting into a dictionary from key-value arguments

How to iterate over multiple lists
How to construct a list of multiple dictionaries

questions = ["What is Python?", "What is Django?", "What is Flask?"]
answers = ["Python is a programming language.", "Django is a web framework.", "Flask is a micro web framework."]
categories = ["Programming", "Web Development", "Web Development"]

# zip() iterates over multiple lists in parallel
faq = [{"question": q, "answer": a} for q, a in zip(questions, answers)]

faq_with_category = [{"question": q, "answer": a, "category": c} for q, a, c in zip(questions, answers, categories)]
print(faq_with_category)

output

[
    {'question': 'What is Python?', 'answer': 'Python is a programming language.', 'category': 'Programming'},
    {'question': 'What is Django?', 'answer': 'Django is a web framework.', 'category': 'Web Development'},
    {'question': 'What is Flask?', 'answer': 'Flask is a micro web framework.', 'category': 'Web Development'}
]

Removing special characters, emojis, or unnecessary whitespace. Code:

cleaned_text = " ".join(token.text for token in doc if token.is_alpha)

Each element in the list flagged will be joined with a comma and a space

print(f"Blacklisted words detected: {', '.join(flagged)}")
print(f"Blacklisted words detected: {', '.join(token.text for token in doc if token.text.lower() in blacklist)}")

Each element in the list flagged will be joined with a space

tokens = ["***" if token.text.lower() in profane_words else token.text for token in doc]
doc._.censored_text = " ".join(tokens)

OR

doc._.censored_text = " ".join("***" if token.text.lower() in profane_words else token.text for token in doc)

custom mapping to dictionary keys

custom_lemmas = {"better": "good", "worse": "bad"}  # Define custom mappings
for token in doc:
    if token.text in custom_lemmas:
        token.lemma_ = custom_lemmas[token.text]

initialize dictionary using dictionary comprehension or assign dynamic list of keywords

keyword_count = {keyword: 0 for keyword in keywords}
keyword_count = {"AI": 0, "machine learning": 0, "data": 0, "neural networks": 0}
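Once the dictionary is initialized, it can be filled by scanning the text. This is a small illustrative sketch (the keywords and sample sentence are made up, and it uses simple case-insensitive substring counting):

import spacy

nlp = spacy.load("en_core_web_sm")

keywords = ["AI", "machine learning", "data", "neural networks"]
keyword_count = {keyword: 0 for keyword in keywords}

doc = nlp("AI and machine learning turn raw data into predictions; data drives neural networks.")

# Simple case-insensitive substring count for each keyword
lowered = doc.text.lower()
for keyword in keywords:
    keyword_count[keyword] = lowered.count(keyword.lower())

print(keyword_count)  # e.g. {'AI': 1, 'machine learning': 1, 'data': 2, 'neural networks': 1}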

Searches for all occurrences of a pattern in a string or paragraph or long text and returns a list of matches

# Example patterns (the email pattern is a simple illustrative one)
url_pattern = r'https?://(?:www\.)?\S+'
email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'

urls = re.findall(url_pattern, doc.text)
emails = re.findall(email_pattern, doc.text)

Searches for all occurrences of a pattern in a string, paragraph, or long text, replaces them with a given string, and returns a string

url_pattern = r'https?://(?:www\.)?\S+'

# Replacement string for URLs
replacement = '[URL]'

# Replace all URLs with the replacement string
replaced_text = re.sub(url_pattern, replacement, doc_text)

Note

re.findall returns a list, while re.sub returns a string.

Searches for all occurrences of a pattern in a string, paragraph, or long text, replaces them with a given string, and returns a list of matches

first way

# Sample text
text = "Visit https://example.com for more info. Then go to http://test.com."

# URL pattern to match
url_pattern = r'https?://(?:www\.)?\S+'

# Step 1: Replace URLs using re.sub
replaced_text = re.sub(url_pattern, "[URL]", text)

# Step 2: Use list comprehension to split the text and process each word
modified_list = [word for word in replaced_text.split()]

output

['Visit', '[URL]', 'for', 'more', 'info.', 'Then', 'go', 'to', '[URL].']

second way

replaced_text = re.sub(url_pattern, "[URL]", text)
modified_list = replaced_text.split()

third way

urls = re.findall(url_pattern, text)
modified_tokens = [re.sub(url_pattern, "[URL]", token) if re.match(url_pattern, token) else token for token in text.split()]

Count occurrences in a list, string, or sentence

pos_counts = Counter(token.pos_ for token in doc)

from collections import Counter
items = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
item_count = Counter(items)
print(item_count)

output


Counter({'apple': 3, 'banana': 2, 'orange': 1})

Counting Occurrences in a String

from collections import Counter
text = "hello world"
char_count = Counter(text)
print(char_count)

output

Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, ' ': 1, 'w': 1, 'r': 1, 'd': 1})

Example (Count words)

from collections import Counter
text = "hello world hello"
word_count = Counter(text.split())
print(word_count)

output

Counter({'hello': 2, 'world': 1})

Counting occurrences in a string after filtering (dynamically, using a regex)

words = re.findall(r'\b\w+\b', sentence.lower())
word_count = Counter(words)

output

Counter({'this': 2, 'sentence': 2, 'is': 2, 'a': 1, 'test': 1, 'hello': 1, 'for': 1, 'testing': 1})

Most common word:

sentence = "apple banana apple apple orange banana"

# Count words
word_count = Counter(sentence.split())
# Get most common word
most_common = word_count.most_common(1)
print(most_common)

output

[('apple', 3)]

Counting Occurrences Across Multiple Lists/Strings

from collections import Counter
list1 = ["apple", "banana", "cherry", "apple"]
list2 = ["banana", "apple", "date", "apple"]
list3 = ["cherry", "apple", "banana", "cherry"]
combined_counter = Counter(list1 + list2 + list3)
# Alternatively, you can use multiple counters and add them
counter1 = Counter(list1)
counter2 = Counter(list2)
counter3 = Counter(list3)
combined_counter = counter1 + counter2 + counter3
print(combined_counter)

output

Counter({'apple': 5, 'banana': 3, 'cherry': 3, 'date': 1})

Counting occurrences across multiple strings by splitting and then adding:

from collections import Counter
sentence1 = "apple banana apple"
sentence2 = "banana apple date"
sentence3 = "cherry apple banana"
words1 = sentence1.split()
words2 = sentence2.split()
words3 = sentence3.split()
combined_counter = Counter(words1 + words2 + words3)
# Alternatively, you can use multiple counters and add them
counter1 = Counter(words1)
counter2 = Counter(words2)
counter3 = Counter(words3)
combined_counter = counter1 + counter2 + counter3
print(combined_counter)

Output:

Counter({'apple': 4, 'banana': 3, 'cherry': 1, 'date': 1})

Most common element across more than two lists, strings, or sentences by combining Counters

from collections import Counter
list1 = ["apple", "banana", "cherry", "apple"]
list2 = ["banana", "apple", "date", "apple"]
list3 = ["cherry", "apple", "banana", "cherry"]
combined_counter = Counter(list1 + list2 + list3)
counter1 = Counter(list1)
counter2 = Counter(list2)
counter3 = Counter(list3)
combined_counter = counter1 + counter2 + counter3
most_common = combined_counter.most_common(1)
print(f"The most common element is: {most_common[0]}")

Output:

The most common element is: ('apple', 5)

Accessing Counts

counter = Counter({'banana': 3, 'apple': 2, 'orange': 1})
print(counter['banana'])  # Output: 3

Get the 2 most common elements

most_common = counter.most_common(2)
print(most_common)
Output:
[('banana', 3), ('apple', 2)]

Combining Counters with .update()

counter = Counter(['apple', 'banana'])
# Update the counter with more elements
counter.update(['apple', 'cherry', 'banana', 'banana'])
print(counter)
Output:
Counter({'banana': 3, 'apple': 2, 'cherry': 1})

Finding Elements that Occur Only Once

elements = ['apple', 'banana', 'apple', 'orange', 'cherry', 'banana']
counter = Counter(elements)
# Find elements that occur exactly once
unique_elements = [item for item, count in counter.items() if count == 1]
print(unique_elements)
Output:
['orange', 'cherry']

Sort Counter items by key:

sorted_by_key = sorted(counter.items())


Convert a Counter object to a list or dictionary

pos_counts = Counter(token.pos_ for token in doc)
doc._.pos_counts = dict(pos_counts)

=========OR===============

dict(Counter(['apple', 'apple', 'orange']))   # {'apple': 2, 'orange': 1}
list(Counter('aabbc').items())                # [('a', 2), ('b', 2), ('c', 1)]

===============OR==================

pos_counts = Counter(token.pos_ for token in doc)
# Convert to a list of tuples (pos_tag, count)
pos_list = list(pos_counts.items())
# Optional: Sort the list by count (if desired)
sorted_pos_list = sorted(pos_list, key=lambda x: x[1], reverse=True)

Converting to a dictionary from various formats
Converting into a Dictionary from Key-Value Arguments

my_dict = dict(a=1, b=2, c=3)

output

{'a': 1, 'b': 2, 'c': 3}

Converting into a Dictionary from a List of Tuples

# List of tuples containing key-value pairs
tuple_list = [('a', 1), ('b', 2), ('c', 3)]
# Converting the list of tuples into a dictionary
my_dict = dict(tuple_list)
print(my_dict)
Output:
{'a': 1, 'b': 2, 'c': 3}

Converting into a Dictionary from a zip Object

keys = ["name", "age", "city"]
values = ["Alice", 25, "New York"]
# Use zip() to combine the two lists and convert it into a dictionary
my_dict = dict(zip(keys, values))
print(my_dict)
Output:
{'name': 'Alice', 'age': 25, 'city': 'New York'}



Other dynamic ways to build a dictionary:

Using List Comprehension Inside zip() to Create Keys and Values Dynamically
Dynamic Key-Value Creation Based on Index
Dynamic Key and Value Computation Based on Some Condition
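The original code for these three variations was provided as screenshots; the sketch below shows one plausible version of each, with purely illustrative data and variable names:

# 1. List comprehension inside zip() to create keys and values dynamically
words = ["apple", "banana", "cherry"]
word_lengths = dict(zip([w.upper() for w in words], [len(w) for w in words]))
print(word_lengths)  # {'APPLE': 5, 'BANANA': 6, 'CHERRY': 6}

# 2. Dynamic key-value creation based on index
items = ["red", "green", "blue"]
indexed = {f"item_{i}": value for i, value in enumerate(items)}
print(indexed)  # {'item_0': 'red', 'item_1': 'green', 'item_2': 'blue'}

# 3. Dynamic key and value computation based on some condition
numbers = [1, 2, 3, 4, 5]
parity = {n: ("even" if n % 2 == 0 else "odd") for n in numbers}
print(parity)  # {1: 'odd', 2: 'even', 3: 'odd', 4: 'even', 5: 'odd'}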

Converting into a Dictionary using Counter

from collections import Counter
# List of items
items = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
# Using Counter to count occurrences and create a dictionary-like object
counter = Counter(items)
# Convert Counter to dictionary
my_dict = dict(counter)
print(my_dict)
Output:
{'apple': 3, 'banana': 2, 'orange': 1}

Converting into a Dictionary using List Comprehension

# List of tuples (key, value)
pairs = [('a', 1), ('b', 2), ('c', 3)]
# Using list comprehension to convert into a dictionary
my_dict = {key: value for key, value in pairs}
print(my_dict)
Output:
{'a': 1, 'b': 2, 'c': 3}

Converting into a Dictionary from a String (Key-Value Pairs)

# String with key-value pairs
data = "a=1, b=2, c=3"
# Converting the string to a dictionary
my_dict = dict(item.split('=') for item in data.split(', '))
print(my_dict)
Output:
{'a': '1', 'b': '2', 'c': '3'}

Using fromkeys() Method to Create a Dictionary

# List of keys
keys = ['a', 'b', 'c']
# Using fromkeys() to create a dictionary with all values set to 0
my_dict = dict.fromkeys(keys, 0)
print(my_dict)
Output:
{'a': 0, 'b': 0, 'c': 0}

Converting from JSON String to Dictionary

import json
# JSON string
json_string = '{"a": 1, "b": 2, "c": 3}'
# Converting JSON string to dictionary
my_dict = json.loads(json_string)
print(my_dict)
Output:
{'a': 1, 'b': 2, 'c': 3}

Convert a String to a List
When converting a string into a list, each character in the string will become an element in the list.

# Example string
my_string = "hello"

# Convert string to list of characters
my_list_from_string = list(my_string)

print(my_list_from_string)

Output:

['h', 'e', 'l', 'l', 'o']

1. Convert a Set to a List
You can easily convert a set to a list using the list() constructor. Keep in mind that sets are unordered collections, so the resulting list may not maintain the order of the elements.

my_set = {1, 2, 3, 4}

# Convert set to list
my_list_from_set = list(my_set)

print(my_list_from_set)

Output:

[1, 2, 3, 4]

(The order of elements might vary because sets are unordered.)

2. Convert a Tuple to a List
To convert a tuple to a list, you can use the list() constructor. The tuple's order will be preserved in the resulting list.

Code:

# Example tuple
my_tuple = (1, 2, 3, 4)

# Convert tuple to list
my_list_from_tuple = list(my_tuple)

print(my_list_from_tuple)

Output:

[1, 2, 3, 4]

3. Convert a Dictionary to a List

Code:

# Example dictionary
my_dict = {'a': 1, 'b': 2, 'c': 3}

# Convert dictionary keys to a list
keys_list = list(my_dict.keys())

# Convert dictionary values to a list
values_list = list(my_dict.values())

# Convert dictionary items (key-value pairs) to a list of tuples
items_list = list(my_dict.items())

print(keys_list)
print(values_list)
print(items_list)

Output:

['a', 'b', 'c']
[1, 2, 3]
[('a', 1), ('b', 2), ('c', 3)]

4. Convert Multiple Data Types into a List
# Example data types
my_string = "hello"
my_set = {1, 2, 3}
my_tuple = (4, 5)
my_dict = {'a': 1, 'b': 2}

# Combine them into a single list
combined_list = list(my_string) + list(my_set) + list(my_tuple) + list(my_dict.keys())

print(combined_list)

Output:

['h', 'e', 'l', 'l', 'o', 1, 2, 3, 4, 5, 'a', 'b']

5. Convert Multiple Data Types into a List (Alternative Approach)

Code:

# Example data types
my_string = "hello"
my_set = {1, 2, 3}
my_tuple = (4, 5)
my_dict = {'a': 1, 'b': 2}

# Combine them as separate lists in one list
combined_separated_list = [list(my_string), list(my_set), list(my_tuple), list(my_dict.items())]

print(combined_separated_list)

Output:

[['h', 'e', 'l', 'l', 'o'], [1, 2, 3], [4, 5], [('a', 1), ('b', 2)]]

1. Combine Multiple Lists
To combine multiple lists, you can use the + operator or extend() method to concatenate them into one list.

list1 = [1, 2, 3]
list2 = [4, 5, 6]
list3 = [7, 8, 9]

# Combine using + operator
combined_list = list1 + list2 + list3

# Or using extend()
list1.extend(list2)
list1.extend(list3)

print("Combined List:", combined_list)
print("Combined List using extend:", list1)

Output:

Combined List: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Combined List using extend: [1, 2, 3, 4, 5, 6, 7, 8, 9]

2. Reverse a List
To reverse a list, you can use the reversed() function or the slicing technique.
# Reverse a list using reversed()
reversed_list = list(reversed(combined_list))

# Or using slicing
reversed_list2 = combined_list[::-1]

print("Reversed List using reversed():", reversed_list)
print("Reversed List using slicing:", reversed_list2)

Output:

Reversed List using reversed(): [9, 8, 7, 6, 5, 4, 3, 2, 1]
Reversed List using slicing: [9, 8, 7, 6, 5, 4, 3, 2, 1]

3. Sort a List
To sort a list in ascending or descending order, you can use the sort() method (which modifies the list in place) or sorted() (which returns a new sorted list).
# Sort list in ascending order
sorted_list = sorted(combined_list)

# Sort list in descending order
sorted_desc_list = sorted(combined_list, reverse=True)

# Or sort in place using sort()
combined_list.sort()  # Ascending order
combined_desc_list = combined_list.copy()
combined_desc_list.sort(reverse=True)  # Descending order

print("Sorted List (Ascending):", sorted_list)
print("Sorted List (Descending):", sorted_desc_list)
print("Sorted In Place (Ascending):", combined_list)
print("Sorted In Place (Descending):", combined_desc_list)

Output:

Sorted List (Ascending): [1, 2, 3, 4, 5, 6, 7, 8, 9]
Sorted List (Descending): [9, 8, 7, 6, 5, 4, 3, 2, 1]
Sorted In Place (Ascending): [1, 2, 3, 4, 5, 6, 7, 8, 9]
Sorted In Place (Descending): [9, 8, 7, 6, 5, 4, 3, 2, 1]

4. Remove Duplicates
To remove duplicates from a list while maintaining the order, you can use dict.fromkeys(), which keeps the first occurrence of each element (a plain set() would not preserve order).

Code:

list_with_duplicates = [1, 2, 2, 3, 4, 4, 5, 5, 6]

# Remove duplicates while maintaining order
unique_list = list(dict.fromkeys(list_with_duplicates))

print("List with Duplicates removed:", unique_list)

Output:

List with Duplicates removed: [1, 2, 3, 4, 5, 6]

5. Filter a List Based on a Condition
You can use a list comprehension or the filter() function to filter elements based on a condition.

Code:

# Filter even numbers from a list
even_numbers = [num for num in combined_list if num % 2 == 0]

# Or using filter() function
even_numbers_filter = list(filter(lambda x: x % 2 == 0, combined_list))

print("Even Numbers (List Comprehension):", even_numbers)
print("Even Numbers (Using filter()):", even_numbers_filter)

Output:

Even Numbers (List Comprehension): [2, 4, 6, 8]
Even Numbers (Using filter()): [2, 4, 6, 8]

6. Find the Most Frequent Element
To find the most frequent element across multiple lists, you can use the Counter class from the collections module.

Code:

from collections import Counter

list1 = [1, 2, 3, 4, 1]
list2 = [5, 1, 6, 2, 1]
list3 = [7, 1, 8, 1, 9]

# Combine all lists
combined_lists = list1 + list2 + list3

# Use Counter to get the frequency of each element
counter = Counter(combined_lists)

# Find the most common element
most_common_element = counter.most_common(1)

print("Most Common Element:", most_common_element)

Output:

Most Common Element: [(1, 5)]

7. Flatten Nested Lists
To flatten a nested list, you can use a list comprehension or a recursive function.

Code:

nested_list = [[1, 2, 3], [4, 5], [6, 7]]

# Using list comprehension to flatten
flattened_list = [item for sublist in nested_list for item in sublist]

print("Flattened List:", flattened_list)

Output:

Flattened List: [1, 2, 3, 4, 5, 6, 7]

8. Zip Multiple Lists Together
You can use zip() to combine multiple lists element-wise.

Code:

list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']

# Zip lists together
zipped = list(zip(list1, list2))

print("Zipped Lists:", zipped)

Output:

Zipped Lists: [(1, 'a'), (2, 'b'), (3, 'c')]

9. Apply Functions to Lists
You can apply functions to each element in a list using map() or list comprehension.
# Example function to square numbers
def square(x):
    return x ** 2

# Apply using map
squared_numbers = list(map(square, combined_list))

# Or using list comprehension
squared_numbers_comprehension = [x ** 2 for x in combined_list]

print("Squared Numbers (map):", squared_numbers)
print("Squared Numbers (List Comprehension):", squared_numbers_comprehension)

Output:

Squared Numbers (map): [1, 4, 9, 16, 25, 36, 49, 64, 81]
Squared Numbers (List Comprehension): [1, 4, 9, 16, 25, 36, 49, 64, 81]

10. Sum of Elements in a List
You can use the sum() function to find the sum of elements in a list.
# Sum elements in a list
total_sum = sum(combined_list)

print("Sum of List:", total_sum)

Output:

Sum of List: 45

extract the first person's name

name = next((ent.text for ent in doc.ents if ent.label_ == "PERSON"), None)

==============================OR======================

name = None
for ent in doc.ents:
    if ent.label_ == "PERSON":
        name = ent.text
        break  # Stop after finding the first PERSON entity

Token text that starts with, or sentence text that ends with, a special symbol

[token.text for token in doc if token.text.startswith("#")]
[token.text for token in doc if token.text.startswith("@")]
[sent.text for sent in doc.sents if sent.text.endswith("?")]
[sent.text for sent in doc.sents if not sent.text.endswith("?")]

Filter sentences based on the presence of certain keywords or clauses from the list key_clauses

[sent.text for sent in doc.sents if any(clause in sent.text.lower() for clause in key_clauses)]
===============OR======================
clauses = []
for sent in doc.sents:
    if any(clause in sent.text.lower() for clause in key_clauses):
        clauses.append(sent.text)

Store input text or an LLM response as a list of elements

positive_words, negative_words = response.split("\n\n")
positive_list = [word.strip() for word in positive_words.split(",")]

====================or=========================================

keywords = [keyword.strip() for keyword in response.split(",")]

Convert list data into a set

keywords = [keyword.strip() for keyword in response.split(",")]
return set(keywords)

How to compare two sequences (strings, lists, etc.) and find the longest contiguous matching subsequence between them

How to get a similarity score between two sequences (strings, lists, etc.)

from difflib import SequenceMatcher

similarity = SequenceMatcher(None, sent.text, db_doc).ratio()
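Both operations come from difflib.SequenceMatcher; here is a short, self-contained sketch (the two example strings are illustrative):

from difflib import SequenceMatcher

a = "The patient reported severe headache and fatigue."
b = "Patient reported severe headache and chronic fatigue."

matcher = SequenceMatcher(None, a, b)

# Longest contiguous matching subsequence between a and b
match = matcher.find_longest_match(0, len(a), 0, len(b))
print(a[match.a: match.a + match.size])  # the shared substring

# Similarity score between 0.0 and 1.0
print(matcher.ratio())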

Difference between doc, doc.sents, and doc.text

doc is the processed container you iterate over for tokens, doc.sents is a generator of sentence spans, and doc.text is the original input as a plain string.

After iterating over doc, what output will I get?

Words or tokens (Token objects)

After iterating over doc.sents, what output will I get?

Sentences in the processed text (Span objects)

Can I iterate over doc.text?

Not usefully. doc.text contains the original, entire text as a plain string (the exact input string), so iterating over it only yields individual characters.
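A minimal sketch that makes the difference concrete (the sample text is illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is buying a U.K. startup. The deal is worth $1 billion.")

# Iterating doc yields Token objects
print([token.text for token in doc][:5])   # e.g. ['Apple', 'is', 'buying', 'a', 'U.K.']

# Iterating doc.sents yields sentence Spans
print([sent.text for sent in doc.sents])   # e.g. two sentences

# doc.text is just the original string
print(doc.text)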

After iterating over doc I get tokens; what more can I get from each token?

import spacy

nlp = spacy.load("en_core_web_sm")

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion. The company has been growing rapidly."

# Process the text with spaCy
doc = nlp(text)

# Iterate over each token and print various attributes
for token in doc:
    print(f"Token: {token.text}")
    print(f"Lemma: {token.lemma_}")
    print(f"POS: {token.pos_}")
    print(f"Tag: {token.tag_}")
    print(f"Dep: {token.dep_}")
    print(f"Head: {token.head.text}")
    print(f"Is Stop: {token.is_stop}")
    print(f"Is Alpha: {token.is_alpha}")
    print(f"Is Digit: {token.is_digit}")
    print(f"Shape: {token.shape_}")
    print(f"Is Punct: {token.is_punct}")
    print(f"Ent Type: {token.ent_type_}")
    print(f"Ent IOB: {token.ent_iob_}")
    print(f"Vector (first 10 elements): {token.vector[:10]}")  # Only first 10 elements for display
    print("-" * 40)


Token: Apple
Lemma: Apple
POS: PROPN
Tag: NNP
Dep: nsubj
Head: looking
Is Stop: False
Is Alpha: True
Is Digit: False
Shape: Xxxxx
Is Punct: False
Ent Type: ORG
Ent IOB: B
Vector (first 10 elements): [0.23762972 0.0010471  0.10166511 -0.23210286 -0.03329169  0.02825635
  0.22114269 -0.07173653 -0.00645234 -0.08234595]
----------------------------------------
Token: is
Lemma: be
POS: AUX
Tag: VBZ
Dep: ROOT
Head: looking
Is Stop: True
Is Alpha: True
Is Digit: False
Shape: xx
Is Punct: False
Ent Type: 
Ent IOB: O
Vector (first 10 elements): [-0.16527251 -0.05125302 -0.04848239  0.06471313  0.23750917 -0.15849067
  0.09828461 -0.17527371  0.12672948  0.04478811]
----------------------------------------
Token: looking
Lemma: look
POS: VERB
Tag: VBG
Dep: ROOT
Head: looking
Is Stop: False
Is Alpha: True
Is Digit: False
Shape: xxxx
Is Punct: False
Ent Type: 
Ent IOB: O
Vector (first 10 elements): [ 0.00665388 -0.26470256 -0.25616258  0.11279883  0.13066134 -0.05102386
  0.00930824  0.07289567 -0.05367638  0.03498989]
----------------------------------------


To get these attributes, what are the possible things I can iterate over?

for token in doc:
for sentence in doc.sents:
for ent in doc.ents:
for np in doc.noun_chunks:
for component in nlp.pipe_names:  # iterates the pipeline component names, not the doc

Create a DataFrame from the token data after iterating over doc

import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion. The company has been growing rapidly."
doc = nlp(text)
tokens_data = []
for token in doc:
    tokens_data.append({
        "Token": token.text,
        "Lemma": token.lemma_,
        "POS": token.pos_,
        "Tag": token.tag_,
        "Dep": token.dep_,
        "Head": token.head.text,
        "Is Stop": token.is_stop,
        "Is Alpha": token.is_alpha,
        "Is Digit": token.is_digit,
        "Shape": token.shape_,
        "Is Punct": token.is_punct,
        "Ent Type": token.ent_type_,
        "Ent IOB": token.ent_iob_,
        "Vector (first 10)": token.vector[:10]  # Show only first 10 elements of the vector
    })
df = pd.DataFrame(tokens_data)
print(df)

(The printed DataFrame contains one row per token with the columns defined above.)
