Advanced NLP Techniques for Text Processing, Representation, and Understanding

1. Bag of Words (BoW)
Concept: BoW is a way of representing text by counting the occurrences of each word in a document, disregarding grammar and order.

Usage: Useful for text classification, information retrieval, and basic NLP tasks.

Example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dr. Strange loves pav bhaji of Mumbai.", "Hulk loves chat of Delhi."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Representation:\n", X.toarray())

Output:

Vocabulary: {'dr': 3, 'strange': 9, 'loves': 5, 'pav': 8, 'bhaji': 0, 'of': 7, 'mumbai': 6, 'hulk': 4, 'chat': 1, 'delhi': 2}
BoW Representation:
[[1 0 0 1 0 1 1 1 1 1]
 [0 1 1 0 1 1 0 1 0 0]]

This representation encodes each sentence by counting occurrences of each word in the corpus vocabulary.
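
Once fitted, the same vectorizer can encode new text against the learned vocabulary; out-of-vocabulary words are simply dropped. A minimal sketch reusing the fitted vectorizer from above:

# Encode an unseen document with the existing vocabulary;
# words not seen during fitting are ignored
new_doc = ["Hulk loves pav bhaji of Mumbai"]
print(vectorizer.transform(new_doc).toarray())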

2. Word2Vec

Concept: Word2Vec is a neural network-based word embedding technique that represents words in vector space based on their context in a large corpus.

Usage: Captures semantic relationships between words, useful for similarity, clustering, and other ML tasks.

Example (Using gensim library):

from gensim.models import Word2Vec

sentences = [["dr", "strange", "loves", "pav", "bhaji", "of", "mumbai"],
             ["hulk", "loves", "chat", "of", "delhi"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, window=3)

print("Vector for 'strange':", model.wv['strange'])
print("Similarity between 'loves' and 'bhaji':", model.wv.similarity('loves', 'bhaji'))

Output (values vary between runs, since training is randomly initialized):

Vector for 'strange': [0.123, -0.34, ..., 0.87]  # A vector of 50 dimensions
Similarity between 'loves' and 'bhaji': 0.35  # Cosine similarity score
Word2Vec helps capture context-based meaning, allowing calculations of similarity between words.
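
gensim also exposes nearest-neighbour lookups via most_similar; on a toy corpus this small the neighbours are essentially noise, but the call is the same on a real corpus:

# Top-3 nearest words to 'loves' by cosine similarity
print(model.wv.most_similar('loves', topn=3))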

3. n-grams

Concept: n-grams are contiguous sequences of n words in a text. They capture local context and are commonly used in language models.

Usage: Useful for phrase detection, context analysis, and language modeling.

Example:

from nltk import ngrams

text = "Dr. Strange loves pav bhaji of Mumbai"
bigrams = list(ngrams(text.split(), 2))
trigrams = list(ngrams(text.split(), 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

Output:

Bigrams: [('Dr.', 'Strange'), ('Strange', 'loves'), ('loves', 'pav'), ('pav', 'bhaji'), ('bhaji', 'of'), ('of', 'Mumbai')]
Trigrams: [('Dr.', 'Strange', 'loves'), ('Strange', 'loves', 'pav'), ('loves', 'pav', 'bhaji'), ('pav', 'bhaji', 'of'), ('bhaji', 'of', 'Mumbai')]

n-grams capture consecutive word sequences, which can be useful for identifying common phrases or linguistic patterns.
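
n-grams also plug directly into the BoW pipeline: scikit-learn's CountVectorizer takes an ngram_range parameter, so unigrams and bigrams can be counted together. A short sketch (get_feature_names_out needs scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dr. Strange loves pav bhaji of Mumbai.", "Hulk loves chat of Delhi."]
# ngram_range=(1, 2) counts single words and two-word sequences together
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_ngrams = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())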

4. Stop Words Removal

Concept: Stop words are common words like "and," "the," and "is" that often do not add meaningful value in text processing.

Usage: Reduces noise in the data, helpful for text classification and information retrieval.

Example:

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
text = "Dr. Strange loves pav bhaji of Mumbai"
filtered_words = [word for word in text.split() if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)

Output:

Filtered Words: ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'Mumbai']

Stop words are removed to simplify the text and focus on meaningful words.
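
Because the NLTK list is a plain Python set, it can be extended with corpus-specific words; the extra entry below is purely illustrative:

# Extend the standard list with words that carry little meaning here
custom_stop_words = stop_words | {"dr"}  # illustrative addition
filtered = [w for w in text.split() if w.lower().strip('.') not in custom_stop_words]
print(filtered)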

5. Word Embeddings (General Concept)

Concept: Word embeddings are vector representations of words that capture their meanings based on context. Unlike BoW, embeddings preserve semantic relationships.

Usage: Used in advanced NLP tasks like sentiment analysis, topic modeling, and machine translation.

Example: Word embeddings can be loaded from pre-trained models like GloVe or fastText.

# Example using pre-trained GloVe embeddings
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings (download glove.6B.50d.txt beforehand);
# no_header=True (gensim 4.0+) reads the raw GloVe text format directly
glove_model = KeyedVectors.load_word2vec_format("glove.6B.50d.txt", binary=False, no_header=True)

# GloVe 6B is lowercased, so query with lowercase tokens
print("Vector for 'mumbai':", glove_model['mumbai'])
print("Similarity between 'mumbai' and 'delhi':", glove_model.similarity('mumbai', 'delhi'))

This code (if the embeddings file is available) loads GloVe embeddings and allows operations like similarity.
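
Pre-trained vectors also support analogy arithmetic via most_similar; the classic example, assuming the same GloVe file is loaded:

# king - man + woman ≈ queen, solved by vector arithmetic
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))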

6. Tokenization and Sentence Splitting

Concept: Splitting text into sentences or individual words (tokens).

Usage: Essential for preprocessing in almost any NLP task.

Example:

from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

text = "Dr. Strange loves pav bhaji of Mumbai. Hulk loves chat of Delhi."
sentences = sent_tokenize(text)
words = word_tokenize(text)

print("Sentences:", sentences)
print("Words:", words)

Output:

Sentences: ['Dr. Strange loves pav bhaji of Mumbai.', 'Hulk loves chat of Delhi.']
Words: ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'of', 'Mumbai', '.', 'Hulk', 'loves', 'chat', 'of', 'Delhi', '.']

Helps break down text into manageable parts for analysis.
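
The two tokenizers compose naturally; continuing the example above, each sentence can be word-tokenized separately for steps that work sentence by sentence:

# Word-tokenize each sentence individually
tokens_per_sentence = [word_tokenize(s) for s in sentences]
print(tokens_per_sentence)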

7. TF-IDF (Term Frequency-Inverse Document Frequency)

Concept: TF-IDF is a weighted BoW approach that emphasizes words that are frequent in a document but rare across the collection.

Usage: Helps identify important words in text, useful for information retrieval.

Example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dr. Strange loves pav bhaji of Mumbai.", "Hulk loves chat of Delhi."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("TF-IDF Representation:\n", X.toarray())

Output (values rounded to four decimals):

TF-IDF Representation:
[[0.4078 0.     0.     0.4078 0.     0.2902 0.4078 0.2902 0.4078 0.4078]
 [0.     0.4992 0.4992 0.     0.4992 0.3552 0.     0.3552 0.     0.    ]]

Highlights unique terms in each document.
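
To see which term each score belongs to, pair the matrix columns with the vectorizer's feature names (get_feature_names_out needs scikit-learn 1.0+):

# Map each term to its TF-IDF weight in the first document
terms = vectorizer.get_feature_names_out()
weights = dict(zip(terms, X.toarray()[0]))
print(sorted(weights.items(), key=lambda kv: -kv[1]))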

8. Text Normalization (Lowercasing and Lemmatization)

Concept: Converts text to a normalized form to reduce variability.

Usage: Helps simplify text, especially for ML models.

Example:

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["loves", "loving", "loved"]
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]

print("Lemmatized Words:", lemmas)

Output:

Lemmatized Words: ['love', 'love', 'love']
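
Lemmatization maps inflected forms to a single base form, shrinking the vocabulary downstream models must handle. Lowercasing and lemmatization are usually combined in one preprocessing pass; a minimal sketch of such a pipeline (the sentence is illustrative; requires the punkt and wordnet NLTK data):

from nltk.tokenize import word_tokenize

text = "Dr. Strange LOVED the pav bhaji"
# Lowercase, tokenize, then lemmatize verbs to their base form
normalized = [lemmatizer.lemmatize(w, pos='v') for w in word_tokenize(text.lower())]
print(normalized)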
