Bag of Words (BoW)
Concept: BoW is a way of representing text by counting the occurrences of each word in a document, disregarding grammar and word order.
Usage: Useful for text classification, information retrieval, and basic NLP tasks.
Example:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["Dr. Strange loves pav bhaji of Mumbai.", "Hulk loves chat of Delhi."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Representation:\n", X.toarray())
Output:
Vocabulary: {'dr': 3, 'strange': 9, 'loves': 5, 'pav': 8, 'bhaji': 0, 'of': 7, 'mumbai': 6, 'hulk': 4, 'chat': 1, 'delhi': 2}
BoW Representation:
[[1 0 0 1 0 1 1 1 1 1]
[0 1 1 0 1 1 0 1 0 0]]
Each sentence is encoded by counting occurrences of every word in the corpus vocabulary; the columns follow each word's alphabetical index (note that the default tokenizer keeps "dr" but drops single-character tokens and punctuation).
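To make the count matrix easier to read, the learned vocabulary can be paired with each row. A minimal sketch continuing the code above (assumes scikit-learn 1.x, where get_feature_names_out() is available):
feature_names = vectorizer.get_feature_names_out()
# Print each document as a word -> count mapping instead of a bare array
for row in X.toarray():
    print({word: int(count) for word, count in zip(feature_names, row)})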
- Word2Vec
Concept: Word2Vec is a neural network-based word embedding technique that represents words in vector space based on their context in a large corpus.
Usage: Captures semantic relationships between words; useful for similarity, clustering, and other ML tasks.
Example (using the gensim library):
from gensim.models import Word2Vec
sentences = [["dr", "strange", "loves", "pav", "bhaji", "of", "mumbai"],
["hulk", "loves", "chat", "of", "delhi"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, window=3)
print("Vector for 'strange':", model.wv['strange'])
print("Similarity between 'loves' and 'bhaji':", model.wv.similarity('loves', 'bhaji'))
Output (values vary between runs, since training starts from a random initialization):
Vector for 'strange': [0.123, -0.34, ..., 0.87] # A vector of 50 dimensions
Similarity between 'loves' and 'bhaji': 0.35 # Cosine similarity score
Word2Vec helps capture context-based meaning, allowing calculations of similarity between words.
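Beyond a single similarity score, the trained model can rank nearest neighbours, and its word vectors can be averaged into a crude sentence representation. A small sketch continuing the example above (with such a tiny corpus the numbers are essentially noise):
import numpy as np
# Words whose vectors lie closest to 'loves'
print(model.wv.most_similar('loves', topn=3))
# Average the word vectors of the first sentence into one 50-dimensional vector
sentence_vec = np.mean([model.wv[w] for w in sentences[0]], axis=0)
print("Averaged sentence vector shape:", sentence_vec.shape)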
- n-grams
Concept: n-grams are sequences of n consecutive words in a text. They can capture context and are commonly used in language models.
Usage: Useful for phrase detection, context analysis, and language modeling.
Example:
from nltk import ngrams
text = "Dr. Strange loves pav bhaji of Mumbai"
bigrams = list(ngrams(text.split(), 2))
trigrams = list(ngrams(text.split(), 3))
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
Output:
Bigrams: [('Dr.', 'Strange'), ('Strange', 'loves'), ('loves', 'pav'), ('pav', 'bhaji'), ('bhaji', 'of'), ('of', 'Mumbai')]
Trigrams: [('Dr.', 'Strange', 'loves'), ('Strange', 'loves', 'pav'), ('loves', 'pav', 'bhaji'), ('pav', 'bhaji', 'of'), ('bhaji', 'of', 'Mumbai')]
n-grams capture consecutive word sequences, which can be useful for identifying common phrases or linguistic patterns.
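n-grams can also be plugged straight into the BoW pipeline shown earlier: CountVectorizer's ngram_range parameter counts unigrams and bigrams in a single pass. A minimal sketch, assuming scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(1, 2) produces both single-word and two-word features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X2 = bigram_vectorizer.fit_transform(["Dr. Strange loves pav bhaji of Mumbai."])
print(bigram_vectorizer.get_feature_names_out())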
- Stop Words Removal
Concept: Stop words are common words like "and," "the," and "is" that often do not add meaningful value to text processing.
Usage: Reduces noise in the data, helpful for text classification and information retrieval.
Example:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
text = "Dr. Strange loves pav bhaji of Mumbai"
filtered_words = [word for word in text.split() if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Output:
Filtered Words: ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'Mumbai']
Stop words are removed to simplify the text and focus on meaningful words.
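Stop-word removal can also be folded into the vectorizers used earlier rather than done as a separate pass; scikit-learn's CountVectorizer and TfidfVectorizer accept stop_words='english'. A small sketch:
from sklearn.feature_extraction.text import CountVectorizer
# The built-in English stop-word list drops words like "of" automatically
vectorizer_sw = CountVectorizer(stop_words='english')
X_sw = vectorizer_sw.fit_transform(["Dr. Strange loves pav bhaji of Mumbai."])
print(vectorizer_sw.get_feature_names_out())  # 'of' no longer appears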
- Word Embeddings (General)
Concept: Word embeddings are vector representations of words that capture their meanings based on context. Unlike BoW, embeddings preserve semantic relationships.
Usage: Used in advanced NLP tasks like sentiment analysis, topic modeling, and machine translation.
Example: Word embeddings can be loaded from pre-trained models like GloVe or fastText.
# Example using pre-trained GloVe embeddings
from gensim.models import KeyedVectors
# Load pre-trained embeddings (download glove.6B.50d.txt beforehand).
# GloVe files have no word2vec header line, so pass no_header=True (available in gensim >= 4.0).
glove_model = KeyedVectors.load_word2vec_format("glove.6B.50d.txt", binary=False, no_header=True)
# GloVe 6B tokens are lowercased, so look up 'mumbai' rather than 'Mumbai'.
print("Vector for 'mumbai':", glove_model['mumbai'])
# Words missing from the pre-trained vocabulary raise a KeyError.
print("Similarity between 'pav' and 'bhaji':", glove_model.similarity('pav', 'bhaji'))
If the embeddings file is available, this code loads the GloVe vectors and supports operations such as similarity lookups.
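If downloading and converting the raw GloVe file is inconvenient, gensim also ships a downloader API that fetches pre-trained vectors directly; the model name below ("glove-wiki-gigaword-50") is one of the entries in gensim's downloader catalogue. A sketch assuming an internet connection:
import gensim.downloader as api
# The vectors are downloaded and cached locally on first use
glove = api.load("glove-wiki-gigaword-50")
print(glove.most_similar("mumbai", topn=3))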
- Tokenization and Sentence Splitting
Concept: Splitting text into sentences or words.
Usage: Essential for preprocessing in almost any NLP task.
Example:
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')  # tokenizer models required by sent_tokenize and word_tokenize
text = "Dr. Strange loves pav bhaji of Mumbai. Hulk loves chat of Delhi."
sentences = sent_tokenize(text)
words = word_tokenize(text)
print("Sentences:", sentences)
print("Words:", words)
Output:
Sentences: ['Dr. Strange loves pav bhaji of Mumbai.', 'Hulk loves chat of Delhi.']
Words: ['Dr.', 'Strange', 'loves', 'pav', 'bhaji', 'of', 'Mumbai', '.', 'Hulk', 'loves', 'chat', 'of', 'Delhi', '.']
Helps break down text into manageable parts for analysis.
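The two tokenizers are often combined, splitting into sentences first and then into words, so that sentence boundaries are preserved. A short sketch continuing the example above:
# Tokenize each sentence separately to keep sentence boundaries
tokens_per_sentence = [word_tokenize(s) for s in sentences]
print(tokens_per_sentence)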
- TF-IDF (Term Frequency-Inverse Document Frequency)
Concept: TF-IDF is a weighted BoW approach that emphasizes words unique to a document within a collection.
Usage: Helps identify important words in text, useful for information retrieval.
Example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["Dr. Strange loves pav bhaji of Mumbai.", "Hulk loves chat of Delhi."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print("TF-IDF Representation:\n", X.toarray())
Output (values rounded; columns follow the same alphabetical vocabulary order as in the BoW example):
TF-IDF Representation:
[[0.408 0.    0.    0.408 0.    0.29  0.408 0.29  0.408 0.408]
[0.    0.499 0.499 0.    0.499 0.355 0.    0.355 0.    0.   ]]
Highlights unique terms in each document.
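To see which terms each document's weight is concentrated on, the scores can be paired with the learned feature names, mirroring the BoW sketch earlier (assumes scikit-learn 1.x):
# Rank each document's terms by TF-IDF weight, highest first
feature_names = vectorizer.get_feature_names_out()
for i, row in enumerate(X.toarray()):
    ranked = sorted(zip(feature_names, row), key=lambda pair: -pair[1])
    print("Document", i, ":", [(word, round(float(score), 3)) for word, score in ranked if score > 0])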
- Text Normalization (Lowercasing and Lemmatization)
Concept: Converts text to a normalized form to reduce variability in the text.
Usage: Helps simplify text, especially for ML models.
Example:
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ["loves", "loving", "loved"]
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
print("Lemmatized Words:", lemmas)
Output:
Lemmatized Words: ['love', 'love', 'love']
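The lowercasing half of normalization is simpler and is usually applied before lemmatization; a short sketch reusing the lemmatizer above:
# Lowercase first so "Loves" and "loves" collapse to the same lemma
text = "Dr. Strange Loves Pav Bhaji"
normalized = [lemmatizer.lemmatize(word.lower(), pos='v') for word in text.split()]
print("Normalized:", normalized)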