Regular expressions (regex) are widely used in machine learning for text preprocessing and feature extraction. Below are eight practical applications of Python's re.sub() (substitution) function in machine learning, along with examples and expected outputs:
Text Cleaning:
Example: Removing special characters, punctuation, or unwanted symbols from text data.
Output: Cleaned text data with undesired characters replaced or removed.
import re
text = "Hello, this is an example!!!"
cleaned_text = re.sub(r'[^\w\s]', '', text)
print("Cleaned Text:", cleaned_text)
Output: Cleaned Text: Hello this is an example
Normalization:
Example: Replacing multiple spaces with a single space to normalize text.
Output: Text with multiple spaces replaced by a single space.
import re
text = "This   is  a    sample   text."
normalized_text = re.sub(r'\s+', ' ', text)
print("Normalized Text:", normalized_text)
Output: Normalized Text: This is a sample text.
Entity Recognition:
Example: Masking or replacing entity names (e.g., person names, locations) in text for privacy or anonymization.
Output: Text with entities substituted or masked.
import re
text = "John Doe visited New York City."
anonymized_text = re.sub(r'\bJohn Doe\b', 'Anonymous', text)
print("Anonymized Text:", anonymized_text)
Output: Anonymized Text: Anonymous visited New York City.
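The same substitution can cover several names at once. The following is a minimal sketch, not a full entity recognizer: the list names_to_mask and the "[PERSON]" placeholder are hypothetical, and in practice the names might come from an NER model. re.escape() is used so that any regex metacharacters inside a name are treated literally.

```python
import re

# Hypothetical list of names to mask (an assumption for illustration).
names_to_mask = ["John Doe", "Jane Roe"]

# Build one alternation pattern; re.escape guards against metacharacters in names.
name_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(name) for name in names_to_mask) + r')\b'
)

def mask_names(text, placeholder="[PERSON]"):
    """Replace every listed name in text with the placeholder."""
    return name_pattern.sub(placeholder, text)

print(mask_names("John Doe met Jane Roe in New York City."))
# Prints: [PERSON] met [PERSON] in New York City.
```

Compiling the pattern once is convenient when the same list of names is applied to many documents.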
Text Augmentation:
Example: Generating additional training data by replacing synonyms or similar words.
Output: Augmented text data with word substitutions.
import re
import random
text = "The quick brown fox jumps over the lazy dog."
synonyms = {"quick": ["fast", "swift", "speedy"]}
word_to_replace = random.choice(list(synonyms.keys()))
augmented_text = re.sub(r'\b{}\b'.format(word_to_replace), random.choice(synonyms[word_to_replace]), text)
print("Augmented Text:", augmented_text)
Output: (Example output varies based on the randomly chosen synonym)
Data Masking for Privacy:
Example: Masking sensitive information such as credit card numbers or social security numbers in text data.
Output: Text with sensitive information replaced or masked.
import re
text = "My credit card number is 1234-5678-9012-3456."
masked_text = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', 'XXXX-XXXX-XXXX-XXXX', text)
print("Masked Text:", masked_text)
Output: Masked Text: My credit card number is XXXX-XXXX-XXXX-XXXX.
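A common variation keeps the last four digits visible. The sketch below assumes dash-separated 16-digit numbers as in the example above; the function name mask_card_numbers is hypothetical. A capture group holds the final block of digits so that the replacement string can reuse it via \1.

```python
import re

# Three "dddd-" blocks are matched but not captured; the last block is captured.
card_pattern = re.compile(r'\b(?:\d{4}-){3}(\d{4})\b')

def mask_card_numbers(text):
    """Mask all but the last four digits of dash-separated 16-digit numbers."""
    return card_pattern.sub(r'XXXX-XXXX-XXXX-\1', text)

print(mask_card_numbers("My credit card number is 1234-5678-9012-3456."))
# Prints: My credit card number is XXXX-XXXX-XXXX-3456.
```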
Removing HTML Tags:
Example: Stripping HTML tags from web data before natural language processing.
Output: Text without HTML tags.
import re
html_text = "<p>This is <b>HTML</b> text.</p>"
clean_text = re.sub(r'<.*?>', '', html_text)
print("Cleaned Text:", clean_text)
Output: Cleaned Text: This is HTML text.
Text Translation:
Example: Replacing text in one language with text in another language for translation tasks.
Output: Text translated to the target language.
import re
text = "Hello, how are you?"
translation = re.sub(r'Hello', 'Bonjour', text)
print("Translation:", translation)
Output: Translation: Bonjour, how are you?
Feature Engineering:
Example: Extracting specific patterns or entities from text as features for machine learning models.
Output: Extracted features or patterns.
import re
text = "Email me at john@example.com or call me at +1 (123) 456-7890."
email = re.sub(r'\S+@\S+', 'EMAIL', text)
phone = re.sub(r'\+\d{1,3}\s?\(\d{1,4}\)\s?\d{1,4}-\d{1,4}', 'PHONE', email)
print("Processed Text:", phone)
Output: Processed Text: Email me at EMAIL or call me at PHONE.
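If the number of replaced patterns is itself useful as a feature, re.subn() returns both the new string and the substitution count. This is a small sketch reusing the patterns from the example above; the feature names n_emails and n_phones are assumptions chosen for illustration.

```python
import re

text = "Email me at john@example.com or call me at +1 (123) 456-7890."

# re.subn returns a tuple: (new_string, number_of_substitutions_made)
text, email_count = re.subn(r'\S+@\S+', 'EMAIL', text)
text, phone_count = re.subn(r'\+\d{1,3}\s?\(\d{1,4}\)\s?\d{1,4}-\d{1,4}', 'PHONE', text)

# Hypothetical count features derived from the substitutions
features = {"n_emails": email_count, "n_phones": phone_count}

print(text)      # Email me at EMAIL or call me at PHONE.
print(features)  # {'n_emails': 1, 'n_phones': 1}
```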
These examples illustrate how the sub regex function can be applied in various machine learning tasks for text preprocessing, data cleaning, and feature extraction. The function is versatile and can be customized to suit specific needs depending on the task at hand.
==============================================
IMPORTANT QUESTION/ASSIGNMENT
Write a Python program to replace all occurrences of a space, comma, or dot with a colon, using the regex function.
import re
# Function to replace spaces, commas, and dots with colons using regex
def replace_spaces_commas_dots_with_colons_regex(input_string):
    # Define a regex pattern to match spaces, commas, and dots
    pattern = r'[ ,.]'
    # Use re.sub() to replace the matches with colons
    result_string = re.sub(pattern, ':', input_string)
    return result_string
# Input string
input_string = "This is a sample, string. It has spaces, commas and dots."
# Replace spaces, commas, and dots with colons using regex
modified_string = replace_spaces_commas_dots_with_colons_regex(input_string)
# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)
==========or================
# Function to replace spaces, commas, and dots with colons
def replace_spaces_commas_dots_with_colons(input_string):
    # Replace space, comma, and dot with colon using str.replace()
    result_string = input_string.replace(' ', ':').replace(',', ':').replace('.', ':')
    return result_string
# Input string
input_string = "This is a sample, string. It has spaces, commas and dots."
# Replace spaces, commas, and dots with colons
modified_string = replace_spaces_commas_dots_with_colons(input_string)
# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)
Write a Python program to find all words starting with 'a' or 'e' in a given string using regex, then replace them with a constant using the regex function.
import re
# Function to replace words starting with 'a' or 'e' with a constant using regex
def replace_words_starting_with_a_or_e(input_string, constant):
    # Define a regex pattern to match words starting with 'a' or 'e' (either case)
    pattern = r'\b[aeAE]\w*\b'
    # Use re.sub() to replace the matches with the constant
    result_string = re.sub(pattern, constant, input_string)
    return result_string
# Input string
input_string = "Apples are awesome, and elephants are enormous."
# Constant to replace matching words
replacement_constant = "WORD"
# Replace words starting with 'a' or 'e' with the constant using regex
modified_string = replace_words_starting_with_a_or_e(input_string, replacement_constant)
# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)
Write a Python program to find all words starting with 'a' or 'e' in a given string using regex, then remove them by replacing them with an empty-string constant using the regex function.
import re
# Function to remove words starting with 'a' or 'e' by replacing them with a constant using regex
def remove_words_starting_with_a_or_e(input_string, constant):
    # Define a regex pattern to match words starting with 'a' or 'e' (either case)
    pattern = r'\b[aeAE]\w*\b'
    # Use re.sub() to replace the matches with the constant (an empty string effectively removes them)
    result_string = re.sub(pattern, constant, input_string)
    return result_string
# Input string
input_string = "Apples are awesome, and elephants are enormous."
# Constant to replace and remove matching words
removal_constant = ""
# Remove words starting with 'a' or 'e' by replacing them with the constant using regex
modified_string = remove_words_starting_with_a_or_e(input_string, removal_constant)
# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)
Create a function in Python to remove the parentheses in a list of strings. The use of the re.compile() method is mandatory.
Sample Text: ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]
Expected Output:
example.com
hr@fliprobo.com
github.com
Hello Data Science World
Data Scientist
import re
# List of strings with paragraphs and parentheses
strings_with_parentheses = [
    "(apple)\nThis is a paragraph with (some text) inside.",
    "(banana)\nAnother paragraph with (more text).",
    "(orange)\nYet another paragraph (with content)."
]
# Compile the pattern once, since the task requires re.compile()
parentheses_pattern = re.compile(r'[()]')
# Use the compiled pattern's sub() to remove parentheses in each string
strings_without_parentheses = [parentheses_pattern.sub('', s) for s in strings_with_parentheses]
# Display the result
for i, string in enumerate(strings_without_parentheses, start=1):
    print(f"String {i}:\n{string}\n")
Output:
String 1:
apple
This is a paragraph with some text inside.
String 2:
banana
Another paragraph with more text.
String 3:
orange
Yet another paragraph with content.
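The example above uses its own strings rather than the assignment's sample text. A possible sketch for the sample text follows; the two-step approach and the function name remove_parentheses are assumptions, chosen because the expected output joins "(.com)" to the preceding word ("example.com") but keeps the space in "Hello Data Science World".

```python
import re

# Step 1: a parenthesized suffix that starts with a dot, such as " (.com)",
# is attached directly to the preceding word.
dot_suffix = re.compile(r'\s*\((\.[^)]*)\)')
# Step 2: any remaining parentheses are dropped, keeping their contents.
parens = re.compile(r'[()]')

def remove_parentheses(strings):
    """Return a new list of strings with the parentheses removed."""
    return [parens.sub('', dot_suffix.sub(r'\1', s)) for s in strings]

sample = ["example (.com)", "hr@fliprobo (.com)", "github (.com)",
          "Hello (Data Science World)", "Data (Scientist)"]

for line in remove_parentheses(sample):
    print(line)
```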
Question 6- Write a python program to remove the parenthesis area from the text stored in the text file using Regular Expression.
Sample Text: ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]
Expected Output: ["example", "hr@fliprobo", "github", "Hello", "Data"]
Note: Store the given sample text in a text file, then remove the parenthesized portions from that text.
import re
# Function to remove text within parentheses using regex
def remove_text_in_parentheses(input_text):
    # Match a parenthesized group together with any whitespace before it,
    # so "example (.com)" becomes "example" with no trailing space
    pattern = re.compile(r'\s*\([^)]*\)')
    # Use the compiled pattern's sub() to replace matches with an empty string
    result_text = pattern.sub('', input_text)
    return result_text
# Read the content of the input file
input_file_name = 'input.txt'
output_file_name = 'output.txt'
try:
    with open(input_file_name, 'r') as input_file:
        input_text = input_file.read()
except FileNotFoundError:
    print(f"Error: The input file '{input_file_name}' does not exist.")
    raise SystemExit(1)
# Remove text within parentheses
modified_text = remove_text_in_parentheses(input_text)
# Write the modified text to the output file
with open(output_file_name, 'w') as output_file:
    output_file.write(modified_text)
print(f"Text with parentheses removed has been saved to '{output_file_name}'.")
Create a function in python to insert spaces between words starting with numbers.
Sample Text: "RegularExpression1IsAn2ImportantTopic3InPython"
Expected Output: RegularExpression 1IsAn 2ImportantTopic 3InPython
import re
# Example input text
input_text = "Replace spaces between 123 numbers 456 starting 789 with 987."
# Define the regular expression pattern to match spaces between words starting with numbers
pattern = re.compile(r'(\b\d+)\s+(\w+)\b')
# Use re.sub to replace matched spaces with a specific separator (e.g., '-')
result_text = re.sub(pattern, r'\1-\2', input_text)
# Output
print(result_text)
Output: Replace spaces between 123-numbers 456-starting 789-with 987.
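The example above replaces the space after a number with a hyphen, which is a different task from the one stated. For the sample text, one possible sketch uses lookarounds to find the zero-width position between a letter and a digit and inserts a space there. The function name space_before_numbers is hypothetical.

```python
import re

# Zero-width match: a position preceded by a letter and followed by a digit.
boundary = re.compile(r'(?<=[A-Za-z])(?=\d)')

def space_before_numbers(text):
    """Insert a space before each number that directly follows a letter."""
    return boundary.sub(' ', text)

print(space_before_numbers("RegularExpression1IsAn2ImportantTopic3InPython"))
# Prints: RegularExpression 1IsAn 2ImportantTopic 3InPython
```

Because the lookarounds consume no characters, nothing from the original string is lost; only the space is added.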
Create a function in python to insert spaces between words starting with capital letters or with numbers.
Sample Text: "RegularExpression1IsAn2ImportantTopic3InPython"
Expected Output: RegularExpression 1 IsAn 2 ImportantTopic 3 InPython
import re
# Function to insert spaces around numbers so the sample yields the expected output
def insert_spaces_between_capital_and_numbers(text):
    # Match the zero-width boundary between a digit and a non-digit, in either order.
    # Splitting at every capital letter would also break "RegularExpression" apart,
    # which the expected output does not do.
    pattern = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')
    # Use the compiled pattern's sub() to insert a space at each boundary
    result_text = pattern.sub(' ', text)
    return result_text
# Sample text
sample_text = "RegularExpression1IsAn2ImportantTopic3InPython"
# Insert spaces around the numbers in the sample text
modified_text = insert_spaces_between_capital_and_numbers(sample_text)
# Print the modified text
print("Original Text:")
print(sample_text)
print("\nModified Text:")
print(modified_text)
Write a Python program to remove leading zeros from an IP address using the regex function.
import re
# Example IP address with leading zeros
ip_address = "192.010.001.001"
# Define the regular expression pattern to remove leading zeros from each octet
pattern = re.compile(r'\b0+(\d+)\b')
# Use re.sub to remove leading zeros
result_ip = re.sub(pattern, r'\1', ip_address)
# Output
print(result_ip)
Output:
192.10.1.1
Write a Python program to convert a date from yyyy-mm-dd format to dd-mm-yyyy format using the regex function.
import re
# Function to convert yyyy-mm-dd to dd-mm-yyyy format using regex
def convert_date_format(input_date):
    # Define a regex pattern to match yyyy-mm-dd format
    pattern = r'(\d{4})-(\d{2})-(\d{2})'
    # Use re.sub() to replace the matched date with dd-mm-yyyy format
    converted_date = re.sub(pattern, r'\3-\2-\1', input_date)
    return converted_date
# Input date in yyyy-mm-dd format
input_date = "2023-09-18"
# Convert the date format
converted_date = convert_date_format(input_date)
# Print the converted date
print("Original Date (yyyy-mm-dd):", input_date)
print("Converted Date (dd-mm-yyyy):", converted_date)
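The numbered backreferences \3-\2-\1 can be hard to read at a glance. As an optional variation, the sketch below uses named groups, which are referenced in the replacement as \g<name>; the function name convert_dates and the sample sentence are assumptions for illustration.

```python
import re

# Named groups make the replacement string self-describing.
date_pattern = re.compile(r'\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b')

def convert_dates(text):
    """Rewrite every yyyy-mm-dd date found in text as dd-mm-yyyy."""
    return date_pattern.sub(r'\g<day>-\g<month>-\g<year>', text)

print(convert_dates("Released 2023-09-18, patched 2023-10-02."))
# Prints: Released 18-09-2023, patched 02-10-2023.
```

Note that the pattern only checks the shape of the date, not whether the month and day values are valid.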
Create a function in python to insert spaces between words starting with capital letters.
import re
# Function to insert spaces between words starting with capital letters
def insert_spaces_between_capital_words(input_string):
    # Define a regex pattern to find words starting with capital letters
    pattern = re.compile(r'([A-Z][a-z]*)')
    # Use re.sub() to insert spaces between such words
    spaced_string = re.sub(pattern, r' \1', input_string)
    # Remove leading space if present
    spaced_string = spaced_string.lstrip()
    return spaced_string
# Test string
test_string = "ThisIsCamelCaseTextInPython"
# Insert spaces between words starting with capital letters
spaced_text = insert_spaces_between_capital_words(test_string)
# Print the spaced text
print("Original String:")
print(test_string)
print("\nSpaced String:")
print(spaced_text)
Write a Python program to remove continuous duplicate words from Sentence using Regular Expression.
import re
# Function to remove continuous duplicate words from a sentence
def remove_continuous_duplicates(sentence):
    # Define a regex pattern to match continuous duplicate words
    pattern = r'\b(\w+)(?:\s+\1)+\b'
    # Use re.sub() to remove continuous duplicates
    cleaned_sentence = re.sub(pattern, r'\1', sentence)
    return cleaned_sentence
# Test sentence
test_sentence = "This is is a test test sentence with duplicate duplicate words."
# Remove continuous duplicate words from the sentence
cleaned_sentence = remove_continuous_duplicates(test_sentence)
# Print the cleaned sentence
print("Original Sentence:")
print(test_sentence)
print("\nCleaned Sentence:")
print(cleaned_sentence)
Create a function in python to remove all words from a string of length between 2 and 4.
The use of the re.compile() method is mandatory.
Sample Text: "The following example creates an ArrayList with a capacity of 50 elements. 4 elements are then added to the ArrayList and the ArrayList is trimmed accordingly."
Expected Output: following example creates ArrayList a capacity elements. 4 elements added ArrayList ArrayList trimmed accordingly.
import re
# Example input text
input_text = "Replace short words in this text with a longer version."
# Replacement text
replacement_text = "REPLACEMENT"
# Define the regular expression pattern to match words of length 2 to 4
pattern = re.compile(r'\b\w{2,4}\b')
# Use re.sub to replace matched words with the specified replacement
result_text = re.sub(pattern, replacement_text, input_text)
# Output
print(result_text)
Output: Replace short words REPLACEMENT REPLACEMENT REPLACEMENT REPLACEMENT a longer version.
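The example above substitutes a placeholder rather than removing the words. For the assignment's sample text, a possible sketch removes the short words with a compiled pattern and then collapses the leftover whitespace. The function name remove_short_words and the second whitespace pattern are assumptions. Since \w also matches digits, "50" is removed while the single character "4" is kept, which agrees with the expected output.

```python
import re

# Words (letters, digits, underscore) that are 2 to 4 characters long
short_word = re.compile(r'\b\w{2,4}\b')
# Runs of whitespace left behind after the removal
extra_space = re.compile(r'\s+')

def remove_short_words(text):
    """Remove words of length 2 to 4, then tidy up the spacing."""
    without_short = short_word.sub('', text)
    return extra_space.sub(' ', without_short).strip()

sample_text = ("The following example creates an ArrayList with a capacity of 50 "
               "elements. 4 elements are then added to the ArrayList and the "
               "ArrayList is trimmed accordingly.")

print(remove_short_words(sample_text))
```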
Write a Python program using regex to remove <U+..> like symbols.
In the sample text below, strange symbols of the form <U+..> appear all over the place. You need to come up with a general regex that will cover all such symbols.
Sample Text: "@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who are protesting #demonetization are all different party leaders"
import re
# Sample text
sample_text = "@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who are protesting #demonetization are all different party leaders"
# Define a regex pattern to match <U+..> symbols
pattern = r'<U\+[0-9A-Fa-f]+>'
# Use re.sub() to remove the matched symbols
cleaned_text = re.sub(pattern, '', sample_text)
# Print the cleaned text
print("Original Text:")
print(sample_text)
print("\nCleaned Text:")
print(cleaned_text)
Find vowel characters and replace them with a constant using the Python regex function
Store the input text in a variable
Store the replacement character in a variable
Call the function and store the result in a variable
Print the input text and the stored result
Now define the function
Define a regular expression pattern to match vowels using re.compile()
Use re.sub() to replace vowels with the specified character, store the result in a variable, then return that variable
Common Mistakes
Do not put a semicolon after print()
Do not use {} braces for the function body after def
Put a colon at the end of the def line, e.g. def name(args):
Put quotes around the pattern passed to re.compile(), e.g. re.compile('[AEIOUaeiou]')
import re
def replace_vowels(input_string, replacement_char):
    # Define a regular expression pattern to match vowels
    vowel_pattern = re.compile('[aeiouAEIOU]')
    # Use re.sub to replace vowels with the specified character
    result_string = re.sub(vowel_pattern, replacement_char, input_string)
    return result_string
# Example usage
input_text = "Hello, World! This is a sample string with vowels."
replacement_char = '*'
result_text = replace_vowels(input_text, replacement_char)
# Print the original and modified strings
print("Original Text:", input_text)
print("Modified Text:", result_text)
Output
Original Text: Hello, World! This is a sample string with vowels.
Modified Text: H*ll*, W*rld! Th*s *s * s*mpl* str*ng w*th v*w*ls.
Summary
Text Cleaning: Removing special characters, punctuation, or unwanted symbols from text data
Normalization: Replacing multiple spaces with a single space
Entity Recognition: Masking or replacing entity names
Text Augmentation: Generating additional training data by replacing synonyms or similar words
Data Masking: Masking sensitive information such as credit card numbers or social security numbers in text data
Feature Engineering: Replacing specific patterns such as an email address (john@example.com) or a phone number (+1 (123) 456-7890) with placeholder tokens
replace all occurrences of a space, comma, or dot with a colon
find all words starting with 'a' or 'e' in a given string using regex, then replace them with a constant
remove the parenthesis in a list of strings
insert spaces between words starting with numbers
insert spaces between words starting with capital letters or with numbers
remove leading zeros from an IP address
convert a date of yyyy-mm-dd format to dd-mm-yyyy format
remove all words from a string of length between 2 and 4.
Answer
re.sub(r'[^\w\s]', '', text)
normalized_text = re.sub(r'\s+', ' ', text)
anonymized_text = re.sub(r'\bJohn Doe\b', 'Anonymous', text)
re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', 'XXXX-XXXX-XXXX-XXXX', text)
email = re.sub(r'\S+@\S+', 'EMAIL', text)
phone = re.sub(r'\+\d{1,3}\s?\(\d{1,4}\)\s?\d{1,4}-\d{1,4}', 'PHONE', email)
pattern = r'[ ,.]'
result_string = re.sub(pattern, ':', input_string)
=================or======================
result_string = input_string.replace(' ', ':').replace(',', ':').replace('.', ':')
pattern = r'\b[aeAE]\w*\b'
result_string = re.sub(pattern, constant, input_string)
strings_without_parentheses = [re.compile(r'[()]').sub('', s) for s in strings_with_parentheses]
pattern = re.compile(r'\b(\d\w+)\b')
pattern = re.compile(r'(\b\d+)\s+(\w+)\b')
result_text = re.sub(pattern, r'\1-\2', input_text)
pattern = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')
result_text = pattern.sub(' ', text)
pattern = re.compile(r'\b0+(\d+)\b')
result_ip = re.sub(pattern, r'\1', ip_address)
pattern = r'(\d{4})-(\d{2})-(\d{2})'
converted_date = re.sub(pattern, r'\3-\2-\1', input_date)
pattern = re.compile(r'\b\w{2,4}\b')
result_text = re.sub(pattern, replacement_text, input_text)