Debug School

rakesh kumar


Explain real-time applications of the sub regex function in ML

Regular expressions (regex) are widely used in machine learning for text preprocessing and feature extraction. Below are eight practical applications of Python's re.sub() (substitution) function in machine learning, each with an example and its expected output:

Text Cleaning:

Example: Removing special characters, punctuation, or unwanted symbols from text data.
Output: Cleaned text data with undesired characters replaced or removed.

import re

text = "Hello, this is an example!!!"
cleaned_text = re.sub(r'[^\w\s]', '', text)
print("Cleaned Text:", cleaned_text)
Output: Cleaned Text: Hello this is an example
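
One caveat worth checking with this pattern: \w also matches the underscore, so underscores survive the cleaning step. The sketch below uses a made-up sample string to show the difference and one possible way to drop underscores as well; whether underscores should go depends on the dataset.

```python
import re

# Hypothetical sample string chosen to contain an underscore
text = "Hello, snake_case example!!!"

# \w matches letters, digits, and "_", so the underscore is kept here
keeps_underscore = re.sub(r'[^\w\s]', '', text)

# Adding "_" as an alternative removes underscores as well
drops_underscore = re.sub(r'[^\w\s]|_', '', text)

print(keeps_underscore)  # Hello snake_case example
print(drops_underscore)  # Hello snakecase example
```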

Normalization:

Example: Replacing multiple spaces with a single space to normalize text.
Output: Text with multiple spaces replaced by a single space.

import re

text = "This      is     a    sample  text."
normalized_text = re.sub(r'\s+', ' ', text)
print("Normalized Text:", normalized_text)
Output: Normalized Text: This is a sample text.
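
Because \s also matches tabs and newlines, the same substitution can flatten multi-line text. A minimal sketch, assuming leading and trailing whitespace should be trimmed too (the sample string is made up for illustration):

```python
import re

# Hypothetical input with tabs, newlines, and padding at both ends
text = "  This\tis \n a   sample  text.  "

# Collapse every whitespace run to one space, then trim the ends
normalized_text = re.sub(r'\s+', ' ', text).strip()

print("Normalized Text:", normalized_text)  # Normalized Text: This is a sample text.
```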


Entity Recognition:

Example: Masking or replacing entity names (e.g., person names, locations) in text for privacy or anonymization.
Output: Text with entities substituted or masked.

import re

text = "John Doe visited New York City."
anonymized_text = re.sub(r'\bJohn Doe\b', 'Anonymous', text)
print("Anonymized Text:", anonymized_text)
Output: Anonymized Text: Anonymous visited New York City.
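
In practice there is usually more than one name to mask. The sketch below assumes the names are already known (the list is hypothetical) and joins them into a single alternation, using re.escape() so that any special characters in a name are treated literally.

```python
import re

# Hypothetical list of names to mask; in a real pipeline this might come
# from a lookup table or an upstream NER step
names = ["John Doe", "Jane Smith"]

# Build one pattern of the form \b(?:John\ Doe|Jane\ Smith)\b
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(n) for n in names) + r')\b')

text = "John Doe met Jane Smith in New York City."
anonymized_text = pattern.sub('[PERSON]', text)

print("Anonymized Text:", anonymized_text)
# Anonymized Text: [PERSON] met [PERSON] in New York City.
```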

Text Augmentation:

Example: Generating additional training data by replacing synonyms or similar words.
Output: Augmented text data with word substitutions.

import re
import random

text = "The quick brown fox jumps over the lazy dog."
synonyms = {"quick": ["fast", "swift", "speedy"]}
word_to_replace = random.choice(list(synonyms.keys()))
augmented_text = re.sub(r'\b{}\b'.format(word_to_replace), random.choice(synonyms[word_to_replace]), text)
print("Augmented Text:", augmented_text)

Output: (Example output varies based on the randomly chosen synonym)
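
re.sub() also accepts a function as the replacement, which makes it possible to swap several words in one pass. A minimal sketch, assuming a small hand-written synonym table (the entries and the augment helper are illustrative, not from a real thesaurus):

```python
import random
import re

# Hypothetical synonym table
synonyms = {"quick": ["fast", "swift"], "lazy": ["idle", "sluggish"]}

# One pattern that matches any key of the table as a whole word
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, synonyms)) + r')\b')

def augment(text, rng):
    # The callback receives each match and returns its replacement
    return pattern.sub(lambda m: rng.choice(synonyms[m.group(1)]), text)

rng = random.Random(0)  # seeded so repeated runs give the same result
augmented_text = augment("The quick brown fox jumps over the lazy dog.", rng)
print("Augmented Text:", augmented_text)
```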


Data Masking for Privacy:

Example: Masking sensitive information such as credit card numbers or social security numbers in text data.
Output: Text with sensitive information replaced or masked.

import re

text = "My credit card number is 1234-5678-9012-3456."
masked_text = re.sub(r'\d{4}-\d{4}-\d{4}-\d{4}', 'XXXX-XXXX-XXXX-XXXX', text)
print("Masked Text:", masked_text)
Output: Masked Text: My credit card number is XXXX-XXXX-XXXX-XXXX.
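
A common variation keeps the last four digits visible. The sketch below assumes the same dash-separated 16-digit layout as the example above and uses a capture group with a backreference in the replacement.

```python
import re

text = "My credit card number is 1234-5678-9012-3456."

# Match three "dddd-" groups, then capture the final four digits
partially_masked = re.sub(r'\b(?:\d{4}-){3}(\d{4})\b', r'XXXX-XXXX-XXXX-\1', text)

print("Masked Text:", partially_masked)
# Masked Text: My credit card number is XXXX-XXXX-XXXX-3456.
```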


Removing HTML Tags:

Example: Stripping HTML tags from web data before natural language processing.
Output: Text without HTML tags.

import re

html_text = "<p>This is <b>HTML</b> text.</p>"
clean_text = re.sub(r'<.*?>', '', html_text)
print("Cleaned Text:", clean_text)
Output: Cleaned Text: This is HTML text.
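
Stripping tags still leaves HTML entities such as &amp; in the text. A minimal sketch, assuming simple markup like the example above, that removes the tags first and then decodes entities with the standard library's html.unescape(); for messy real-world pages a proper HTML parser is likely a safer choice than a regex.

```python
import html
import re

# Hypothetical snippet containing both tags and entities
html_text = "<p>Fish &amp; chips cost &lt; 10.</p>"

# Remove the tags first, while "<" and ">" still only appear in markup
no_tags = re.sub(r'<[^>]+>', '', html_text)

# Then turn entities such as &amp; and &lt; back into characters
clean_text = html.unescape(no_tags)

print("Cleaned Text:", clean_text)  # Cleaned Text: Fish & chips cost < 10.
```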

Text Translation:

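Example: Replacing a word or phrase in one language with its counterpart in another language. This is term-level substitution, a small building block for translation tasks rather than full translation.
Output: Text with the matched terms replaced by their target-language equivalents.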

import re

text = "Hello, how are you?"
translation = re.sub(r'Hello', 'Bonjour', text)
print("Translation:", translation)
Output: Translation: Bonjour, how are you?
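
With more than one term, a glossary lookup inside a replacement function avoids chaining many re.sub() calls. A minimal sketch, assuming a tiny hand-made glossary (the entries and the translate_terms helper are hypothetical):

```python
import re

# Hypothetical English-to-French glossary
glossary = {"Hello": "Bonjour", "Goodbye": "Au revoir"}

# Match any glossary key as a whole word
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, glossary)) + r')\b')

def translate_terms(text):
    # Look up each matched term in the glossary
    return pattern.sub(lambda m: glossary[m.group(0)], text)

translation = translate_terms("Hello, how are you? Goodbye!")
print("Translation:", translation)  # Translation: Bonjour, how are you? Au revoir!
```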

Feature Engineering:

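Example: Replacing specific patterns (such as email addresses and phone numbers) with placeholder tokens so that a model sees one consistent feature instead of many unique strings.
Output: Text with the matched patterns replaced by placeholder tokens.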

import re

text = "Email me at john@example.com or call me at +1 (123) 456-7890."
email = re.sub(r'\S+@\S+', 'EMAIL', text)
phone = re.sub(r'\+\d{1,3}\s?\(\d{1,4}\)\s?\d{1,4}-\d{1,4}', 'PHONE', email)
print("Processed Text:", phone)
Output: Processed Text: Email me at EMAIL or call me at PHONE.
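
If the number of replaced items is itself a useful feature, re.subn() returns it alongside the new string. A minimal sketch reusing the patterns above; the feature names in the dictionary are illustrative assumptions:

```python
import re

text = "Email me at john@example.com or call me at +1 (123) 456-7890."

# re.subn() returns a tuple: (new_string, number_of_substitutions)
text_no_email, n_emails = re.subn(r'\S+@\S+', 'EMAIL', text)
processed, n_phones = re.subn(
    r'\+\d{1,3}\s?\(\d{1,4}\)\s?\d{1,4}-\d{1,4}', 'PHONE', text_no_email
)

# Hypothetical feature dictionary built from the substitution counts
features = {"n_emails": n_emails, "n_phones": n_phones}

print("Processed Text:", processed)
print("Features:", features)
```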

These examples illustrate how the sub regex function can be applied in various machine learning tasks for text preprocessing, data cleaning, and feature extraction. The function is versatile and can be customized to suit specific needs depending on the task at hand.


==============================================

IMPORTANT QUESTION/ASSIGNMENT

Write a Python program to replace all occurrences of a space, comma, or dot with a colon using a regex function.

import re

# Function to replace spaces, commas, and dots with colons using regex
def replace_spaces_commas_dots_with_colons_regex(input_string):
    # Define a regex pattern to match spaces, commas, and dots
    pattern = r'[ ,.]'

    # Use re.sub() to replace the matches with colons
    result_string = re.sub(pattern, ':', input_string)

    return result_string

# Input string
input_string = "This is a sample, string. It has spaces, commas and dots."

# Replace spaces, commas, and dots with colons using regex
modified_string = replace_spaces_commas_dots_with_colons_regex(input_string)

# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)

==========or================

# Function to replace spaces, commas, and dots with colons

def replace_spaces_commas_dots_with_colons(input_string):
    # Replace space, comma, and dot with colon using str.replace()
    result_string = input_string.replace(' ', ':').replace(',', ':').replace('.', ':')

    return result_string

# Input string
input_string = "This is a sample, string. It has spaces, commas and dots."

# Replace spaces, commas, and dots with colons
modified_string = replace_spaces_commas_dots_with_colons(input_string)

# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)

Write a Python program to find all words starting with 'a' or 'e' in a given string and replace them with a constant using a regex function.

import re

# Function to replace words starting with 'a' or 'e' with a constant using regex
def replace_words_starting_with_a_or_e(input_string, constant):
    # Define a regex pattern to match words starting with 'a' or 'e'
    pattern = r'\b[aeAE]\w*\b'

    # Use re.sub() to replace the matches with the constant
    result_string = re.sub(pattern, constant, input_string)

    return result_string

# Input string
input_string = "Apples are awesome, and elephants are enormous."

# Constant to replace matching words
replacement_constant = "WORD"

# Replace words starting with 'a' or 'e' with the constant using regex
modified_string = replace_words_starting_with_a_or_e(input_string, replacement_constant)

# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)

Write a Python program to find all words starting with 'a' or 'e' in a given string and remove them by replacing each match with an empty-string constant using a regex function.

import re

# Function to remove words starting with 'a' or 'e' by replacing them with a constant using regex
def remove_words_starting_with_a_or_e(input_string, constant):
    # Define a regex pattern to match words starting with 'a' or 'e'
    pattern = r'\b[aeAE]\w*\b'

    # Use re.sub() to replace the matches with the constant (which effectively removes them)
    result_string = re.sub(pattern, constant, input_string)

    return result_string

# Input string
input_string = "Apples are awesome, and elephants are enormous."

# Constant to replace and remove matching words
removal_constant = ""

# Remove words starting with 'a' or 'e' by replacing them with the constant using regex
modified_string = remove_words_starting_with_a_or_e(input_string, removal_constant)

# Print the modified string
print("Original String:")
print(input_string)
print("\nModified String:")
print(modified_string)

Create a function in Python to remove the parentheses in a list of strings. The use of the re.compile() method is mandatory.

Sample Text: ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]
Expected Output:
example.com
hr@fliprobo.com
github.com
Hello Data Science World
Data Scientist

import re

# Function to remove parentheses (keeping their contents) in a list of strings using regex
def remove_parentheses(strings_list):
    # Match either a space that sits right before "(." (so "example (.com)"
    # becomes "example.com") or any single parenthesis character
    pattern = re.compile(r'\s(?=\(\.)|[()]')

    # Use the compiled pattern's sub() to delete the matches in each string
    result_list = [pattern.sub('', string) for string in strings_list]

    return result_list

# List of strings with parentheses (the sample text from the task)
strings_with_parentheses = [
    "example (.com)",
    "hr@fliprobo (.com)",
    "github (.com)",
    "Hello (Data Science World)",
    "Data (Scientist)"
]

# Remove parentheses from the list of strings using regex
strings_without_parentheses = remove_parentheses(strings_with_parentheses)

# Print the modified list of strings
print("Original List:")
for string in strings_with_parentheses:
    print(string)

print("\nList without Parentheses:")
for string in strings_without_parentheses:
    print(string)

Write a Python program to remove the parenthesized text from the contents of a text file using a regular expression.
Sample Text: ["example (.com)", "hr@fliprobo (.com)", "github (.com)", "Hello (Data Science World)", "Data (Scientist)"]
Expected Output: ["example", "hr@fliprobo", "github", "Hello", "Data"]
Note: Store the given sample text in a text file, then remove the parenthesized text from it.

import re

# Function to remove text within parentheses using regex
def remove_text_in_parentheses(input_text):
    # Define a regex pattern to match text within parentheses, plus any
    # whitespace right before it, so "example (.com)" becomes "example"
    pattern = re.compile(r'\s*\([^)]*\)')

    # Use the compiled pattern's sub() to replace matches with an empty string
    result_text = pattern.sub('', input_text)

    return result_text

# Read the content of the input file
input_file_name = 'input.txt'
output_file_name = 'output.txt'

try:
    with open(input_file_name, 'r') as input_file:
        input_text = input_file.read()
except FileNotFoundError:
    print(f"Error: The input file '{input_file_name}' does not exist.")
    raise SystemExit(1)

# Remove text within parentheses
modified_text = remove_text_in_parentheses(input_text)

# Write the modified text to the output file
with open(output_file_name, 'w') as output_file:
    output_file.write(modified_text)

print(f"Text with parentheses removed has been saved to '{output_file_name}'.")

Create a function in python to insert spaces between words starting with numbers.
Sample Text: "RegularExpression1IsAn2ImportantTopic3InPython"
Expected Output: RegularExpression 1IsAn 2ImportantTopic 3InPython

import re

# Function to insert spaces before words starting with numbers
def insert_spaces_before_numbers(text):
    # Match a letter immediately followed by a digit. \b cannot be used here:
    # letters and digits are both word characters, so there is no word
    # boundary between them in "Expression1IsAn"
    pattern = re.compile(r'([A-Za-z])(\d)')

    # Use sub() to put a space between the letter and the digit
    result_text = pattern.sub(r'\1 \2', text)

    return result_text

# Sample text
sample_text = "RegularExpression1IsAn2ImportantTopic3InPython"

# Insert spaces before words starting with numbers
modified_text = insert_spaces_before_numbers(sample_text)

# Print the modified text
print("Original Text:")
print(sample_text)
print("\nModified Text:")
print(modified_text)

Create a function in python to insert spaces between words starting with capital letters or with numbers.
Sample Text: "RegularExpression1IsAn2ImportantTopic3InPython"
Expected Output: RegularExpression 1 IsAn 2 ImportantTopic 3 InPython

import re

# Function to insert spaces before numbers and before capitalized words that follow a number
def insert_spaces_between_capital_and_numbers(text):
    # Match the empty position between a letter and a following digit, or
    # between a digit and a following capital letter. \b cannot be used here
    # because letters and digits are both word characters
    pattern = re.compile(r'(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Z])')

    # Use sub() to insert a space at each matched position
    result_text = pattern.sub(' ', text)

    return result_text

# Sample text
sample_text = "RegularExpression1IsAn2ImportantTopic3InPython"

# Insert spaces before words starting with capital letters or numbers
modified_text = insert_spaces_between_capital_and_numbers(sample_text)

# Print the modified text
print("Original Text:")
print(sample_text)
print("\nModified Text:")
print(modified_text)

Write a Python program to remove leading zeros from an IP address using a regex function.

import re

# Function to remove leading zeros from an IP address using regex
def remove_leading_zeros_from_ip(ip_address):
    # Define a regex pattern to match and remove leading zeros from each octet
    pattern = r'(\b|\.)0+(\d+)'

    # Use re.sub() to replace leading zeros with the matched digits
    cleaned_ip = re.sub(pattern, r'\1\2', ip_address)

    return cleaned_ip

# Test IP address with leading zeros
test_ip = "192.012.001.004"

# Remove leading zeros from the IP address
cleaned_ip = remove_leading_zeros_from_ip(test_ip)

# Print the cleaned IP address
print("Original IP Address:")
print(test_ip)
print("\nCleaned IP Address:")
print(cleaned_ip)

Output:

Original IP Address:
192.012.001.004

Cleaned IP Address:
192.12.1.4
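
An alternative that may be easier to read passes a function to re.sub() and lets int() do the zero-stripping for each octet. This is a sketch of a different technique from the pattern above, and the normalize_octets name is made up for illustration.

```python
import re

def normalize_octets(ip_address):
    # int() drops leading zeros and keeps an all-zero octet as "0"
    return re.sub(r'\d+', lambda m: str(int(m.group())), ip_address)

normalized_ip = normalize_octets("192.012.001.000")
print(normalized_ip)  # 192.12.1.0
```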

Write a Python program to convert a date from yyyy-mm-dd format to dd-mm-yyyy format using a regex function.

import re

# Function to convert yyyy-mm-dd to dd-mm-yyyy format using regex
def convert_date_format(input_date):
    # Define a regex pattern to match yyyy-mm-dd format
    pattern = r'(\d{4})-(\d{2})-(\d{2})'

    # Use re.sub() to replace the matched date with dd-mm-yyyy format
    converted_date = re.sub(pattern, r'\3-\2-\1', input_date)

    return converted_date

# Input date in yyyy-mm-dd format
input_date = "2023-09-18"

# Convert the date format
converted_date = convert_date_format(input_date)

# Print the converted date
print("Original Date (yyyy-mm-dd):", input_date)
print("Converted Date (dd-mm-yyyy):", converted_date)
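
Named groups can make the reordering easier to follow than \3-\2-\1, and since re.sub() replaces every match, the same call converts all dates in a longer text. A minimal sketch with a made-up sentence:

```python
import re

# Named groups document what each part of the match means
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# Hypothetical text containing two dates
text = "Released 2023-09-18, patched 2023-10-02."

# \g<name> refers to a named group in the replacement string
converted_text = date_pattern.sub(r'\g<day>-\g<month>-\g<year>', text)

print(converted_text)  # Released 18-09-2023, patched 02-10-2023.
```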

Create a function in python to insert spaces between words starting with capital letters.

import re

# Function to insert spaces between words starting with capital letters
def insert_spaces_between_capital_words(input_string):
    # Define a regex pattern to find words starting with capital letters
    pattern = re.compile(r'([A-Z][a-z]*)')

    # Use re.sub() to insert spaces between such words
    spaced_string = re.sub(pattern, r' \1', input_string)

    # Remove leading space if present
    spaced_string = spaced_string.lstrip()

    return spaced_string

# Test string
test_string = "ThisIsCamelCaseTextInPython"

# Insert spaces between words starting with capital letters
spaced_text = insert_spaces_between_capital_words(test_string)

# Print the spaced text
print("Original String:")
print(test_string)
print("\nSpaced String:")
print(spaced_text)

Write a Python program to remove continuous duplicate words from Sentence using Regular Expression.

import re

# Function to remove continuous duplicate words from a sentence
def remove_continuous_duplicates(sentence):
    # Define a regex pattern to match continuous duplicate words
    pattern = r'\b(\w+)(?:\s+\1)+\b'

    # Use re.sub() to remove continuous duplicates
    cleaned_sentence = re.sub(pattern, r'\1', sentence)

    return cleaned_sentence

# Test sentence
test_sentence = "This is is a test test sentence with duplicate duplicate words."

# Remove continuous duplicate words from the sentence
cleaned_sentence = remove_continuous_duplicates(test_sentence)

# Print the cleaned sentence
print("Original Sentence:")
print(test_sentence)
print("\nCleaned Sentence:")
print(cleaned_sentence)
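
The pattern above is case-sensitive, so "The the" would not be treated as a duplicate. A minimal sketch, assuming such pairs should be collapsed too, passes re.IGNORECASE so that the backreference also matches regardless of case; the first spelling is the one that is kept.

```python
import re

# Hypothetical sentence with a duplicate that differs only in case
sentence = "The the cat sat sat down."

deduplicated = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', sentence, flags=re.IGNORECASE)

print(deduplicated)  # The cat sat down.
```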

Create a function in python to remove all words from a string of length between 2 and 4.
The use of the re.compile() method is mandatory.
Sample Text: "The following example creates an ArrayList with a capacity of 50 elements. 4 elements are then added to the ArrayList and the ArrayList is trimmed accordingly."
Expected Output: following example creates ArrayList a capacity elements. 4 elements added ArrayList ArrayList trimmed accordingly.

import re

# Function to remove words of length 2 to 4 from a string using regex
def remove_words_of_length_2_to_4(input_string):
    # Compile a regex pattern to match words of length 2 to 4
    regex = re.compile(r'\b\w{2,4}\b')

    # Use the compiled pattern's sub() to remove the matched words
    cleaned_string = regex.sub('', input_string)

    # Collapse the gaps left behind into single spaces and trim the ends
    cleaned_string = re.compile(r'\s+').sub(' ', cleaned_string).strip()

    return cleaned_string

# Sample Text
sample_text = "The following example creates an ArrayList with a capacity of 50 elements. 4 elements are then added to the ArrayList and the ArrayList is trimmed accordingly."

# Call the function to remove words of length 2 to 4
cleaned_text = remove_words_of_length_2_to_4(sample_text)

# Print the cleaned text
print("Original Text:")
print(sample_text)
print("\nCleaned Text:")
print(cleaned_text)

Write a Python program using regex to remove symbols such as <U+..>.
In the sample text below, strange symbols of the form <U+..> appear all over the place. You need to come up with a general regex that covers all such symbols.
Sample Text: "@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who are protesting #demonetization are all different party leaders"

import re

# Sample text
sample_text = "@Jags123456 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who are protesting #demonetization are all different party leaders"

# Define a regex pattern to match <U+..> symbols
pattern = r'<U\+[0-9A-Fa-f]+>'

# Use re.sub() to remove the matched symbols
cleaned_text = re.sub(pattern, '', sample_text)

# Print the cleaned text
print("Original Text:")
print(sample_text)
print("\nCleaned Text:")
print(cleaned_text)
