Explain Preprocessing and Postprocessing with a Tokenizer for Natural Language Processing (NLP) Tasks in Django

Preprocessing with a tokenizer
Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

  1. Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
  2. Mapping each token to an integer
  3. Adding additional inputs that may be useful to the model

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).
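Before walking through the individual steps, note that calling the tokenizer object directly performs all three of the responsibilities listed above in one go: it splits the text, maps tokens to IDs, and adds extra inputs such as the attention mask. A minimal sketch, assuming the "bert-base-uncased" checkpoint used throughout this post:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Hello, how are you?")

# input_ids include the special [CLS] and [SEP] tokens;
# attention_mask marks real tokens (1) versus padding (0)
print(encoded["input_ids"])
print(encoded["attention_mask"])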

The process can be explained in the following steps:

Install the required libraries:

pip install transformers torch

Import the necessary libraries:

from transformers import AutoTokenizer
import torch

Load the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Here, we are using the "bert-base-uncased" tokenizer from the Hugging Face Transformers library. You can choose different tokenizers based on your requirements.
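For example, to use a smaller checkpoint you can simply pass a different name; any checkpoint on the Hugging Face Model Hub is loaded the same way ("distilbert-base-uncased" below is just one such option):

# Any Model Hub checkpoint name can be loaded the same way
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")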

Define the input text:

input_text = "Hello, how are you?"

Tokenize the input text:

tokens = tokenizer.tokenize(input_text)

Tokenization splits the input text into individual tokens. For example, the tokens for the input text "Hello, how are you?" would be:

['hello', ',', 'how', 'are', 'you', '?']

Convert tokens to input IDs:

input_ids = tokenizer.convert_tokens_to_ids(tokens)

This step maps each token to its corresponding ID in the tokenizer's vocabulary; the input IDs are the numerical representation of the tokens. For example, the input IDs for the tokens ['hello', ',', 'how', 'are', 'you', '?'] would be:

[7592, 1010, 2129, 2024, 2017, 1029]

Create a tensor from input IDs:

input_tensor = torch.tensor([input_ids])

This step converts the input IDs into a tensor using the torch.tensor function. The input tensor is a PyTorch tensor that represents the input data in a numerical format suitable for NLP models.
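Note that torch.tensor([input_ids]) simply wraps a single sequence in a batch of size one. To batch several sentences of different lengths, the sequences must be padded to a common length, which the tokenizer can do directly. A small sketch, assuming the same tokenizer:

# The tokenizer pads the shorter sequence and returns PyTorch tensors directly
batch = tokenizer(
    ["Hello, how are you?", "Fine, thanks!"],
    padding=True,
    return_tensors="pt",
)

print(batch["input_ids"].shape)   # torch.Size([2, sequence_length])
print(batch["attention_mask"])    # 0s mark the padded positions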

By following these steps, you can preprocess text using a tokenizer and convert it into tensors, enabling you to feed the preprocessed data into NLP models for tasks like text classification, sentiment analysis, or machine translation.

Full code

from transformers import AutoTokenizer
import torch

# Define the input text
input_text = "Hello, how are you?"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the input text
tokens = tokenizer.tokenize(input_text)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Create a tensor from input IDs
input_tensor = torch.tensor([input_ids])

print("Tokens:", tokens)
print("Input IDs:", input_ids)
print("Input Tensor:", input_tensor)

Output:

Tokens: ['hello', ',', 'how', 'are', 'you', '?']
Input IDs: [7592, 1010, 2129, 2024, 2017, 1029]
Input Tensor: tensor([[7592, 1010, 2129, 2024, 2017, 1029]])

Postprocessing with a tokenizer

Postprocessing with a tokenizer using AutoTokenizer and tensors in NLP means converting the output of an NLP model back into human-readable text using the tokenizer. This step is necessary to interpret the model's predictions or to generate text from the model's output.
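In the simplest case this is a single call: tokenizer.decode() maps a sequence of token IDs straight back to a string. A minimal sketch, reusing the IDs produced for "Hello, how are you?" in the preprocessing section (the steps below break this down into the underlying calls):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# IDs produced for "Hello, how are you?" during preprocessing
ids = [7592, 1010, 2129, 2024, 2017, 1029]
print(tokenizer.decode(ids))  # hello, how are you?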

Install the required libraries:

pip install transformers torch

Import the necessary libraries:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

Load the tokenizer and the model:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Here, we are using the "bert-base-uncased" tokenizer and model from the Hugging Face Transformers library. Make sure to use the same tokenizer and model that were used during the preprocessing step.

Define the input tensor:

input_tensor = torch.tensor([[101, 7592, 1010, 2129, 2024, 2017, 1029, 102]])

The input tensor stands in for a sequence of token IDs, such as the IDs produced during preprocessing or predicted by a model. Here we reuse the IDs from the preprocessing example above, wrapped in the special tokens [CLS] (ID 101) and [SEP] (ID 102):

output

tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 1029,  102]])

Convert the input tensor back to tokens:

output_tokens = tokenizer.convert_ids_to_tokens(input_tensor[0].tolist())

output

['[CLS]', 'hello', ',', 'how', 'are', 'you', '?', '[SEP]']

This step converts the input tensor back into tokens using the tokenizer's convert_ids_to_tokens method. The tolist() function converts the tensor to a Python list.

Convert tokens to human-readable text:

output_text = tokenizer.convert_tokens_to_string(output_tokens)

output

"[CLS] hello , how are you ? [SEP]"

The convert_tokens_to_string method concatenates the tokens into a single human-readable string. Note that it keeps the special tokens and the spaces around punctuation; tokenizer.decode(input_tensor[0].tolist(), skip_special_tokens=True) performs both conversion steps at once and cleans this up, returning "hello, how are you?".

By following these steps, you can perform postprocessing with a tokenizer and tensors in NLP. This allows you to convert the model's output back into human-readable text, making it easier to interpret the predictions or generate text based on the model's output.

Second way
Step 1: Tokenization
First, we'll tokenize the input text using the AutoTokenizer. This step converts the input text into a sequence of tokens.

from transformers import AutoTokenizer

# Create the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Input text
input_text = "I love natural language processing"

# Tokenize the input text
tokens = tokenizer.tokenize(input_text)

print(tokens)

Output:

['i', 'love', 'natural', 'language', 'processing']

Step 2: Convert tokens to input tensors
Next, we'll convert the tokens into input tensors that can be processed by the model. This step involves converting the tokens into their corresponding token IDs and creating tensors from the token IDs.

import torch

# Convert tokens to input tensor
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_tensor = torch.tensor([input_ids])

print(input_tensor)

Output:

tensor([[ 1045,  2293,  3019,  2653, 11617]])

Step 3: Make predictions with the model
Now, we can use the input tensor to make predictions with the NLP model. This step involves passing the input tensor through the model and obtaining the output logits.

from transformers import AutoModelForSequenceClassification

# Load the pre-trained model with a 3-class classification head.
# Note: the head is not fine-tuned, so it is randomly initialized and
# the predictions below are for illustration only.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Make predictions
output = model(input_tensor)
print(output)

Output: the exact values will vary, because the classification head is randomly initialized, and these are raw logits rather than probabilities. Depending on your transformers version, the result prints as a tuple or as a SequenceClassifierOutput object; either way, the first element is the logits tensor:

(tensor([[...]], grad_fn=<AddmmBackward0>),)

Step 4: Apply softmax to get probabilities
To interpret the model's output, we can apply the softmax function to the logits. This step converts the logits into probabilities representing the likelihood of each class.

import torch.nn.functional as F

# Apply softmax to the output logits
probabilities = F.softmax(output[0], dim=1)

print(probabilities)

Output (example values; with an untrained classification head the exact numbers will vary from run to run):

tensor([[0.1809, 0.5376, 0.2815]], grad_fn=<SoftmaxBackward>)

Step 5: Get the predicted class
To determine the predicted class, we can find the class with the highest probability. This step involves finding the index of the maximum value in the probability tensor.

predicted_class = torch.argmax(probabilities, dim=1)
predicted_label = predicted_class.item()

print(predicted_label)

Output:

1

Step 6: Map the predicted label to text
Finally, we can map the predicted label to its corresponding text label. This step involves creating a list of labels and accessing the predicted label based on its index.

label_list = ["negative", "neutral", "positive"]
predicted_text = label_list[predicted_label]

print(predicted_text)

Output:

neutral

In this example, we tokenize the input text, convert the tokens to an input tensor, make predictions with the model, apply softmax to get probabilities, find the predicted class, and map the predicted label to its corresponding text label. Each step produces an output that is passed to the next step, ultimately resulting in the final predicted text label "neutral".
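Putting the six steps together, here is the whole pipeline as one script (a sketch: since "bert-base-uncased" ships without a fine-tuned classification head, the num_labels=3 head is randomly initialized and the predicted label is for illustration only):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Steps 1-2: tokenize and build the input tensor
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_text = "I love natural language processing"
input_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(input_text))
input_tensor = torch.tensor([input_ids])

# Step 3: run the model (inference only, so no gradients are needed)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
with torch.no_grad():
    logits = model(input_tensor)[0]

# Steps 4-6: softmax, argmax, and mapping the index to a text label
probabilities = F.softmax(logits, dim=1)
predicted_label = torch.argmax(probabilities, dim=1).item()
label_list = ["negative", "neutral", "positive"]
print(label_list[predicted_label])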

Another way

Postprocessing with a tokenizer using AutoTokenizer, softmax, and tensors in NLP involves converting the model's output into human-readable text and applying softmax to obtain probabilities for each class. This process is commonly used in classification tasks. Here's an example of how it can be done:

Install the required libraries:

pip install transformers torch

Import the necessary libraries:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

Load the tokenizer and the model:

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Here, we are using the "bert-base-uncased" tokenizer and model from the Hugging Face Transformers library. Make sure to use the same tokenizer and model that were used during the preprocessing step.

Define the input text:

input_text = "This is an example sentence."

Tokenize the input text:

input_tokens = tokenizer.encode(input_text, add_special_tokens=True)

The encode method of the tokenizer converts the input text into a sequence of tokens. The add_special_tokens=True argument adds special tokens like [CLS] and [SEP] required by the model.
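To see the effect of add_special_tokens, you can map the IDs back to tokens (a small check, assuming the same tokenizer):

# The first and last IDs are the special tokens: 101 -> [CLS], 102 -> [SEP]
print(tokenizer.convert_ids_to_tokens(input_tokens))
# ['[CLS]', 'this', 'is', 'an', 'example', 'sentence', '.', '[SEP]']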

Convert the input tokens to a tensor:

input_tensor = torch.tensor([input_tokens])

The input tensor represents the tokenized input, which is typically a sequence of numerical values.

Pass the input tensor through the model:

output = model(input_tensor)[0]

The model processes the input tensor and produces an output tensor. The [0] indexing is used to extract the logits from the model's output.
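Since we are only running inference here, not training, it is good practice to disable gradient tracking, which saves memory and computation. The same call wrapped in torch.no_grad():

# Inference only: no gradients are needed
with torch.no_grad():
    output = model(input_tensor)[0]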

Apply softmax to obtain probabilities:

probabilities = F.softmax(output, dim=1)

The softmax function normalizes the logits across the classes and converts them into probabilities. The dim=1 argument indicates that the softmax is applied along the second dimension, which represents the classes.

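Note: as a standalone illustration of F.softmax(), here is a minimal sketch (the logit values are arbitrary, chosen only for the example):

import torch
import torch.nn.functional as F

# A batch of one example with three class logits (arbitrary values)
logits = torch.tensor([[2.0, 1.0, 0.1]])

# Softmax along dim=1 (the class dimension) turns the logits into
# probabilities that sum to 1
probabilities = F.softmax(logits, dim=1)

print(probabilities)        # tensor([[0.6590, 0.2424, 0.0986]])
print(probabilities.sum())  # tensor(1.)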

Get the predicted label:

_, predicted_class = torch.max(probabilities, 1)
predicted_label = predicted_class.item()

The torch.max function returns both the highest probability and its index along the given dimension; predicted_class holds the index, and predicted_label stores it as a plain Python integer.
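Note: torch.max with a dimension argument returns both the maximum values and their indices, while torch.argmax returns only the indices. A minimal sketch with arbitrary probabilities:

import torch

probabilities = torch.tensor([[0.1, 0.7, 0.2]])

# torch.max returns a (values, indices) pair along the given dimension
values, indices = torch.max(probabilities, 1)
print(values)   # tensor([0.7000])
print(indices)  # tensor([1])

# torch.argmax returns only the index of the maximum
print(torch.argmax(probabilities, dim=1))  # tensor([1])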

Convert the predicted label back to text:

label_list = ["negative", "neutral", "positive"]
predicted_text = label_list[predicted_label]

In this example, we assume a classification task with three classes: "negative," "neutral," and "positive." The predicted_label integer is used to index the corresponding label from the label_list, providing the human-readable predicted label.

Note:

label_list = ["negative", "neutral", "positive"]
predicted_label = 2

predicted_text = label_list[predicted_label]

print(predicted_text)

Output:

positive
