A hands-on explanation of natural language processing techniques
NLP stands for Natural Language Processing. It is a branch of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
Natural Language Processing involves the interaction between computers and humans through natural language, which includes tasks such as speech recognition, text analysis, sentiment analysis, language translation, question answering, and text generation.
NLP technologies are designed to process and analyze text data in various forms, such as written text, spoken words, and other forms of human-generated language. NLP algorithms and models are trained on large datasets to learn patterns in language and can be used in a wide range of applications, including virtual assistants, chatbots, language translation tools, sentiment analysis tools, and many other applications where human-computer interaction involves language processing.
Natural Language Processing has become an increasingly important field of study and has found numerous applications in various industries, including healthcare, finance, customer service, marketing, and many others.
Topics to be covered:
- Text cleaning
- Tokenization
- Stemming
- Lemmatization
- N-grams
Pre-processing is an essential step in Natural Language Processing (NLP) that involves cleaning, transforming, and preparing raw text data for further analysis or modeling.
It is a crucial step because text data obtained from various sources can be noisy, unstructured, and contain irrelevant information, which can adversely affect the accuracy and effectiveness of NLP algorithms. Pre-processing aims to transform the raw text data into a structured format that is easier to analyze and extract meaningful insights from.
Some common pre-processing techniques used in NLP include:
- Text Cleaning: This involves removing irrelevant information such as special characters, punctuation, numbers, and white spaces from the text data. It may also involve converting text to lowercase or uppercase to ensure consistency in the data.
- Tokenization: This involves breaking the text data into individual words or tokens, which are the basic units of text for analysis. Tokenization helps in preparing the text data for further analysis, such as word frequency analysis, part-of-speech tagging, and sentiment analysis.
- Stopword Removal: Stopwords are common words such as “the,” “and,” “is,” and “in.” They carry little meaning on their own and can be filtered out of the raw data.
- Lemmatization and Stemming: These techniques involve reducing words to their root or base form to consolidate words with similar meanings. Lemmatization converts words to their base form using language rules, while stemming involves removing prefixes or suffixes from words to get their root form.
- Removing HTML tags, URLs, or other special elements: If the text data is obtained from web pages or other online sources, it may contain HTML tags, URLs, or other special elements that need to be removed before analysis.
- Spell Checking and Correction: This involves identifying and correcting misspelled words in the text data to improve the accuracy of the analysis.
- Removing or Replacing Specific Patterns: This may include removing or replacing patterns such as email addresses, phone numbers, or any other patterns that are not relevant to the analysis.
- Handling Text Encoding: Text data may have different encodings, such as UTF-8, ASCII, or Unicode, which may need to be converted to a common encoding format for consistency and compatibility in further analysis.
- Handling Imbalanced Data: In some NLP tasks, such as sentiment analysis or text classification, the text data may be imbalanced, with one class dominating the data. This can be addressed during pre-processing by two methods, i.e., oversampling the minority class or undersampling the majority class.
These are some common pre-processing techniques used in NLP to clean, transform, and prepare raw text data for further analysis or modeling. The specific pre-processing steps may vary depending on the specific task or application, and it is important to carefully consider the requirements and characteristics of the text data to ensure accurate and meaningful results in NLP analysis.
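As a small illustration of the oversampling approach mentioned above, here is a minimal sketch in plain Python (the oversample() helper and the tiny two-class dataset are made up for the example):

```python
import random

def oversample(texts, labels):
    """Randomly duplicate minority-class examples until all classes are balanced."""
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    # Grow every class to the size of the largest one
    target = max(len(items) for items in by_class.values())
    balanced_texts, balanced_labels = [], []
    for label, items in by_class.items():
        extra = [random.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            balanced_texts.append(text)
            balanced_labels.append(label)
    return balanced_texts, balanced_labels

# Imbalanced toy dataset: three positive examples, one negative
texts = ["great product", "loved it", "works well", "terrible"]
labels = ["pos", "pos", "pos", "neg"]
X, y = oversample(texts, labels)
print(y.count("pos"), y.count("neg"))  # 3 3
```

Undersampling works the other way around: it randomly drops examples from the majority class until class sizes match, at the cost of discarding data.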
Text cleaning
Text cleaning is an important step in Natural Language Processing (NLP) that involves cleaning and preparing raw text data for further analysis or modeling. Here’s an example of how you can perform text cleaning in Python using various libraries commonly used in NLP, such as NLTK (Natural Language Toolkit) and regular expressions (regex):
Python Example:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the NLTK resources needed below
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words("english"))
# Example raw text data
text = "specific text cleaning steps may vary depending on the requirements !@#$%^&*()"
# Convert to lowercase (keep the original text for comparison)
clean = text.lower()
# Use re to remove numbers and special characters
clean = re.sub(r"[^a-zA-Z]", " ", clean)
# Tokenize the text into individual words
token_words = word_tokenize(clean)
# Remove stopwords
tokens = [word for word in token_words if word not in stop_words]
# Join the tokens back into a clean text string
clean_text = " ".join(tokens)
print("Original Text: ", text)
print("Clean Text: ", clean_text)
#output:
Original Text: specific text cleaning steps may vary depending on the requirements !@#$%^&*()
Clean Text: specific text cleaning steps may vary depending requirements
In the above example, we first load the NLTK stop words, which are common words like “the,” “and,” “is,” etc., that are often removed from text data during text cleaning. Then, we provide an example of raw text data that needs to be cleaned.
We convert the text to lowercase, remove numbers and special characters using regular expressions, tokenize the text into individual words using NLTK’s word_tokenize() function, and then remove the stop words from the tokens. Finally, we join the tokens back into a clean text string using Python’s string manipulation capabilities.
Note that the specific text cleaning steps may vary depending on the requirements and characteristics of the text data, and it’s important to carefully consider the specific needs of your NLP task to ensure accurate and meaningful results.
Additionally, text cleaning is often an iterative process that may require experimentation and fine-tuning based on the specific text data and NLP task at hand.
Tokenization
Tokenization is the process of breaking text data into individual words or tokens, which are the basic units of text for analysis. Here’s an example of how you can perform tokenization in Python using the Natural Language Toolkit (NLTK) library:
Python Example:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# Example raw text data
text = "My name is Amit Chauhan. I live in New Delhi"
# Tokenize the text into individual words
tokens = word_tokenize(text)
print("Original Text: ", text)
print("Tokens: ", tokens)
#output:
Original Text: My name is Amit Chauhan. I live in New Delhi
Tokens: ['My', 'name', 'is', 'Amit', 'Chauhan', '.', 'I', 'live', 'in', 'New', 'Delhi']
In the above example, we first import the nltk library and load the word_tokenize() function from the nltk.tokenize module. Then, we provide an example of raw text data that needs to be tokenized. We pass the text data to the word_tokenize() function, which tokenizes the text into individual words and stores them in a list called tokens.
Stemming
Stemming is a common technique used in Natural Language Processing (NLP) to reduce words to their root or base form, known as a “stem.” This helps in reducing the complexity of text data and grouping similar words. Here’s an example of how you can perform stemming in Python using the NLTK (Natural Language Toolkit) library:
Python Example:
import nltk
from nltk.stem import PorterStemmer
# Create an instance of the PorterStemmer
ps = PorterStemmer()
# Example word to be stemmed
word = "running"
# Perform stemming
stemmed_word = ps.stem(word)
print("Original Word: ", word)
print("Stemmed Word: ", stemmed_word)
#output:
Original Word: running
Stemmed Word: run
In the above example, we first import the nltk library and create an instance of the PorterStemmer class from the nltk.stem module. Then, we provide an example word “running” that needs to be stemmed. We call the stem() method of the PorterStemmer class, passing the word as an argument, which performs stemming and returns the stemmed form of the word.
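Running a few more words through the stemmer shows that a stem is not always a valid dictionary word, since Porter stemming is a purely rule-based suffix-stripping process (the word list here is just an illustration):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Stem a batch of words; note that some stems are not real English words
words = ["running", "flies", "easily", "studies"]
stems = [ps.stem(w) for w in words]
print(stems)  # ['run', 'fli', 'easili', 'studi']
```

This is the main trade-off of stemming: it is fast and needs no dictionary, but the output can be unreadable, which is where lemmatization (next section) helps.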
Lemmatization
Lemmatization is a technique used in Natural Language Processing (NLP) to reduce words to their base or canonical form, known as a “lemma.” Unlike stemming, which simply removes suffixes from words, lemmatization takes into consideration the morphological analysis of words and reduces them to their meaningful base form. Here’s an example of how you can perform lemmatization in Python using the NLTK (Natural Language Toolkit) library:
Python Example:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# Create an instance of the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Example word to be lemmatized
word = "running"
# check part of speech (POS) Tag
pos_tag = nltk.pos_tag([word])[0][1]
# Map the Penn Treebank POS tag to a WordNet POS tag
wn_pos = (wordnet.ADJ if pos_tag.startswith('J')
          else wordnet.VERB if pos_tag.startswith('V')
          else wordnet.NOUN if pos_tag.startswith('N')
          else wordnet.ADV)
# Perform lemmatization
lemmatized_word = lemmatizer.lemmatize(word, wn_pos)
print("Original Word: ", word)
print("Lemmatized Word: ", lemmatized_word)
#output:
Original Word: running
Lemmatized Word: run
In the above example, we first import the nltk library and create an instance of the WordNetLemmatizer class from the nltk.stem module. Then, we provide an example word “running” that needs to be lemmatized. We use the nltk.pos_tag() function to determine the Part-Of-Speech (POS) tag of the word, as lemmatization is often dependent on the POS of the word. We then map the POS tag to the corresponding WordNet POS tag, as the WordNetLemmatizer requires the use of specific POS tags. Finally, we call the lemmatize() method of the WordNetLemmatizer class, passing the word and the mapped POS tag as arguments, which performs lemmatization and returns the lemmatized form of the word.
N-grams
In natural language processing (NLP), n-grams are contiguous sequences of n words or characters extracted from a given text. N-grams are used to represent language patterns and can be useful in various NLP tasks such as text generation, language modeling, sentiment analysis, and machine translation.
Here’s an example of how you can generate n-grams in Python:
def generate_ngrams(text, n):
    """
    Function to generate n-grams from a given text.
    Args:
    - text (str): Input text
    - n (int): Number of words or characters in each n-gram
    Returns:
    - List of n-grams
    """
    words = text.split()
    ngrams = []
    for i in range(len(words) - (n - 1)):
        ngram = ' '.join(words[i:i+n])  # Join n words to form an n-gram
        ngrams.append(ngram)
    return ngrams
# Example usage
text = "Today is a beautiful day and sky is cloudy"
n = 3 # Generate trigrams (3-grams)
ngrams = generate_ngrams(text, n)
print(f"{n}-grams: {ngrams}")
#output:
3-grams: ['Today is a', 'is a beautiful', 'a beautiful day', 'beautiful day and', 'day and sky', 'and sky is', 'sky is cloudy']
In the above example, we define a function called generate_ngrams() that takes a text and an integer n as input, where n specifies the size of the n-grams to be generated. The function splits the input text into words using split(), and then iterates through the words to generate n-grams by joining n consecutive words using join(). The resulting n-grams are stored in a list and returned.
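Since n-grams can also be character-level, as noted above, here is a minimal character n-gram sketch (the generate_char_ngrams() helper name is just for illustration):

```python
def generate_char_ngrams(text, n):
    """Generate character-level n-grams by sliding a window of size n over the string."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]

print(generate_char_ngrams("cloudy", 3))  # ['clo', 'lou', 'oud', 'udy']
```

Character n-grams are robust to misspellings and unseen words, which makes them popular in language identification and subword-based models.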
I hope you like the article. Reach me on LinkedIn and Twitter.

