Text Preprocessing for NLP: Basic Concepts

February 24, 2021 | By Michaela Buchanan | Category: Quick Tips

Natural language processing (NLP) requires large amounts of text to train models and make predictions. However, human language is complex, and steps must be taken to make it easier for our algorithms to generalize and understand. For this reason, data preprocessing in NLP tasks can be a labor-intensive operation. Below I’ve listed three common strategies used to make it easier for the algorithm to understand the text you provide, as well as how to implement them using the NLTK (Natural Language Toolkit) Python library (version 3.5).

Setup

To get started with NLTK on Linux, install the package with pip or conda (either pip install nltk or conda install -c anaconda nltk) and then import it with import nltk. For installation instructions for Windows or macOS, please visit https://www.nltk.org/install.html.

Tokenizing

When starting with sentences or even paragraphs of text, the first step is to break the text down into smaller units, either into sentences or individual words. NLTK provides both sentence and word tokenizers for this purpose. Below is an example for how to use the NLTK word tokenizer.

import nltk
from nltk.tokenize import word_tokenize

# download the tokenizer models used by word_tokenize
nltk.download('punkt')

data = "Hello world!"

# split the text into word and punctuation tokens
words = word_tokenize(data)
print(words)

Output:

['Hello', 'world', '!']

This example can be modified to tokenize by sentence instead of by word by replacing all instances of word_tokenize with sent_tokenize.
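To make the idea of word and sentence tokens concrete without any downloads, here is a minimal sketch that uses only Python's standard re module. The function names simple_word_tokenize and simple_sent_tokenize are made up for this illustration, and the regular expressions are a rough approximation, not the rules NLTK's tokenizers apply.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters (optionally with an
    # internal apostrophe) or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ? (a naive rule that
    # will break on abbreviations such as "Dr.").
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(simple_word_tokenize("Hello world!"))
# ['Hello', 'world', '!']
print(simple_sent_tokenize("Hello world! How are you today?"))
# ['Hello world!', 'How are you today?']
```

For real text, the NLTK tokenizers are the better choice, since a rule this simple mishandles abbreviations, decimals, and similar edge cases.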

Stemming

While grammatical cues are useful for humans when reading text, they are often unnecessary and confusing for a computer trying to make sense of a piece of text. For this reason, words are often simplified during NLP preprocessing. Stemming is one technique for doing this and involves removing the prefixes or suffixes of a word to reduce it to a more generic form. For example, if you stem the list of words [“waited”, “waits”, “waiting”], they should all turn into “wait”. Using stemming in NLTK is pretty straightforward, as seen in the sample code below.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')

ps = PorterStemmer()

data = "I went hiking with friends today."
words = word_tokenize(data)

# stem each token and collect the results in a list
print([ps.stem(word) for word in words])

Output:

['I', 'went', 'hike', 'with', 'friend', 'today', '.']
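To show what suffix stripping means at its simplest, here is a toy stemmer written for this illustration. The function naive_stem and its three-suffix rule list are assumptions of the sketch, not the Porter algorithm, which applies a much larger set of rules.

```python
def naive_stem(word):
    # Try a few common English suffixes, longest first, and strip the
    # first one that leaves at least three characters behind.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["waited", "waits", "waiting"]])
# ['wait', 'wait', 'wait']
print(naive_stem("hiking"))
# hik
```

The second call shows why a real stemmer needs more than blind suffix removal: the toy version leaves "hik", while the PorterStemmer output above restores the final "e" to give "hike".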

Lemmatization

While stemming is a great start to preprocessing text for NLP, lemmatization is usually the preferred approach. Stemming only removes the prefix or suffix from a word. Lemmatization, on the other hand, takes into account a word’s part of speech in order to turn it into the lemma, or base form, of the word. Because lemmatization is more sophisticated than stemming, its results tend to be less ambiguous. See below for an example of how to use lemmatization in NLTK.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# from https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
def get_wordnet_pos(treebank_tag):
    # map a Penn Treebank tag to the matching WordNet part of speech
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V') or treebank_tag.startswith('M'):
        return wordnet.VERB
    elif treebank_tag.startswith('N') or treebank_tag.startswith('P'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to noun for any other tag
        return wordnet.NOUN

# create the lemmatizer
lemmatizer = WordNetLemmatizer()

data = "I went hiking with friends today."
tokens = word_tokenize(data)

# pos_tag returns a list of (word, tag) pairs
pos_list = nltk.pos_tag(tokens)

# lemmatize each word using its part of speech and collect the results
print([lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos_list])

Output:

['I', 'go', 'hike', 'with', 'friend', 'today', '.']
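To illustrate why the part of speech matters, here is a toy lookup-based lemmatizer. The TOY_LEMMAS table and the toy_lemmatize function are hypothetical and exist only for this sketch. WordNet's real dictionary is far larger, but the principle is the same: the same surface word can map to different lemmas depending on its part of speech.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
# "v" stands for verb and "n" for noun.
TOY_LEMMAS = {
    ("went", "v"): "go",
    ("hiking", "v"): "hike",
    ("friends", "n"): "friend",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos="n"):
    # Look up the (word, pos) pair; return the word unchanged if the
    # table has no entry for that combination.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
print(toy_lemmatize("went", pos="v"))     # go
print(toy_lemmatize("went", pos="n"))     # went
```

The last two calls show what happens when the wrong part of speech is supplied: treated as a noun, "went" has no entry and comes back unchanged. This is why the NLTK example above tags each token first and passes an explicit pos to the lemmatizer.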

Conclusion

While this was only a brief overview of some fundamental topics in NLP preprocessing, it should give you the basic tools you need to start preparing text for ML/DL use. If you would like more information about these tools and many others, check out NLTK’s official documentation at https://www.nltk.org.

Mark III Systems Blog
