Mark III Systems Blog

Text Preprocessing for NLP: Basic Concepts

Natural language processing requires large amounts of text to train models and make predictions. However, human language is complex, and steps must be taken to make it easier for our algorithms to generalize and understand. For this reason, data preprocessing in NLP tasks can be a labor-intensive operation. Below I’ve listed three common strategies for making the text you provide easier for an algorithm to understand, along with how to implement them using the NLTK (Natural Language ToolKit) Python library (version 3.5).


To get started with NLTK on Linux, simply install the package with pip or conda (either pip install nltk or conda install -c anaconda nltk) and then import it with import nltk. For installation instructions for Windows or macOS, please visit this link:


When starting with sentences or even paragraphs of text, the first step is to break the text down into smaller units, either sentences or individual words. NLTK provides both sentence and word tokenizers for this purpose. Below is an example of how to use the NLTK word tokenizer.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

data = "Hello world!"
words = word_tokenize(data)
print(words)

['Hello', 'world', '!']

This example can be easily modified to tokenize by sentence instead of by word by simply replacing all instances of word_tokenize with sent_tokenize.


While grammatical cues are useful for humans when reading text, they are often unnecessary and confusing for a computer trying to make sense of a piece of text. For this reason, words are often simplified during NLP preprocessing. Stemming is one technique for doing this and involves removing the prefixes or suffixes of a word to break it down to a more generic form. For example, if you stem the list of words [“waited”, “waits”, “waiting”], they should all turn into “wait” after stemming. Using stemming in NLTK is pretty straightforward, as seen in the sample function below. 

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')

ps = PorterStemmer()
data = "I went hiking with friends today."
words = word_tokenize(data)
stemmed = [ps.stem(word) for word in words]
print(stemmed)

['i', 'went', 'hike', 'with', 'friend', 'today', '.']

(Note that NLTK’s Porter stemmer also lowercases each token by default, which is why “I” comes out as “i”.)


While stemming is a great start to preprocessing text for NLP, lemmatization is usually preferred. Stemming only removes a word’s prefix or suffix. Lemmatization, on the other hand, takes a word’s part of speech into account in order to reduce it to its lemma, or base form. Because lemmatization is more sophisticated than stemming, it can produce less ambiguous results. See below for an example of how to use lemmatization in NLTK.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# map Penn Treebank POS tags to the WordNet POS tags the lemmatizer expects
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V') or treebank_tag.startswith('M'):
        return wordnet.VERB
    elif treebank_tag.startswith('N') or treebank_tag.startswith('P'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# create the lemmatizer and tag each token with its part of speech
lemmatizer = WordNetLemmatizer()
data = "I went hiking with friends today."
tokens = word_tokenize(data)
pos_list = nltk.pos_tag(tokens)
lemmas = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) for word, tag in pos_list]
print(lemmas)

['I', 'go', 'hike', 'with', 'friend', 'today', '.']


While this was only a brief overview of some fundamental topics in NLP preprocessing, it should give you the basic tools you need to start preparing text for ML/DL use. If you would like more information about these tools and many others, check out NLTK’s official documentation at