Mark III Systems Blog

Phrase Comparison in Python

In this blog post I will discuss some easy-to-implement techniques for comparing strings in Python. I will mainly focus on what is known as “fuzzy” string matching, which means detecting not only exact matches but also similar strings that contain typos or the same words in a different order. These techniques are useful in a variety of problems; in my case, natural language processing.

Difflib

One can use the SequenceMatcher class in difflib, a module in the Python standard library, to do simple string comparisons. It takes two strings, and its ratio() method returns a similarity score between 0 and 1: twice the number of matching characters divided by the total number of characters in both strings. By choosing a threshold above which you consider two strings a match (e.g. everything greater than 0.80), one can implement fuzzy string matching using SequenceMatcher. Below is an example of how one might implement this in code.

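from difflib import SequenceMatcher

string1 = 'I am a test string'
string2 = 'I am a testing string'

# The first argument (None) means no characters are treated as junk
seq = SequenceMatcher(None, string1, string2)

# Similarity ratio between 0 and 1
print(seq.ratio())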

Output:

0.9230769230769231
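
The threshold idea described above can be sketched as a small helper. This is a minimal illustration rather than a definitive implementation: the function name is_fuzzy_match and the default cutoff of 0.80 are assumptions made for this example, and the right threshold depends on your data.

```python
from difflib import SequenceMatcher


def is_fuzzy_match(a, b, threshold=0.80):
    """Return True if the SequenceMatcher ratio of a and b exceeds threshold.

    Hypothetical helper for illustration; 0.80 is only an example cutoff.
    """
    return SequenceMatcher(None, a, b).ratio() > threshold


print(is_fuzzy_match("I am a test string", "I am a testing string"))  # True
print(is_fuzzy_match("I am a test string", "hello world"))            # False
```

Raising the threshold makes matching stricter (fewer false matches, more missed typos); lowering it does the opposite.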

TextDistance

If you need more complex pattern matching than what difflib’s SequenceMatcher can provide, you may want to check out the textdistance Python library. Below are some examples of algorithms included in the library. To see more, refer to its documentation (https://pypi.org/project/textdistance/). Before trying out any of the code below, make sure to run pip install textdistance.

Levenshtein Distance:

This algorithm counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other. This is useful for situations where you expect to see typos or slight variations on words, but not if you want to match strings where words have been rearranged, since every moved character counts as an edit. Below is a code example of this algorithm.

import textdistance

# Turning "test" into "testing" takes three insertions: i, n, g
print(textdistance.levenshtein("I am a test string", "I am a testing string"))

Output:

3
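
To show how the edit count above comes about, here is a standard-library sketch of the classic dynamic-programming approach to Levenshtein distance. The function name levenshtein_distance is an assumption made for this example, and this is not the code textdistance runs internally.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (hypothetical helper)."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("I am a test string", "I am a testing string"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.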

Jaccard Similarity:

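The Jaccard similarity algorithm outputs the ratio of shared items to total items across two strings (the size of the intersection divided by the size of the union), regardless of order. In the example below, the items being compared are individual characters. This can be especially useful in situations where you think words may be rearranged in your strings. Below is a code example using this algorithm.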

import textdistance

# Same words, different order, plus the extra word "of"
print(textdistance.jaccard("I am a test string", "I am a string of test"))

Output:

0.8571428571428571
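
The value above can be reproduced with a short standard-library sketch that treats each string as a bag (multiset) of characters. The function name jaccard_similarity is an assumption made for this example, and the code is an illustration of the idea rather than the library's internal implementation. Passing lists of words instead of strings gives a word-level comparison.

```python
from collections import Counter


def jaccard_similarity(a, b):
    """Multiset Jaccard: size of intersection / size of union.

    Hypothetical helper; a and b can be strings (compared by character)
    or lists of tokens (compared by token).
    """
    counts_a, counts_b = Counter(a), Counter(b)
    intersection = sum((counts_a & counts_b).values())  # minimum counts
    union = sum((counts_a | counts_b).values())         # maximum counts
    # Treating two empty inputs as identical is a convention chosen here.
    return intersection / union if union else 1.0


s1 = "I am a test string"
s2 = "I am a string of test"

print(jaccard_similarity(s1, s2))                  # character level: 18 / 21
print(jaccard_similarity(s1.split(), s2.split()))  # word level: 5 / 6
```

At the character level the two strings share 18 characters out of a union of 21, which is where the 0.857 figure comes from.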

MRA (Match Rating Approach):

The MRA algorithm works a bit differently from the other algorithms discussed above. Instead of looking at just the characters in the strings to find matches, it tries to tell whether two strings are similar based on the phonetic sounds of the words. This can be very useful for handling text from non-native speakers or common phonetic misspellings. Below is a code example using this algorithm.

import textdistance

# "doe" and "dough" are spelled differently but sound alike
print(textdistance.mra("doe", "dough"))

Output:

1
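
To give a feel for the phonetic step, here is a standard-library sketch of the encoding rules commonly published for the Match Rating Approach: drop vowels that are not the first letter, collapse doubled letters, and keep at most six letters. The function name mra_codex is an assumption made for this example, it covers only the encoding step (not the comparison that follows it), and it is not necessarily how textdistance implements the algorithm.

```python
from itertools import groupby


def mra_codex(word):
    """Encode a word with the commonly published MRA encoding rules.

    Hypothetical helper for illustration only.
    """
    if not word:
        return word
    word = word.upper()
    # Rule 1: drop vowels unless the vowel is the first letter.
    word = word[0] + "".join(c for c in word[1:] if c not in "AEIOU")
    # Rule 2: collapse runs of the same letter into a single letter.
    word = "".join(char for char, _ in groupby(word))
    # Rule 3: keep at most six letters (the first three plus the last three).
    if len(word) > 6:
        word = word[:3] + word[-3:]
    return word


print(mra_codex("doe"))    # D
print(mra_codex("dough"))  # DGH
```

Both words reduce to short codes that begin with the same letter, which suggests why a phonetic comparison can treat them as closer than their spelling alone would.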