Fuzzy string matching using cosine similarity
Lately I've been dealing quite a bit with mining unstructured data[1]. This often involves determining the similarity of strings and blocks of text. One of the more interesting algorithms I came across was the Cosine Similarity algorithm. It's a pretty popular way of quantifying the similarity of sequences by treating them as vectors and calculating the cosine of the angle between them. This yields a value between 0 and 1, where 0 means no similarity whatsoever and 1 means that both sequences are exactly the same.
A bit on vectors
If you're anything like me, it's been a while since you've had to deal with any of this vector stuff, so here's a quick refresher to get you up to speed.
Vectors are geometric elements that have magnitude (length) and direction. We can perform mathematical operations on them just as we would on regular (scalar) numbers, as well as a few special operations, of which we'll need two: the dot product and the magnitude.
To keep things simple, let's consider two vectors
\[\overrightarrow{a} = (1, 2, 3)\]
and,
\[\overrightarrow{b} = (4, 5, 6)\]
Their dot product can be calculated as
\[\overrightarrow{a}\cdot \overrightarrow{b} = a_{1}b_{1} + a_{2}b_{2}+ \cdots + a_{n}b_{n} = \sum_{i=1}^n a_{i} b_{i}\]
in practice this would be,
\[\overrightarrow{a}\cdot\overrightarrow{b} = (1, 2, 3) \cdot (4, 5, 6) = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32 \]
We can also calculate their respective magnitudes, which are represented as,
\[\parallel \overrightarrow{a} \parallel = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2 }\]
in practice,
\[\parallel \overrightarrow{a} \parallel = \sqrt{ 1^{2}+2^{2}+3^{2} } = \sqrt{14} \approx 3.74166\]
and
\[\parallel \overrightarrow{b} \parallel = \sqrt{ 4^{2}+5^{2}+6^{2} } = \sqrt{77} \approx 8.77496\]
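If code reads more easily than notation, here's a minimal Java sketch of these two operations (the helper names dotProduct and magnitude are my own; the actual string implementation comes later):

public static double dotProduct(double[] a, double[] b) {
    //Sum the pairwise products of the components
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

public static double magnitude(double[] a) {
    //The magnitude is the square root of a vector's dot product with itself
    return Math.sqrt(dotProduct(a, a));
}

//dotProduct(new double[]{1, 2, 3}, new double[]{4, 5, 6}) -> 32.0
//magnitude(new double[]{1, 2, 3}) -> 3.74166 (the square root of 14)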
Calculating cosine similarity
Cosine similarity can be expressed mathematically as,
\[similarity(\overrightarrow{a} , \overrightarrow{b}) = \cos\theta = \frac{\overrightarrow{a} \cdot \overrightarrow{b} }{\parallel \overrightarrow{a} \parallel \parallel \overrightarrow{b} \parallel}\]
Expanding it a little further (I know... don't panic),
\[\frac{\overrightarrow{a} \cdot \overrightarrow{b}}{\parallel \overrightarrow{a} \parallel \parallel \overrightarrow{b} \parallel} = \frac{\sum_{i=1}^n a_{i} b_{i}}{\sqrt{\sum_{i=1}^n(a_{i})^2} \times {\sqrt{\sum_{i=1}^n(b_{i})^2}}}\]
This basically means,
- Get the dot product of vectors a and b
- Multiply magnitude a and magnitude b
- Divide the dot product of vectors a and b by the product of magnitude a and magnitude b.
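Plugging the example vectors from earlier into these steps,
\[similarity(\overrightarrow{a} , \overrightarrow{b}) = \frac{32}{\sqrt{14} \times \sqrt{77}} \approx \frac{32}{3.74166 \times 8.77496} \approx 0.9746\]
So a and b point in very nearly the same direction.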
Now that the math is out of the way, we can begin applying this algorithm to strings.
Working with strings
The first step is to turn our strings into vectors that we can work with. We do this by considering the Term Frequency of unique words or letters in a given string. Taking the following two strings as an example,
a. He is the hero Gotham deserves
b. but not the one it needs right now.
We can identify 13 unique words across these two strings. We build our vectors by noting how many times each word occurs in each string.
| # | Word | a | b |
|---|----------|---|---|
| 1 | the | 1 | 1 |
| 2 | he | 1 | 0 |
| 3 | is | 1 | 0 |
| 4 | it | 0 | 1 |
| 5 | right | 0 | 1 |
| 6 | needs | 0 | 1 |
| 7 | but | 0 | 1 |
| 8 | hero | 1 | 0 |
| 9 | deserves | 1 | 0 |
| 10 | not | 0 | 1 |
| 11 | now | 0 | 1 |
| 12 | one | 0 | 1 |
| 13 | gotham | 1 | 0 |
Now we have two vectors
\[\overrightarrow{a} = (1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1)\]
\[\overrightarrow{b} = (1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)\]
With this we can easily calculate the dot product and magnitudes of the vectors using the method that was illustrated earlier.
a · b = 1, ||a|| = 2.44948974278, ||b|| = 2.82842712475
Cosine similarity = 1 / (2.44948974278 × 2.82842712475)
Cosine similarity ≈ 0.144337567297
Well, if it wasn't already obvious that those two strings are not very similar (they share only a single common word, 'the'), now we have data to prove it.
To compare individual words, we consider the unique characters in each word to construct our vectors,
a. bob
b. rob
I'm sure you already see where this is going, but I'll show you anyway.
| # | Letter | a | b |
|---|--------|---|---|
| 1 | b | 2 | 1 |
| 2 | o | 1 | 1 |
| 3 | r | 0 | 1 |
Now we just compute the cosine similarity using these vectors,
a · b = 3, ||a|| = 2.2360679775, ||b|| = 1.73205080757
Cosine similarity = 3 / (2.2360679775 × 1.73205080757)
Cosine similarity ≈ 0.774596669241
It turns out 'bob' and 'rob' are indeed similar; approximately 77% according to the results.
It's important to note that this algorithm does not consider the order of the words or characters; it only uses their frequency as a measure. For example, 'dog bites man' and 'man bites dog' would score a perfect 1.0.
Diving into the code
I'm sure you've guessed that it's actually pretty easy to turn this algorithm into working code. Here's my take on the solution[2].
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/**
 * @param terms values to analyze
 * @return a map containing unique terms and their frequency
 */
public static Map<String, Integer> getTermFrequencyMap(String[] terms) {
    Map<String, Integer> termFrequencyMap = new HashMap<>();
    for (String term : terms) {
        Integer n = termFrequencyMap.get(term);
        n = (n == null) ? 1 : n + 1;
        termFrequencyMap.put(term, n);
    }
    return termFrequencyMap;
}
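//For example, getTermFrequencyMap("bob".split("(?!^)")) yields
//{b=2, o=1} (iteration order may vary), matching the letter table above.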
/**
 * @param text1 the first string to compare
 * @param text2 the second string to compare
 * @return cosine similarity of text1 and text2
 */
public static double cosineSimilarity(String text1, String text2) {
    //Get term frequency vectors
    Map<String, Integer> a = getTermFrequencyMap(text1.split("\\W+"));
    Map<String, Integer> b = getTermFrequencyMap(text2.split("\\W+"));
    //Get the words common to both sequences
    HashSet<String> intersection = new HashSet<>(a.keySet());
    intersection.retainAll(b.keySet());
    double dotProduct = 0, magnitudeA = 0, magnitudeB = 0;
    //Calculate dot product; terms missing from either vector contribute 0
    for (String item : intersection) {
        dotProduct += a.get(item) * b.get(item);
    }
    //Sum the squared components of a (its magnitude squared)
    for (String k : a.keySet()) {
        magnitudeA += Math.pow(a.get(k), 2);
    }
    //Sum the squared components of b (its magnitude squared)
    for (String k : b.keySet()) {
        magnitudeB += Math.pow(b.get(k), 2);
    }
    //Return the dot product over the product of the magnitudes
    return dotProduct / Math.sqrt(magnitudeA * magnitudeB);
}
This code splits the given strings into words using the regex pattern \W+. To adapt it for characters, the pattern (?!^) should be used instead; it splits a string into its individual characters.
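As a quick sanity check, feeding the Gotham strings from earlier through the method should reproduce the value we calculated by hand:

//Prints the word-level similarity of the example strings, ~0.1443
System.out.println(cosineSimilarity(
        "He is the hero Gotham deserves",
        "but not the one it needs right now."));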
Using the algorithm for fuzzy string matching
As an experiment, I ran the algorithm against the OSX internal dictionary[3], which contains about 235886 words, filtering out words with a similarity of less than 0.9. Running this against the keyword "hello" returned the following,
chello index: 0.9354143466934853
hell index: 0.9258200997725514
hellhole index: 0.9799578870122228
hello index: 1.0
helluo index: 0.9354143466934853
hole index: 0.944911182523068
holl index: 0.9258200997725514
holler index: 0.9354143466934853
molehill index: 0.9091372900969896
oilhole index: 0.9116846116771036
theelol index: 0.9116846116771036
wellhole index: 0.944911182523068
12 of 235886 potential matches for 'hello'
As an exact match, "hello" gets a value of 1, with second place going to "hellhole" and a tie for third between "wellhole" and "hole".
It's clear that this does a pretty good job of picking out potential matches. However, there are clear limitations to the approach: since the order of the letters in a string is not considered, words like "wellhole" were ranked just as high as "hole". This comes down to the term frequency of the characters in each word. The results will need to be further refined if finer granularity is required.
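For reference, here's a rough sketch of how that experiment could be reproduced. It assumes cosineSimilarity has been switched to the character-level (?!^) split described above, and that the word list lives at /usr/share/dict/words (its usual location on OS X):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static void main(String[] args) throws IOException {
    String keyword = "hello";
    //Read the dictionary, one word per line (path is an assumption)
    List<String> words = Files.readAllLines(Paths.get("/usr/share/dict/words"));
    int matches = 0;
    for (String word : words) {
        //cosineSimilarity here is the character-level (?!^) variant
        double score = cosineSimilarity(keyword, word.toLowerCase());
        if (score >= 0.9) {
            matches++;
            System.out.println(word + " index: " + score);
        }
    }
    System.out.println(matches + " of " + words.size() + " potential matches for '" + keyword + "'");
}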