Fuzzy string matching in NodeJS

String matching is an integral part of any programming language. Many times, however, one requires to get a fuzzy instead of an exact match between strings. Usually in search applications the same word may be spelled differently – which if we do an exact math will return empty results. Fuzzy matching allows you to search for ‘nearby’ strings to a candidate string.

In this post we will see how to use fuzzy string matching using the fuzzball package.

Installation of the fuzzball package is as usual done through npm.

npm install fuzzball

The fuzzy string matching works on the principle of edit distance – a way of quantifying how dissimilar two strings (e.g., words) are to one another. We also call it string similarity.

For example the following two strings are around 96% similar.

NEW YORK METS
NEW YORK MEATS

How did I arrive at that number? Well, there are various algorithms that lets you do that. The thing is that the fuzzball package does that for us. The similarity between two strings here is called the ‘fuzz ratio’.

Lets us see how we can test the fuzz ratio of the above string in NodeJS and the fuzzball package.

fuzz = require('fuzzball');
 
fuzz_ratio = fuzz.ratio("new york mets", "new york meats");
console.log(fuzz_ratio);
 
// Returns 96
// "!" is stripped and all characters are lower-cased 
// during pre-processing by default.
fuzz_ratio = fuzz.ratio("hello world", "Hello World!");
console.log(fuzz_ratio);
 
// Returns 100

This is the ‘Simple Ratio’. The other is ‘Partial Ratio’ – which returns the highest scoring substring of the longer string vs. the shorter string.

// substring of 2nd is a perfect match of the first
fuzz_ratio = fuzz.partial_ratio("test", "testing");
console.log(fuzz_ratio);
 
// Returns 100

Same string with different order

Out of order string sequences are common during string searches or queries returned by a database. For these purpose we can use the ‘Token Set Ratio’ – which tokenizes the string, sorts it and than recombines it before comparing. Compare the two methods below. The second one tokenizes, sorts and recombines each string, so the string positions do not matter. Token in this discussions refers to a single word. So if there are duplicates of a word than only a single word (token) is taken, the other discarded.

fuzz_ratio =fuzz.partial_ratio("eating right way", "right way eating");
// Returns 56
 
fuzz_ratio =fuzz.token_sort_ratio("eating right way", "right way eating");
// Returns 100

Finally there is ‘Token Set Ratio’, a flexible but a bit more complex idea of fuzzy matching.

fuzz_ratio =fuzz.token_set_ratio("the wonder years", "Danica McKellar the wonder years");
console.log(fuzz_ratio);
 
// Returns 100

Wildcards in string

You can set ‘options.wildcards’ to a string containing wildcard characters, to be used as wildcards when calculating edit distance. Each character in the string will be treated as a wildcard, and wildcards are case insensitive unless ‘options.full_process’ is set to false.

// Set '*' and '#' as wildcards
options = {wildcards: "*#"};
console.log(fuzz.ratio('fuz*ball', 'fu**ball', options));
 
// Returns 100

Batch Extract

Batch extract allows you to search list of choices for top results.

query = "polar bear";
choices = ["brown bear", "polar bear", "koala bear"];
 
results = fuzz.extract(query, choices);
console.log(results);
 
// [choice, score, index]
 [ [ 'polar bear', 100, 1 ],
   [ 'koala bear', 80,  2 ],
   [ 'brown bear', 60,  0 ] ]

The default search ratio is ‘fuzz.ratio’, but you can specify other ratios.

options = {
        // any function that takes two values and returns a score, default: ratio
        scorer: fuzz.partial_ratio, 
        // max number of top results to return, default: no limit / 0.
        limit: 2, 
        // lowest score to return, default: 0
        cutoff: 50, 
        // results won't be sorted if true, default: false. If true limit will be ignored.
        unsorted: false 
};
 
query = "polar bear";
choices = ["brown bear", "polar bear", "koala bear"];
 
results = fuzz.extract(query, choices, options);
 
console.log(results);
 
// [ [ 'polar bear', 100, 1 ], [ 'koala bear', 84, 2 ] ]

Changing default matching options

You can change the default options of how non-alphanumeric character are pre-processed.

// Do not remove non alpha-numeric characters,
// consider string case while comparing.
// The default full_process option is 'true'.
fuzz_ratio = fuzz.ratio("hello world", "Hello World!", {full_process: false});
 
// Returns 78
// force_ascii will strip out non-ascii characters except designated wildcards
fuzz_ratio = fuzz.full_process("myt^eäXt!");
// Returns myt eäxt
 
fuzz_ratio = fuzz.full_process("myt^eäXt!", {force_ascii: true});
// Returns myt ext

Use cases for fuzzy matching

There are many use cases for which we can use fuzzy string matching. One the most frequently encountered is correcting sloppily, misspelled data or data which has two variations. Take the following example of the tv show ‘2 broke girls’. Most people enter two variations of the name.

2 broke girls
two broke girls

Lets us see how the various ratio method compare.

console.log(fuzz.ratio("2  broke girls", "two broke girls"))
// Returns 86
 
console.log(fuzz.partial_ratio("2  broke girls", "two broke girls"))
// Returns 92

We can see that the ‘Partial Ratio’ returns a higher number which is quite useful to us when comparing the names of the tv show. In the last project I had used fuzzball to filter a long list of names in a CSV file – rejecting names after comparison which fall below a certain fuzz ratio.

The exact type of fuzz ratio you will need to use will depend on your application need and this will be trial, error and test process.

* Fuzzball is inspired by the Python fuzzywuzzy module.

Leave a Reply

Your email address will not be published. Required fields are marked *