Porter Stemming algorithm for search


In this post we will see how to use a Stemming algorithm for search purposes.

A stemming algorithm lets you reduce each English input word to its basic root or stem (e.g. ‘walking’ to ‘walk’) so that variations on a word (‘walks’, ‘walked’, ‘walking’) are considered equivalent when searching. This stems can than be used in a search query rather than the original words, which generally (but not always) results in more relevant search results. The main use of stemming is in keyword indexing for search. For example if you have a article or document titled ‘blogging tips for late workers‘ and you run it through the algorithm you will get a list of stems for the title – blog, tip, late, worker; under which you can than index the article or document.

The original paper on the algorithm by Martin Porter, generally known as the Porter Stemming algorithm can be found here. The Porter Stemming algorithm essentially works by stripping suffixes from a word by using certain rules.

There are many implementation of the algorithm in various languages, so we will use one of those for our job. We will use a PHP5 implementation by Richard Heyes which can be downloaded from here.

Below is an example of the use of the class, which is as simple as it can ever get.

<?php
 
    require_once 'libs/PorterStemmer.php';
 
    $word_to_stem = 'walking';
    $stem = PorterStemmer::Stem($word_to_stem);
    echo $stem; // returns 'walk'
 
?>

The following example will create a list of stem words from a article title, also removing stop words from the list if any.

<?php
 
    /*Create a list of stem words for a article title */
 
 
    require_once 'libs/PorterStemmer.php';
 
    $title = "blogging tips for late workers";
 
    /* Stop words to filter out. Not comprehensive though  */
    $stop_words = array("the", "and", "a", "to", "of", "in", "i", "is",
                         "that", "it", "on", "you", "this", "for", "but",
                         "with", "are", "have", "be", "at", "or", "as",
                         "was", "so", "if", "out", "not");
 
    $words = explode(" ", $title);
 
    foreach($words as $word) {
        $stem = PorterStemmer::Stem($word);
 
        /* Remove stop words */
        if(!in_array($stem, $stop_words)) {
            $stem_words[] = $stem;
        }
    }
 
    print_r($stem_words);
 
?>

Which will return the following:

Array
(
    [0] => blog
    [1] => tip
    [2] => late
    [3] => worker
)

The algorithm is basically useful when you want to index documents or extend search for morphologically related words. Although it sometimes gives amusing results, it can be quite helpful at appropriate times.

This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development and integration of various Open Source API's. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.

7 Responses

1

Guy Patterson

April 22nd, 2009 at 10:34 am

How would you output the stems? In other words, based off your example, how would I output:

blog
blogging
blogged
blogger

tip
tips
tipped
tipping
tipper

late
later
latest

work
worker
workers
working

Or is this even a possibility?

Thanks,

Guy
https://www.nullamatix.com/pubkey.txt

sameer

April 22nd, 2009 at 10:03 pm

What you are trying to achieve is the reverse of Porter Stemming. Basically you want to generate inflections from the stem word which is not possible using this algorithm, as it is a simple algorithm without any dictionary lookup. To generate the word inflections as given by you will require a stemmer with a dictionary and understanding of some grammar.

One possibility I could think of is to use a plain dictionary with the above algorithm and extract those words that reduce down to a particular stem. This words would be than the inflections of the original stem. For e.g to find the inflections of the word ‘work’ we would pass all the words starting with ‘work’ (or ‘w’ to make it simpler) through the stemmer and only select those that reduce down to ‘work’. Its a crude method though, devoid of any understanding of verbs.

Or go with NLTK.

3

Naz

April 26th, 2009 at 8:39 am

This algorithm has a chance to match word about 60% at least php version.
For example days of the week or months

4

Ivo Jansch

April 29th, 2009 at 3:32 pm

This is indeed a very useful algorithm. It’s one of the algo’s I use in http://flackr.net to match tweets. Since tweets have only a limited amount of words, comparing them by porterstem helps combine those that belong together.

@guy: there is generally no need to output them. Like sameer said it’s not possible, but regardless of that you generally don’t need it. You use it internally to do the search comparisons; what you display are the actual search terms as entered by the user and the search results.

5

Sameer Borate’s Blog: Porter Stemming algorithm for search : WebNetiques, LLC : Website Developers in Minneapolis, MN

April 29th, 2009 at 9:48 pm

[...] a recent post to his blog Sameer looks at implementing a Stemming algorithm to search an array of words. It uses this library (as written by Richard [...]

6

Sameer Borate’s Blog: Porter Stemming algorithm for search : Dragonfly Networks

April 29th, 2009 at 9:50 pm

[...] a recent post to his blog Sameer looks at implementing a Stemming algorithm to search an array of words. It uses this library (as written by Richard [...]

7

Porter Stemming algorithm for search : CodeDiesel

May 2nd, 2009 at 5:09 am

[...] Read more here: Porter Stemming algorithm for search : CodeDiesel [...]

Your thoughts