Porter Stemming algorithm for search

In this post we will see how to use a Stemming algorithm for search purposes.

A stemming algorithm lets you reduce each English input word to its basic root or stem (e.g. ‘walking’ to ‘walk’) so that variations on a word (‘walks’, ‘walked’, ‘walking’) are considered equivalent when searching. This stems can than be used in a search query rather than the original words, which generally (but not always) results in more relevant search results. The main use of stemming is in keyword indexing for search. For example if you have a article or document titled ‘blogging tips for late workers‘ and you run it through the algorithm you will get a list of stems for the title – blog, tip, late, worker; under which you can than index the article or document.

The original paper on the algorithm by Martin Porter, generally known as the Porter Stemming algorithm can be found here. The Porter Stemming algorithm essentially works by stripping suffixes from a word by using certain rules.

There are many implementation of the algorithm in various languages, so we will use one of those for our job. We will use a PHP5 implementation by Richard Heyes which can be downloaded from here.

Below is an example of the use of the class, which is as simple as it can ever get.

<?php
 
    require_once 'libs/PorterStemmer.php';
 
    $word_to_stem = 'walking';
    $stem = PorterStemmer::Stem($word_to_stem);
    echo $stem; // returns 'walk'
 
?>

The following example will create a list of stem words from a article title, also removing stop words from the list if any.

<?php
 
    /*Create a list of stem words for a article title */
 
 
    require_once 'libs/PorterStemmer.php';
 
    $title = "blogging tips for late workers";
 
    /* Stop words to filter out. Not comprehensive though  */
    $stop_words = array("the", "and", "a", "to", "of", "in", "i", "is",
                         "that", "it", "on", "you", "this", "for", "but",
                         "with", "are", "have", "be", "at", "or", "as",
                         "was", "so", "if", "out", "not");
 
    $words = explode(" ", $title);
 
    foreach($words as $word) {
        $stem = PorterStemmer::Stem($word);
 
        /* Remove stop words */
        if(!in_array($stem, $stop_words)) {
            $stem_words[] = $stem;
        }
    }
 
    print_r($stem_words);
 
?>

Which will return the following:

Array
(
    [0] => blog
    [1] => tip
    [2] => late
    [3] => worker
)

The algorithm is basically useful when you want to index documents or extend search for morphologically related words. Although it sometimes gives amusing results, it can be quite helpful at appropriate times.



7 thoughts on “Porter Stemming algorithm for search

  1. What you are trying to achieve is the reverse of Porter Stemming. Basically you want to generate inflections from the stem word which is not possible using this algorithm, as it is a simple algorithm without any dictionary lookup. To generate the word inflections as given by you will require a stemmer with a dictionary and understanding of some grammar.

    One possibility I could think of is to use a plain dictionary with the above algorithm and extract those words that reduce down to a particular stem. This words would be than the inflections of the original stem. For e.g to find the inflections of the word ‘work’ we would pass all the words starting with ‘work’ (or ‘w’ to make it simpler) through the stemmer and only select those that reduce down to ‘work’. Its a crude method though, devoid of any understanding of verbs.

    Or go with NLTK.

  2. This algorithm has a chance to match word about 60% at least php version.
    For example days of the week or months

  3. This is indeed a very useful algorithm. It’s one of the algo’s I use in http://flackr.net to match tweets. Since tweets have only a limited amount of words, comparing them by porterstem helps combine those that belong together.

    @guy: there is generally no need to output them. Like sameer said it’s not possible, but regardless of that you generally don’t need it. You use it internally to do the search comparisons; what you display are the actual search terms as entered by the user and the search results.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>