In this post we will see how to use a stemming algorithm for search purposes.
A stemming algorithm lets you reduce each English input word to its basic root or stem (e.g. ‘walking’ to ‘walk’) so that variations on a word (‘walks’, ‘walked’, ‘walking’) are considered equivalent when searching. These stems can then be used in a search query instead of the original words, which generally (but not always) gives more relevant search results. The main use of stemming is in keyword indexing for search. For example, if you have an article or document titled ‘blogging tips for late workers’ and you run it through the algorithm, you will get a list of stems for the title: blog, tip, late, worker. You can then index the article or document under those stems.
The original paper on the algorithm by Martin Porter, generally known as the Porter Stemming algorithm, can be found here. The Porter Stemming algorithm essentially works by applying a series of rules that strip suffixes from a word.
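To give a feel for what rule-based suffix stripping means, here is a deliberately tiny sketch. It is not the Porter algorithm: the function name toy_stem(), the handful of rules, and the minimum stem length of 3 are all assumptions made for illustration, and the real algorithm uses several rule steps with stricter conditions on the remaining stem.

```php
<?php
// Toy suffix stripper: NOT the Porter algorithm, only an illustration
// of the idea. Rules are tried in order and the first match wins.
function toy_stem($word)
{
    $word = strtolower($word);

    // suffix to look for => what to replace it with
    $rules = array(
        'sses' => 'ss', // caresses -> caress
        'ies'  => 'i',  // ponies   -> poni
        'ss'   => 'ss', // caress   -> caress (protects the double s)
        'ing'  => '',   // walking  -> walk
        'ed'   => '',   // walked   -> walk
        's'    => '',   // walks    -> walk
    );

    foreach ($rules as $suffix => $replacement) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least 3 letters so short words such as "sing" are left alone.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength) . $replacement;
        }
    }

    return $word; // no rule applied
}

foreach (array('walks', 'walked', 'walking') as $w) {
    echo $w, ' -> ', toy_stem($w), "\n"; // each line ends in "-> walk"
}
```

Even this toy version shows why stems are not always dictionary words: ‘ponies’ comes out as ‘poni’.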
There are many implementations of the algorithm in various languages, so we will use one of those for our job. We will use a PHP5 implementation by Richard Heyes, which can be downloaded from here.
Below is an example of the use of the class, which is about as simple as it gets.
The following example creates a list of stems from an article title, removing any stop words from the list along the way.
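A minimal sketch of such an example is given here. It assumes the downloaded file has already been included and that it defines a PorterStemmer class with a static Stem($word) method; check the class you downloaded for the exact names. The helper title_stems(), the include path in the comment, and the short stop word list are hypothetical and chosen only for illustration.

```php
<?php
// Assumption: the downloaded stemmer file has been included and defines
// a PorterStemmer class with a static Stem($word) method, e.g.
// require_once 'path/to/stemmer-class-file.php'; // hypothetical path

// Hypothetical helper: split a title into words, drop stop words,
// and pass each remaining word through the given stemmer callable.
function title_stems($title, $stemmer, array $stopWords)
{
    // Lowercase, then split on any run of non-letters.
    $words = preg_split('/[^a-z]+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

    $stems = array();
    foreach ($words as $word) {
        if (in_array($word, $stopWords, true)) {
            continue; // skip stop words such as "for"
        }
        $stems[] = call_user_func($stemmer, $word);
    }

    return $stems;
}

// Illustrative stop word list, not a complete one.
$stopWords = array('a', 'an', 'and', 'for', 'of', 'the');

// Only runs when the assumed class is actually loaded.
if (class_exists('PorterStemmer')) {
    $stems = title_stems(
        'blogging tips for late workers',
        array('PorterStemmer', 'Stem'),
        $stopWords
    );
    print_r($stems);
}
```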
Which will return the following:
Array ( [0] => blog [1] => tip [2] => late [3] => worker )
The algorithm is most useful when you want to index documents or extend a search to morphologically related words. It sometimes gives amusing results, since a stem is not always a real word, but it can be quite helpful in the right situations.