Splitting a text on word boundaries


Extracting a substring from a long piece of text is a common task in web design, mostly in blogs and other CMSs that display excerpts. The usual way to build an excerpt is to take the first n characters of the text, or the first n words. We will explore both ways.

In PHP we frequently use the substr function for this. However, neither substr nor any of its variants splits the text at word boundaries, so a broken word is often left hanging at the end. There are many ways to prevent this with some text adjustments; a simple one is shown here.

First, let us take the following text as an input string.

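/* Built by concatenation so that the words are separated by single
   spaces; line breaks inside the quotes would become part of the string. */
$text = 'Upon my back, to defend my belly; '
      . 'upon my wit, to defend my wiles; '
      . 'upon my secrecy, to defend mine honesty; '
      . 'my mask, to defend my beauty';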

Splitting the text with the PHP substr function returns the following string, which cuts the word ‘defend’ in half. This is not what we want.

echo substr($text,0, 54) . "...";
# returns -> Upon my back, to defend my belly; upon my wit, to defe...

The splitText function

A simple function to break text at word boundaries is shown below.

function splitText($text, $maxLength)
{
    /* Only shorten strings that are longer than $maxLength. */
    if (strlen($text) > $maxLength)
    {
        /* Trim the text to $maxLength characters. */
        $cut = substr($text, 0, $maxLength);
        $text = $cut;

        /* Split words only at boundaries. This is accomplished by
           removing one character at a time from the end of the cut
           string until a space is found. Stopping at an empty string
           avoids an endless loop when the cut contains no space.
         */
        while ($text !== '' && substr($text, -1) != ' ')
        {
            $text = substr($text, 0, -1);
        }

        /* Remove the whitespace at the end. */
        $text = rtrim($text);

        /* No space was found: fall back to the hard cut. */
        if ($text === '')
        {
            $text = $cut;
        }
    }
    return $text;
}

Using the splitText function above, we again try to extract a substring of at most 54 characters, which gives the following string. Note that the output is now shorter than 54 characters, because the function backs up to the previous word boundary rather than breaking a word.

echo splitText($text, 54) . "...";
# returns -> Upon my back, to defend my belly; upon my wit, to...

The important element in the function is the inner while loop, which backtracks through the substring one character at a time until a space is found, indicating a word boundary. The intermediate values of the $text variable as the loop iterates are shown below.

Upon my back, to defend my belly; upon my wit, to defe
Upon my back, to defend my belly; upon my wit, to def
Upon my back, to defend my belly; upon my wit, to de
Upon my back, to defend my belly; upon my wit, to d
Upon my back, to defend my belly; upon my wit, to
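
The same backtracking can also be done without a loop by asking PHP for the position of the last space in the cut string. The following is a minimal sketch of that idea, not a replacement prescribed by this article: the name splitTextAtLastSpace is made up for illustration, and returning the hard cut when no space is found is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper name; a sketch of the loop-free variant.
function splitTextAtLastSpace($text, $maxLength)
{
    // Nothing to do when the text already fits.
    if (strlen($text) <= $maxLength) {
        return $text;
    }

    // Hard cut first, as in the loop version.
    $cut = substr($text, 0, $maxLength);

    // strrpos() returns the offset of the last space, or false if none.
    $pos = strrpos($cut, ' ');
    if ($pos === false) {
        // Assumption: with no word boundary available, keep the hard cut.
        return $cut;
    }

    // Keep everything before the last space and drop trailing whitespace.
    return rtrim(substr($cut, 0, $pos));
}

$text = 'Upon my back, to defend my belly; upon my wit, to defend my wiles;';
echo splitTextAtLastSpace($text, 54) . "...\n";
// prints: Upon my back, to defend my belly; upon my wit, to...
```

The explicit check for false matters: passing false straight to substr as the length would produce an empty string for text that contains no spaces.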

Splitting text by counting words

Instead of splitting the text by a number of characters, we can split it by counting words. So rather than asking for the first 54 characters of the text, we ask for the first 11 words. For that we can use the following splitTextByWords function.

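function splitTextByWords($str, $words = 10)
{
    /* Split on runs of whitespace into at most $words + 1 pieces.
       The last piece holds the rest of the text and is discarded.
       trim() keeps leading whitespace from producing an empty first word.
     */
    $arr = preg_split('/\s+/', trim($str), $words + 1);
    $arr = array_slice($arr, 0, $words);
    return implode(' ', $arr);
}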

We can use it as follows.

echo splitTextByWords($text, 11) . "...";
# returns -> Upon my back, to defend my belly; upon my wit, to...

To check that this works correctly, you can run the above function in a loop.

for($i=1; $i <= 11; $i++)
{
    echo splitTextByWords($text, $i) . "<br>";
}

This will output the following.

Upon
Upon my
Upon my back,
Upon my back, to
Upon my back, to defend
Upon my back, to defend my
Upon my back, to defend my belly;
Upon my back, to defend my belly; upon
Upon my back, to defend my belly; upon my
Upon my back, to defend my belly; upon my wit,
Upon my back, to defend my belly; upon my wit, to
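
One detail the examples above gloss over is that the "..." is appended even when the text is short enough to be shown in full. A small wrapper can add the suffix only when something was actually cut off. This is a sketch under stated assumptions: the function name excerptByWords and its $suffix parameter are hypothetical and not part of the functions shown earlier.

```php
<?php
// Hypothetical helper: returns the first $words words of $str and appends
// $suffix only when the text contained more words than that.
function excerptByWords($str, $words = 10, $suffix = '...')
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that leading or
    // trailing whitespace would otherwise produce.
    $all = preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY);

    // Short text: return it whole, without the suffix.
    if (count($all) <= $words) {
        return implode(' ', $all);
    }

    return implode(' ', array_slice($all, 0, $words)) . $suffix;
}

echo excerptByWords('my mask, to defend my beauty', 3) . "\n";
// prints: my mask, to...
echo excerptByWords('my mask, to defend my beauty', 10) . "\n";
// prints: my mask, to defend my beauty
```

Unlike splitTextByWords, this version splits the whole text before counting, which is simpler to read but does more work on very long input.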

Of course if you want a somewhat tighter control over the character count in the excerpt, the splitText function is a more appropriate solution. Although there are many more variations of the above two methods, the ones given here should suffice for most practical purposes.
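
As one example of such a variation, PHP's built-in wordwrap function can do the character-based split: it breaks the text into lines of at most a given width at word boundaries, and the first line is then the excerpt. The sketch below assumes the input contains no line breaks of its own, and the name splitTextWithWordwrap is hypothetical.

```php
<?php
// Hypothetical helper name; a sketch, not the splitText function above.
function splitTextWithWordwrap($text, $maxLength)
{
    // wordwrap() inserts "\n" at word boundaries so that lines stay within
    // $maxLength. With the fourth argument false, a single word longer
    // than $maxLength is kept whole instead of being cut.
    $wrapped = wordwrap($text, $maxLength, "\n", false);

    // The first line is the excerpt (assumes $text had no newlines).
    $lines = explode("\n", $wrapped, 2);
    return $lines[0];
}

$text = 'Upon my back, to defend my belly; upon my wit, to defend my wiles;';
echo splitTextWithWordwrap($text, 54) . "...\n";
// prints: Upon my back, to defend my belly; upon my wit, to...
```

Because overlong words are left intact here, the result can exceed the limit when the very first word is longer than $maxLength; passing true as the fourth argument to wordwrap would cut such words instead.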

This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development, and integration of various open source APIs. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.

3 Responses

1

Dotan Cohen

December 13th, 2012 at 1:54 pm

This:
while(substr($text,-1) != ' ')
{
    $text = substr($text, 0, -1);
}

$text = rtrim($text);

Could be easily replaced with this:
$text = substr($text, 0, strrpos($text, ' '));

2

Dotan Cohen

December 13th, 2012 at 1:56 pm

As for splitting on words, the regex engine will match word boundaries on ‘\b’.

3

Cubicle Ninjas

February 8th, 2013 at 8:20 am

Wow! This is a great simple solution! Thanks for this!
