Data cleaning in PHP applications

One of the most important tasks in any web application is sanitizing and standardizing data. Any data stored in a database should be in a standardized format, especially data that comes from a variety of sources.

Scrubbers or data cleaners are an important part of the data transformation process. Whenever you are involved in some data import or export process, data scrubbers can help you clean and standardize your data elements before storing.

There are many libraries that help with sanitizing and cleaning data. One I recently found is mr-clean, an extensible PHP data cleaner that you can use in your PHP applications to clean heterogeneous data before storing it in your database or in other persistent storage such as CSV files.

Before continuing, make sure that you are running PHP 5.4 or later, as the library uses features introduced in PHP 5.4.

Installation

The library can be installed using Composer. Add the following to your composer.json and run composer install.

{
    "require": {
        "joetannenbaum/mr-clean": "~0.0"
    }
}

Once the package is installed, Composer creates a vendor autoload file, which you can include in your application as shown below.

require_once 'vendor/autoload.php';
$cleaner = new MrClean\MrClean();

Basic data scrubbing

The simplest data scrubbing (cleaning) you can do is trim a string by removing extra spaces from the start and end of a string.

$scrubbed = $cleaner->scrubbers(['trim'])->scrub(' Hello World!  ');
echo $scrubbed; // 'Hello World!'

Although it may seem sensible to simply use the built-in PHP trim function, that function only accepts a single string, whereas the MrClean class can trim strings inside arrays and even objects. Note: we are using the short array syntax introduced in PHP 5.4, which replaces array() with [].

$array_of_strings = [
    " james",
    "  peter ",
    "george ",
    [
        "london  ",
        " new york  ",
        "delhi "
    ]
];
 
$scrubbed = $cleaner->scrubbers(['trim'])->scrub($array_of_strings);
print_r($scrubbed);
Array
(
    [0] => james
    [1] => peter
    [2] => george
    [3] => Array
        (
            [0] => london
            [1] => new york
            [2] => delhi
        )
 
)
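For comparison, here is a minimal sketch of what trimming the same nested array looks like in plain PHP, without the library. It assumes nothing about how mr-clean works internally; it only shows the extra walking code that the one-line scrubber call replaces.

```php
<?php
// Plain-PHP comparison: trim() handles one string at a time,
// so a nested array needs a recursive walker around it.
$array_of_strings = [" james", "  peter ", "george ", ["london  ", " new york  ", "delhi "]];

// array_walk_recursive() visits every non-array leaf and passes it by reference.
array_walk_recursive($array_of_strings, function (&$value) {
    if (is_string($value)) {
        $value = trim($value);
    }
});

print_r($array_of_strings);
```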

The following shows how to remove HTML tags.

$html_string = ' <strong>The world is my <b>Oyster</b></strong>';
$scrubbed = $cleaner->scrubbers(['strip_tags'])->scrub($html_string);
echo $scrubbed; // The world is my Oyster

The ‘scrubbers’ method takes an array of data cleaners that are run one after another on the value being cleaned. In the examples above we used the trim and strip_tags scrubbers individually. However, you can also pass several scrubbers in one array.

$scrubbers = [
    'trim',
    'strip_tags'
];
 
$html_string = ' <strong>The world is my <b>Oyster</b></strong>';
$scrubbed = $cleaner->scrubbers($scrubbers)->scrub($html_string);
echo $scrubbed; // The world is my Oyster

The above example applies the ‘trim’ scrubber to the string first and then applies the ‘strip_tags’ scrubber to the result.
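Running scrubbers "one after another" is ordinary left-to-right function application. The sketch below illustrates the idea in plain PHP; apply_in_order is a hypothetical helper written for this article and is not part of mr-clean.

```php
<?php
// Hypothetical helper (not part of mr-clean): run callables left to right,
// feeding each one the result of the previous one.
function apply_in_order(array $functions, $value)
{
    return array_reduce($functions, function ($carry, $function) {
        return call_user_func($function, $carry);
    }, $value);
}

$html_string = ' <strong>The world is my <b>Oyster</b></strong>';
$result = apply_in_order(['trim', 'strip_tags'], $html_string);
echo $result, "\n"; // The world is my Oyster
```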

Scrubbers are the PHP classes and functions that actually do the work, and you can assign as many as you want to clean your object. Any PHP string manipulation function that takes a single argument can be used. To reference a scrubber class, simply convert its StudlyCase name to snake_case. So we can use trim, strip_tags, htmlentities, stripslashes and htmlspecialchars.
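The naming rule maps a snake_case scrubber name to a StudlyCase class name. As a rough illustration of that mapping only (studly_case is a hypothetical helper, and how the library resolves names internally is an assumption here), the conversion can be sketched like this:

```php
<?php
// Hypothetical helper: turn a snake_case scrubber name into a StudlyCase class name.
function studly_case($name)
{
    // 'null_if_repeated' -> 'null if repeated' -> 'Null If Repeated' -> 'NullIfRepeated'
    return str_replace(' ', '', ucwords(str_replace('_', ' ', $name)));
}

echo studly_case('strip_phone_number'), "\n"; // StripPhoneNumber
echo studly_case('null_if_repeated'), "\n";   // NullIfRepeated
```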

Besides the examples above, you can clean a string, an array of arrays, an array of objects, or a single object.

Another useful feature is the ability to clean specific keys in an array. An example is provided below. Here the trim scrubber is only applied to the ‘email’ key and the strip_tags scrubber is only applied to the ‘last_name’ key.

$names = [ 
            ["first_name" => "<strong>peter</strong>",
             "last_name"  => "<i>james</i>",
             "email"      => " james@testmail.com"
            ],
            ["first_name" => "<strong>tom</strong>",
             "last_name"  => "hicks",
             "email"      => " tom@testmail.com"
            ]
          ];
 
 
$scrubbers = [
        'email' => ['trim'],
        'last_name'  => ['strip_tags'],
];
 
$scrubbed = $cleaner->scrubbers($scrubbers)->scrub($names);
Array
(
    [0] => Array
        (
            [first_name] => <strong>peter</strong>
            [last_name] => james
            [email] => james@testmail.com
        )
 
    [1] => Array
        (
            [first_name] => <strong>tom</strong>
            [last_name] => hicks
            [email] => tom@testmail.com
        )
 
)
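To make the key-specific behaviour concrete, the following plain-PHP sketch applies functions only to named keys of each row. The scrub_keys helper is hypothetical and only mirrors the result shown above; it is not the library's implementation.

```php
<?php
// Hypothetical helper: apply the listed functions only to the named keys of each row.
function scrub_keys(array $rows, array $functions_by_key)
{
    foreach ($rows as $i => $row) {
        foreach ($functions_by_key as $key => $functions) {
            if (!array_key_exists($key, $row)) {
                continue; // rows without this key are left untouched
            }
            foreach ($functions as $function) {
                $row[$key] = call_user_func($function, $row[$key]);
            }
        }
        $rows[$i] = $row;
    }
    return $rows;
}

$names = [
    ['first_name' => '<strong>peter</strong>', 'last_name' => '<i>james</i>', 'email' => ' james@testmail.com'],
];
$clean = scrub_keys($names, ['email' => ['trim'], 'last_name' => ['strip_tags']]);
print_r($clean);
```

Keys that are not listed, such as first_name here, pass through unchanged.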

The library also comes with a few prebuilt scrubbers: Boolean, HTML, Strip CSS Attributes, Nullify, Null If Repeated and Strip Phone Number. Each is explained below.

Boolean

The Boolean scrubber converts falsey text and anything considered empty to false; everything else becomes true. This can be useful when you need to standardize the true/false values in your application. Falsey text includes (not case sensitive):

no
n
false

$movies_seen = [
    'The Dark Knight'   => 'y',
    'The Green Lantern' => '1',
    'The Avengers'      => 'yes',
    'IronMan'           => 'n',
    'Star Trek'         => 'no',
    'Star Wars'         => '0',
    'Transformers'      => '  ',
    'Island'            => '4545' // convert anything else to a true
];
 
$scrubbed = $cleaner->scrubbers(['boolean'])->scrub( $movies_seen );
print_r($scrubbed);
Array
(
    [The Dark Knight] => 1
    [The Green Lantern] => 1
    [The Avengers] => 1
    [IronMan] => 
    [Star Trek] => 
    [Star Wars] => 
    [Transformers] => 
    [Island] => 1
)
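The rule described above can be approximated in a few lines of plain PHP. This is a sketch of the documented behaviour, not the scrubber's actual source; to_boolean is a hypothetical name, and trimming before the check is assumed from the whitespace-only example above.

```php
<?php
// Hypothetical stand-in for the behaviour described for the boolean scrubber.
function to_boolean($value)
{
    if (is_string($value)) {
        $value = strtolower(trim($value)); // case-insensitive, ignore padding
    }
    if (empty($value)) {
        return false; // '', '0', 0, null, [] and whitespace-only strings
    }
    return !in_array($value, ['no', 'n', 'false'], true);
}

var_dump(to_boolean('Yes'));  // bool(true)
var_dump(to_boolean(' NO ')); // bool(false)
var_dump(to_boolean('0'));    // bool(false)
var_dump(to_boolean('4545')); // bool(true)
```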

HTML

Strips tags that are not on the whitelist, and removes empty content tags and repeated opening or closing tags. The whitelist includes: a, p, div, strong, em, b, i, br, ul, ol, li, h1, h2, h3, h4, h5, h6.

$dirty = '<p><p>Bad HTML here.</p><hr /><em></em><div>To be cleaned.</div>';
$scrubbed = $cleaner->scrubbers(['html'])->scrub( $dirty );
echo $scrubbed;
<p>Bad HTML here.</p><div>To be cleaned.</div>

To change the elements in the whitelist, you will need to edit the list in the ‘mr-clean\src\Scrubber\Html.php’ file.
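If you only need whitelisting with a list you control, the first two steps (dropping non-whitelisted tags and removing empty elements) can be sketched with built-in functions. The strip_to_whitelist helper below is hypothetical, takes the whitelist as a parameter, and does not handle repeated opening or closing tags, so it is not a replacement for the html scrubber.

```php
<?php
// Hypothetical helper: keep only whitelisted tags, then drop empty elements.
function strip_to_whitelist($html, array $allowed_tags)
{
    // strip_tags() accepts the allowed tags as a string such as '<p><div>'.
    $allowed = '<' . implode('><', $allowed_tags) . '>';
    $html = strip_tags($html, $allowed);

    // Remove elements with no content, e.g. <em></em> or <p>  </p>.
    return preg_replace('#<([a-z][a-z0-9]*)\b[^>]*>\s*</\1>#i', '', $html);
}

$dirty = '<p>Bad HTML here.</p><hr /><em></em><div>To be cleaned.</div>';
echo strip_to_whitelist($dirty, ['p', 'div', 'em', 'strong']), "\n";
// <p>Bad HTML here.</p><div>To be cleaned.</div>
```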

Strip CSS Attributes

Strips the style, class, and id attributes from all HTML elements.

$dirty = '<p style="font-weight:bold;" 
             id="bold-el" class="boldest">CSS attribute stripper</p>';
$scrubbed = $cleaner->scrubbers(['strip_css_attributes'])->scrub($dirty);
echo $scrubbed;
<p>CSS attribute stripper</p>
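As an illustration of the idea, a regular expression can remove quoted style, class, and id attributes. This is a simplified sketch under the assumption that attribute values are quoted; the function below is a hypothetical stand-in, and a regex is not a full HTML parser, so it would also match the same pattern if it appeared in ordinary text.

```php
<?php
// Hypothetical stand-in: remove quoted style/class/id attributes with a regex.
function strip_css_attributes($html)
{
    // \s+ requires whitespace before the name, so "data-id" is not affected.
    $pattern = '/\s+(?:style|class|id)\s*=\s*(?:"[^"]*"|\'[^\']*\')/i';
    return preg_replace($pattern, '', $html);
}

$dirty = '<p style="font-weight:bold;" id="bold-el" class="boldest">Some text</p>';
echo strip_css_attributes($dirty), "\n"; // <p>Some text</p>
```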

Nullify

If a trimmed string doesn’t have any length, null it out.

$dirty = [
    'some text',
    'another text',
    ' ',
    '    '
];
 
$scrubbed = $cleaner->scrubbers(['nullify'])->scrub($dirty);
print_r($scrubbed);
 
is_null($dirty[3]); // false
is_null($scrubbed[3]); // true
Array
(
    [0] => some text
    [1] => another text
    [2] => null
    [3] => null
)
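The same rule is easy to express in plain PHP, which also makes it simple to check the result. The nullify function below is a hypothetical stand-in for the scrubber. Note that print_r() prints a null as an empty value, so var_dump() or is_null() is a clearer way to confirm that a value really became null.

```php
<?php
// Hypothetical stand-in: turn whitespace-only strings into null.
function nullify($value)
{
    return (is_string($value) && trim($value) === '') ? null : $value;
}

$scrubbed = array_map('nullify', ['some text', ' ', '    ']);

var_dump($scrubbed[0]);          // string(9) "some text"
var_dump(is_null($scrubbed[2])); // bool(true)
```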

Null If Repeated

If a string is just a repeated character (‘1111111’ or ‘aaaaaaaaa’) and has a length greater than two, null it out. Regarding the use case for this scrubber, the author explains: “To give some context to why I even made this in the first place, we were doing a large-ish migration of some pretty terrible data from one database to another, and we were cleaning it up along the way. This particular cleaner came from lazy users who would just fill fields with long strings of repeated characters, which we wanted to null out. Not sure if it’s applicable to anyone else, just figured I’d include it”.

$dirty = [
    '11111111',
    '22',
    'bbbbbbbb',
    '333334',
];
 
$scrubbed = $cleaner->scrubbers(['null_if_repeated'])->scrub($dirty);
print_r($scrubbed);
Array
(
    [0] => null
    [1] => 22
    [2] => null
    [3] => 333334
)
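A back-reference in a regular expression captures this rule compactly. The sketch below is an assumption about how such a check could be written, not the scrubber's source; null_if_repeated is defined here as a plain function for illustration.

```php
<?php
// Hypothetical stand-in: null out strings made of one character repeated 3+ times.
function null_if_repeated($value)
{
    // (.) captures the first character; \1{2,} requires at least two more copies,
    // so only strings of three or more identical characters match.
    if (is_string($value) && preg_match('/^(.)\1{2,}$/su', $value) === 1) {
        return null;
    }
    return $value;
}

$scrubbed = array_map('null_if_repeated', ['11111111', '22', 'bbbbbbbb', '333334']);
var_dump($scrubbed);
```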

Strip Phone Number

Strip a phone number down to just the digits and the letter ‘x’ (for extensions).

$dirty = [
    '555-555-5555',
    '(123) 456-7890',
    '198 765 4321 ext. 888',
];
 
$scrubbed = $cleaner->scrubbers(['strip_phone_number'])->scrub($dirty);
print_r($scrubbed);
Array
(
    [0] => 5555555555
    [1] => 1234567890
    [2] => 1987654321x888
)
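The rule amounts to deleting every character that is not a digit or an ‘x’. A minimal plain-PHP sketch, with strip_phone_number as a hypothetical stand-in function and lowercasing added as an assumption so that ‘X’ is treated like ‘x’:

```php
<?php
// Hypothetical stand-in: keep only digits and the letter 'x'.
function strip_phone_number($value)
{
    return preg_replace('/[^0-9x]/', '', strtolower($value));
}

echo strip_phone_number('(123) 456-7890'), "\n";        // 1234567890
echo strip_phone_number('198 765 4321 ext. 888'), "\n"; // 1987654321x888
```

One consequence of this approach is that any ‘x’ in the input survives, not only the one in "ext.", so free-text fields may need an extra check.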

Pre/Post scrubbing

To save some typing, you can set scrubbers to run every time before and after each cleaning.

$cleaner->pre(['trim']);
$cleaner->post(['htmlentities']);
 
$to_clean = '   <strong>This should be cleaned. &</strong>';
// 'trim' will run before each of these, 'htmlentities' after each
$scrubbed = $cleaner->scrubbers(['strip_tags'])->scrub($to_clean);
echo $scrubbed; // This should be cleaned. &amp;
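Conceptually, the pre and post lists are just placed before and after the scrubbers given for each call. The following sketch shows that ordering in plain PHP; scrub_with_hooks is a hypothetical helper and makes no claim about how the library stores or runs its hooks.

```php
<?php
// Hypothetical helper: run pre functions, then the per-call functions, then post functions.
function scrub_with_hooks($value, array $pre, array $scrubbers, array $post)
{
    foreach (array_merge($pre, $scrubbers, $post) as $function) {
        $value = call_user_func($function, $value);
    }
    return $value;
}

$to_clean = '   <strong>This should be cleaned. &</strong>';
$result = scrub_with_hooks($to_clean, ['trim'], ['strip_tags'], ['htmlentities']);
echo $result, "\n"; // This should be cleaned. &amp;
```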

Creating custom scrubbers

You can extend the library by creating your own custom scrubber classes. First, write your class. All you have to do is extend MrClean\Scrubber\BaseScrubber, which adheres to MrClean\Scrubber\ScrubberInterface. A single property, value, is available to you; this is the string you will manipulate.

Below is an example custom scrubber that keeps only numeric characters. Name this file ‘OnlyNumeric.php’ and copy it to your ‘mr-clean\src\Scrubber’ directory.

<?php
 
namespace MrClean\Scrubber;
 
class OnlyNumeric extends BaseScrubber {
 
    /**
     * Delete any non numeric characters
     *
     * @return string
     */
 
    public function scrub()
    {
        return preg_replace("/[^0-9]/","",$this->value);
    }
 
}

Once you create a custom scrubber, you need to register it in your code. The register method takes a string containing the fully qualified name of the class, or an array of such class names.

$cleaner->register('Your\Namespace\YourCustomScrubber');

You can then use it in your code as follows.

require_once 'vendor/autoload.php';
$cleaner = new MrClean\MrClean();
 
/* Register our custom scrubber */
$cleaner->register('MrClean\Scrubber\OnlyNumeric');
 
$dirty = '&*&01234 -5 -6-7-8!!9  Hello World!';
$scrubbed = $cleaner->scrubbers(['OnlyNumeric'])->scrub($dirty);
echo $scrubbed; // 0123456789

Or clean a dirty array.

$dirty = [
            '&*&01234 -5 -6-7-8!!9  Hello World!',
            '121949 30943 887&*&()',
            '  343565 &**64423232',
            '1234lsdksldsd--'
         ];
$scrubbed = $cleaner->scrubbers(['OnlyNumeric'])->scrub($dirty);
print_r($scrubbed);
Array
(
    [0] => 0123456789
    [1] => 12194930943887
    [2] => 34356564423232
    [3] => 1234
)