There is nothing painful than having to rebuild a database after you find that the records in the database are corrupted. Data corruption in databases can take many forms – from allowing inaccurate data to be entered to data field truncation due to inadequate field length. This post gives an introduction to the data quality issues you will encounter when designing database applications.
My last data related web project was for a client who needed to replicate an online RETS data to a local MySQL database. RETS stands for ‘Real Estate Transaction Standard’, and is a data standard for real estate brokers to exchange property listings. Each RETS source has multiple tables and each has on an average around 200 fields. The project was completed on schedule and was in test operation for a week when I found that some of the fields were storing truncated data. On quick investigation I found that the field length had been set to a smaller value than required. There was no other way to correct this than to update the field length and re-download the complete data again.Thankfully, RETS gives you the option of downloading only the fields you specify, so the correction was done quickly and the project went live, with the testing stage snafu quickly forgotten.
1.1 Introduction to Data Quality
One of the most important aspects of a web application is its database. It is a integral part of any web application and hence warrants a careful design stage. In this post I refer to Data quality as a umbrella term to encompass various facets of data issues one needs to consider in order to build a reliable database.
There are many types of data quality issues you should be aware of. In the next section we will look into the various issues that arise in data quality implementation. Address them all and you can be certain that you build a reliable database for your application.
1.2 Dimensions of data quality
There are many dimensions to data quality and the importance of each may depend on your particular application context. The summary of the various dimensions are given below and will be further explained in later chapters.
– This obviously is the most important and essential aspect of data quality one should consider. A misspelled name, an inaccurate numeric value, out-of-range values, wrong numeric units, incorrect birth dates – all represent inaccurate data.
– Even though you are making sure that the data is accurately entered ,some programming bug or human error may cause database fields to be not saved or entered, causing incomplete data to be persisted. This can manifest itself in various ways – truncated fields, missing field values, missing of entire records etc.
– A person who has two clocks is never sure. In the same manner, same data within multiple tables or databases should be consistent with one another. For example, birth date of a particular person in a table should match the birth date for the same person in other table. If there is any conflict it should be resolved immediately.
– A lock without a key is incoherent, they have to be together to make the object work as desired. Analogously, data between multiple data sources need to act in a coherent manner so as to make any sense of the data and provide a complete picture. For example, a users table should be linked to a invoice table by a set of key(s) to create a coherent sales transaction picture of the user.
– Although strictly not a part of data quality, a timely data refers to data that is available to the user at the required time and not afterwards. For example, if you are a stock analyst than you require data to be given to you in a particular window, after which time the data is usually stale and not relevant. As most web applications work with data that is transactional in nature – orders, payments etc, we will not go any further on this topic.
2.1 Data Entry Errors and Mitigation
In this section we will look into the accuracy aspects of data quality. Much part of data error occurs during data entry. When data is entered manually mistakes are bound to happen. Specially in fields like medical transcription where there is a large amount of information to be entered, chances are high that errors will creep in. Even small and simple data entry errors in prescription can prove fatal to the patient. As this book primarily focuses on data quality in web applications, we will look into the problems and solutions from a web programmers perspective.
Note that data collected from machines is also not free from errors. Machine failure can also be a source of undetected data corruption. There are a few major problem areas in data quality which occur during data entry in web forms. These are enumerated below.
2.1.1 Type of Data Entry Errors
Generally data entry error issues fall into four categories – *Transcription, Deletion, Insertion* and *Transposition*. We will look into each in detail below.
1. Transcription – Words or values entered are not those that were in the original data. This can sometimes be easily found and be benign, while other-times cause serious issues. Human transcription errors are commonly the result of typographical mistakes; putting one’s fingers in the wrong place while touch typing is the easiest way to make this error.
The following shows and example of a transcription error. The first section of the example is the data that is supposed to be entered while the later is the actual data entered.
– 27 September -> 27 Septemver
– James – > Janes
2. Deletion – One or more characters are accidentally deleted. This kind of errors are hard to detect while validating on a web form, specially peoples names, addresses etc. For numeric values however we can do a range check for entered values.
– 867.560 -> 867.50
– Bitter -> Biter
2. Insertion – One or more characters are accidentally inserted.
– 867.560 -> 8678.560
– Richard -> Richared
4. Transposition – One or more characters has switched places. Transposition errors are almost always human in origin. The most common way for characters to be transposed is when a user is touch typing at a speed that makes them input one character, before the other.
– 867.560 -> 867.650
– Alice -> Aliec
– Joshua -> Johsua
Besides the above another subtle source of errors is known as dittography and haplography – terms taken from literature.
Dittography is when a character is accidentally copied twice.
– 867.560 -> 867.5660
– James -> Jammes
Haplography is where a existing double character or character group is accidentally omitted.
– 8667.560 -> 867.560
– Mississippi – > Missippi
2.1.2 Data Entry Error Mitigation
Double data entry or “two-pass verification” is considered to be the standard approach to reduce data entry errors, especially transcription and transposition errors. However, as double-entry needs to be carried out by two separate data entry persons, the expenses associated with double data entry are substantial. As the focus of this book is on web applications, we will be mostly looking into issues related to data input from web forms, where obliviously data is entered by a single person.
In a web application most of the data is usually entered manually using web forms, maybe sometimes gathered through web apis. As most data passes through a web form, adequate field validation on the client is the first step to catch wrong entered data.
For example, if there is a field for entering a numeric value with a range of 0-999, don’t accept a higher or lover value than the range. When using web forms, you can use the HTML5 ‘min’ ‘max’ attributes to limit the range as shown below.
<!-- Quantity (between 0 and 999): --> <input type="number" name="quantity" min="0" max="999">
The primary purpose of data entry mitigation on the client side should be to stop bad data from ever entering our application from the user.
2.1.3 Checking Invalid Numeric Values
Not all data is entered from a web form. Some is downloaded via an API, from third party providers or from legacy systems. Many times the data is just handed to you by a client and asked to work with. In such cases the developers job is to fish out any data abnormalities from the database.
In this scenario wherein you have to work with only a database, CSV or some other raw data format, a web form is of no use. You have to dive into the data and find erroneous data. However to be able to do this correctly requires that you have a good understanding of the domain that you are working with.
For example, with the help of SQL queries we can easily check for any data abnormalities in our database. With reference to our RETS dataset, we can check various fields for invalid or out-of-range values. For example, the total number of bedrooms in a residential house should be between 1 – 10 (yes there are houses with 10 bedrooms or larger); and obviously we cannot usually have a negative or 0 number of bedrooms.
A simple MySQL query will easily give you the minimum and maximum values for the bedroom field from the database.
#BedsTotal is a RETS field SELECT MIN(BedsTotal) as MinBeds, MAX(BedsTotal) as MaxBeds FROM `listings`
This will display the minimum and maximum count of bedrooms for any property. If this show any negative or a very large value that surely something is wrong. You can than isolate the particular record or a set of records and investigate it further.
#Find all records where total bedrooms in a house > 10 #Unique_ID is the Primary Key of the 'listings' table SELECT Unique_ID FROM `listings` WHERE BedsTotal > 10
You can also check the average value for a field to get a sense of the data and smell out any fishy values.
#DISTINCT returns the average of the distinct values SELECT AVG ( DISTINCT BedsTotal ) FROM listings
If you have some field with a few number of distinct value, than you can created a frequency distribution to fish out erroneous data. For example, say your database has a ‘AssociationFeeFrequency’ field, with four different frequency values – ‘Annual’, ‘Monthly’, ‘Quarterly’, ‘Semi-Annual’ and ‘Other’. Now you can create a frequency distribution to see if any wrong values have been entered.
SELECT AssociationFeeFrequency AS Frequency, COUNT(AssociationFeeFrequency) AS Total FROM `listings` GROUP BY AssociationFeeFrequency
This will display the following SQL output.
+-------------+-------+ | Frequency | Total | +-------------+-------+ | Annual | 300 | | Monthly | 59 | | Monthy | 3 | | Other | 5 | | Quarterly | 41 | | Semi-Annual | 76 | +-------------+-------+
You can easily see that there are three records with a wrong field value – ‘Monthy’ instead of ‘Monthly’. We can easily correct this with the following simple query.
UPDATE `listings` SET AssociationFeeFrequency = 'Monthly' WHERE AssociationFeeFrequency = 'Monthy'
To automate this you can use the following query to find the total distinct values a column has in a table.
SELECT COUNT( DISTINCT(AssociationFeeFrequency)) FROM `listings`
For the above example there are a total of 5 distinct values the ‘AssociationFeeFrequency’ column can take. The above query will give you the total distinct values for the ‘AssociationFeeFrequency’ column, which in our case should be 5 or less. Of course, it could happen that out of the 5 values only 3 are used and one is incorrect, returning a total distinct value as 4. In such cases you will not be able to detect the error with this method and the frequency distribution method will work.
You can also create a quick histogram to check the distribution of values for any particular numeric field. For example, in the RETS ‘listings’ table there is a ‘ListPrice’ fields which gives the property price. You can use the following MySQL query to display a quick histogram and detect any outliers.
SELECT ROUND(ListPrice, -5) AS bucket, COUNT(*) AS COUNT, LPAD('', LN(COUNT(*)), '*') AS bar FROM `listings` GROUP BY bucket;
The following is the histogram output. Note the ‘bar’ variable displays in logarithmic format.
+---------+-------+--------+ | bucket | COUNT | bar | +---------+-------+--------+ | 0 | 190 | ***** | | 100000 | 135 | ***** | | 200000 | 257 | ****** | | 300000 | 176 | ***** | | 400000 | 97 | ***** | | 500000 | 59 | **** | | 600000 | 36 | **** | | 700000 | 26 | *** | | 800000 | 16 | *** | | 900000 | 13 | *** | | 1000000 | 9 | ** | | 1100000 | 2 | * | | 1200000 | 2 | * | | 1300000 | 4 | * | | 1500000 | 4 | * | | 1600000 | 4 | * | | 1700000 | 3 | * | | 1800000 | 2 | * | | 1900000 | 2 | * | | 2000000 | 1 | | | 2100000 | 1 | | | 3000000 | 3 | * | | 4500000 | 1 | | | 4800000 | 2 | * | | 5100000 | 1 | | +---------+-------+--------+
This is a excerpt from a book which I will be releasing on Leanpub on data quality in web applications.