Visualizing missing data in databases

Missing data in databases can cause bugs in applications or incorrect calculations. Recently, while working on a RETS application, I needed to ensure that not many missing values were encountered in one of the MySQL tables. Although one could easily write a SQL query to find the percentage of missing values, I many times find it easier to first get a visual representation of the amount of missing data there is in the table, and then drill-down further if required. One library that I found that lets you easily get a visual representation of missing data in your database tables is missingno – a Python library.

We can use the missingno library to visualize missing data in database tables, CSV files or other tabular data structures if it can be properly read by Pandas

Installing missingno library

First we will use pip to install the library.

pip install missingno

Visualizing the missing data

I find it easier to convert the database table to a CSV file which will then be used for the display part using Python below. Here my CSV sample is of 418 rows. But it can runs into thousands.

For me the most productive way to get a feel for missing data is to use a matrix density display. The initial code is simple enough. Note that this also uses Pandas to make it easier to read the CSV file and convert it to a dataframe which the missingno library requires.

import pandas as pd 
import missingno as msno 
 
# Loading the dataset 
df = pd.read_csv("data.csv")
 
# Visualize missing values as a matrix 
im = msno.matrix(df)
im.get_figure().savefig("data.png")

The vertical black bands are the columns. The horizontal white lines are the missing values. The sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the data. By default the library omits the column names for a large display as it becomes unreadable for a larger dataset.

If you have thousands or millions of rows you can restrict the sample to a particular size.

msno.matrix(df.sample(500)) # Only display 500 rows from the data set

As you can see from the above display the display is quite dense. All the table columns are included in the display. We can, however, display a selective number of columns. For example, we can display the bottom 10 columns that have the most set of missing values.

# Loading the dataset 
df = pd.read_csv("data.csv") 
 
# Visualize missing values as a matrix 
filtered_data = msno.nullity_filter(df, filter='bottom', n=10) # or filter='top'
im = msno.matrix(filtered_data)
im.get_figure().savefig("data.png")

Here, only 10 columns which have the most missing values are displayed. You can change this to any number of columns depending on your table size.

Additional parameters

There are various parameters with which you can tweak the final display.

# SIZE : 
im = msno.matrix(filtered_data, figsize=(10,6)) # 10 inch by 6 inch image
 
# COLOR:
im = msno.matrix(filtered_data, color=(0.1, 0.2, 0.5)) # Blue color
 
# FONT-SIZE
im = msno.matrix(filtered_data, fontsize=(10))

You can also display a barchart, histogram and dendogram for the given dataset. Maybe I’ll cover those in some other post.

Leave a Reply

Your email address will not be published. Required fields are marked *