Calculating descriptive statistics in MySQL


Descriptive statistics can be quite useful for simple analysis of records in a database. For example, to calculate average numbers of sales or products for a particular duration, or the Variance of sales for a month etc.

We can easily calculate standard descriptive statistic measures in MySQL such as MEAN, SUM, STANDARD DEVIATION, VARIANCE, MIN and MAX using built-in functions. Below is a sample student table with student id and scores which we will use for calculating descriptive statistical measures.

+------------+-------+
| student_id | score |
+------------+-------+
|          1 |    91 |
|          2 |    72 |
|          3 |    98 |
|          4 |    62 |
|          5 |    62 |
|          6 |    95 |
|          7 |    83 |
|          8 |    86 |
|          9 |    56 |
|         10 |    97 |
|         11 |    58 |
|         12 |    71 |
|         13 |    87 |
|         14 |    83 |
|         15 |    98 |
+------------+-------+

The following query calculates some standard statistic measures for the above data.

SELECT COUNT(score) AS N
	SUM(score) AS sum,
	AVG(score) AS mean,
	STDDEV_SAMP(score) AS 'std. dev.',
	VAR_SAMP(score) AS 'variance',
	MIN(score) AS minimum,
	MAX(score) AS maximum
FROM student;

This is what we get.

+----+------+---------+-----------+----------+---------+---------+
| N  | sum  | mean    | std. dev. | variance | minimum | maximum |
+----+------+---------+-----------+----------+---------+---------+
| 15 | 1199 | 79.9333 |   15.2103 | 231.3524 |      56 |      98 |
+----+------+---------+-----------+----------+---------+---------+

Generating a MEDIAN however can get a little tricky and we have to resort to some PHP code. For the above data the MEDIAN is calculated as 83. This is not a optimal solution and you could use an array instead.

/* Get total number of records */
$query = "SELECT count(*) as total FROM student";
$result = mysqli_query($con, $query);
$row = $result->fetch_assoc();
$total = $row['total'];
$median = 0;
 
/* If odd number of rows */
if($total % 2 == 1) {
    $median_location = ceil(($total - 1)/2);
    $query = "SELECT *
                FROM student
                ORDER BY score ASC
                LIMIT {$median_location},1";
    $result = mysqli_query($con, $query);
    $row = $result->fetch_assoc();
    $median = $row['score'];
} else {
    /* Even number of rows */
    $median_location = ($total/2)-1;
    $query = "SELECT *
                FROM student
                ORDER BY score ASC
                LIMIT {$median_location}, 2";
    $result = mysqli_query($con, $query);
    $temp = 0;
    while($row = $result->fetch_assoc()) {
        $temp += $row['score'];
    }
    $median = $temp / 2;
}

Calculating MODE also requires some extra code. MODE refers to the most frequently occurring item in our data set; in our example it is the most commonly occurring ‘score’. First we need to get the frequency of all the scores, and then retrieve the top result.

SELECT score, COUNT(score) AS frequency
    FROM student 
	 GROUP BY score 
	 ORDER BY frequency DESC;

In our example it will return the following, where the most commonly occurring items are 98, 62 and 83. There can be multiple MODE values for a dataset.

+-------+-----------+
| score | frequency |
+-------+-----------+
|    98 |         2 |
|    62 |         2 |
|    83 |         2 |
|    72 |         1 |
|    95 |         1 |
|    86 |         1 |
|    56 |         1 |
|    97 |         1 |
|    58 |         1 |
|    71 |         1 |
|    87 |         1 |
|    91 |         1 |
+-------+-----------+

Although not comprehensive data tools, these are some simple statistic measures to make a quick sense of your data.

This site is a digital habitat of Sameer Borate, a freelance web developer working in PHP, MySQL and WordPress. I also provide web scraping services, website design and development and integration of various Open Source API's. Contact me at metapix[at]gmail.com for any new project requirements and price quotes.

Your thoughts