Technical Marketing Guide

A travel guide for technical marketing – get inspired and start exploring yourself.

How to calculate the median per group with MySQL

The other day I had to calculate the median per group within MySQL. Turns out this is not as easy as it sounds mostly due to the fact that there exists no median() function in MySQL. After quite some trial and error and some helpful tipps I figured it out though. Especially for people that are not that experienced with MySQL, here is my own writeup of the solution.

Here’s the example we’re starting with. A series of data points for a two groups of candidates (male and female). We want to calculate the median height per group.

gender height
male 1.85
male 1.75
male 1.6
female 1.55
female 1.6
male 1.6
male 2.01
female 1.8

We’re starting very easy by selecting both columns and sorting them by group and height.

SELECT 
    gender, height
FROM
    heights
ORDER BY gender , height asc

Returning the following result:

gender height
female 1.55
female 1.6
female 1.8
male 1.6
male 1.6
male 1.75
male 1.85
male 2.01

Next up, we’ll need to count the rows per group and add it to the table as another column.

SELECT 
    gender,
    height,
    (SELECT 
        COUNT(*)
        FROM
            heights
        WHERE
            a.gender = gender) AS total_of_group
FROM
    (SELECT 
        gender, height
    FROM
        heights
    ORDER BY gender , height) AS a

Resulting:

gender height total_of_group
female 1.55 3
female 1.6 3
female 1.8 3
male 1.6 5
male 1.6 5
male 1.75 5
male 1.85 5
male 2.01 5

Now the select becomes a little bit more complex but this is the real important part. We’re adding two variables row_number and median_group. Row number should count up 1 each row until we’re hitting the next group, then it should start again at 1. We’re achieving this with the median_group helper variable. At each result the median_group variable is set to the rows group name (here gender). The case select detects if there has been a change of the groups name and resets the counter accordingly.

SET @row_number:=0; 
SET @median_group:='';

SELECT 
    @row_number:=CASE
        WHEN @median_group = gender THEN @row_number + 1
        ELSE 1
    END AS count_of_group,
    @median_group:=gender AS median_group,
    gender,
    height,
    (SELECT 
            COUNT(*)
        FROM
            heights
        WHERE
            a.gender = gender) AS total_of_group
FROM
    (SELECT 
        gender, height
    FROM
        heights
    ORDER BY gender , height) AS a

Result:

count_of_group median_group gender height total_of_group
1 female female 1.55 3
2 female female 1.6 3
3 female female 1.8 3
1 male male 1.6 5
2 male male 1.6 5
3 male male 1.75 5
4 male male 1.85 5
5 male male 2.01 5

Now we have everything to actually calculate the median. We’re wrapping another select around the current one and apply the median calculations.

By dividing the count of the group by 2 (total_of_group / 2.0) and doing the same adding 1 (total_of_group / 2.0 + 1) we’re defining a search range for our select that we apply to the count_of_group. What this means is that we select only those rows that lie between this range, effectively selecting the middle of each groups running count.

We’re also applying an average to the select (AVG(height)) because there might not always be only one row that gets selected for each group.

SET @row_number:=0; 
SET @median_group:='';

SELECT 
    median_group, AVG(height) AS median
FROM
    (SELECT 
        @row_number:=CASE
                WHEN @median_group = gender THEN @row_number + 1
                ELSE 1
            END AS count_of_group,
            @median_group:=gender AS median_group,
            gender,
            height,
            (SELECT 
                    COUNT(*)
                FROM
                    heights
                WHERE
                    a.gender = gender) AS total_of_group
    FROM
        (SELECT 
        gender, height
    FROM
        heights
    ORDER BY gender , height) AS a) AS b
WHERE
    count_of_group BETWEEN total_of_group / 2.0 AND total_of_group / 2.0 + 1
GROUP BY median_group

Final result:

median_group median
female 1.6
male 1.75

And we’re done. Exactly what we wanted to achieve. Here is the SQL Fiddle of the whole thing. Please let me know if this post helped you or if there is an easier way of doing this.

Next Post

Previous Post

Leave a Reply

Never miss another guide, sign up now

Cookie Settings

© 2018 Technical Marketing Guide