Tuesday, 14 December 2010

Medians and Quartiles

Problem statement:

From a subset of records (identified by a where clause) find the student with the mean salary, the student at the 25 percentile and the student at 75 percentile. Also get the "outliers", which are the students with the highest and the lowest salaries.

Added complication: Each student record also has a weighting associated with it, which goes from 1 to 1000, and this needs to be considered when counting the percentiles. In effect, this means that each student records represents a thousand "mili-students" and we need to find the student at the correct percentile.

All the following solutions need a supporting function which will calculate the required rank (e.g. for median) when given the total number of results. E.g. if the total is 11, the mean position is 6.

Solution 1 (the most basic and slowest):

We created another large table with each student repeated a thousand times. Then ordered this expanded result set by salary, looped through this ordered list until we were at the correct rank.

Obviously given that the result set was expanded a thousand times, this solution wasn't great for performance.

Solution 2 (Improvement, but still quite cumbersome):

We don't repeat each student a thousand times. Instead, we know the mean position in terms of the weight (e.g. the student at position 6.334 is the mean) and then while looping through the ordered list of students we keep a cumulative total of weights. As soon as the cumulative total goes beyond this position we exit the loop and return that student's salary as the mean value.

Solution 3 (Fastest and fits in a single SELECT statement):

Use Oracle Analytic functions to get the cumulative sum mentioned above, and the SIGN function to find whether it is more than or equal to the mean position. Then get the minimum where the sign value is on the right side. The query looks like this:


SELECT country,
        MAX(DECODE(ranking, 1, salary, NULL)) outlier1,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q11')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q12')), -1, NULL, salary)))/2 Q1,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'M1')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'M2')), -1, NULL, salary)))/2 mean,
        (MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q21')), -1, NULL, salary)) +
        MIN(DECODE(SIGN(weights_cumm_sum - get_quartile_index(weights_sum, 'Q22')), -1, NULL, salary)))/2 Q3,
        MAX(DECODE(ranking, country_count, salary, NULL)) outlier2
FROM
    (SELECT country, student_id, salary, weight,
             SUM(weight) OVER (PARTITION BY country ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) weights_cumm_sum,
            SUM(weight) OVER (PARTITION BY country) weights_sum,
            RANK() OVER (PARTITION BY country ORDER BY salary) ranking,
            COUNT(*) OVER (PARTITION BY country) country_count
    FROM tqi_salary_test) s
GROUP BY country

The inner query orders the students in each country by their salary and calculates a cumulative sum of student_weighting. The outer query finds the minimum cumulative value where it equals or exceeds the required quartile value (supplied by the quartile index function) and returns the corresponding salary. The outliers are accessed simply based on their ranking by salary.

The above mentioned function to return the required quartile index looks like this:

CREATE OR REPLACE FUNCTION TQI.get_quartile_index(
    pn_count IN NUMBER,
    pv_index_required IN VARCHAR2
) RETURN NUMBER IS
    ln_m_1 NUMBER;
    ln_m_2 NUMBER;
    ln_q_1_1 NUMBER;
    ln_q_1_2 NUMBER;
    ln_q_2_1 NUMBER;
    ln_q_2_2 NUMBER;
    ln_output NUMBER;
BEGIN
    IF (MOD(pn_count, 2) = 1) THEN --Odd number of rows, e.g. 11 or 13.
        ln_m_1 := (pn_count+1)/2; --median row will be 6 or 7.
        ln_m_2 := ln_m_1;
        IF (MOD(ln_m_1, 2) = 1) THEN --median is odd number, say 7 
            ln_q_1_1 := (ln_m_1-1)/2; --first quartile is 3 and 4.
            ln_q_1_2 := (ln_m_1+1)/2; --first quartile is 3 and 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9 and 10.
            ln_q_2_2 := ln_m_1 + ln_q_1_2; --second quartile is 9 and 10.
        ELSE --median 1 is even, say 6.
            ln_q_1_1 := ln_m_1/2; --first quartile is 3.
            ln_q_1_2 := ln_q_1_1; --first quartile is 3.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9.
            ln_q_2_2 := ln_q_2_1; --second quartile is 9.
        END IF; 
    ELSE--Even number of rows, e.g. 12 or 14.
        ln_m_1 := pn_count/2; --median 1 will be 6 or 7  
        ln_m_2 := ln_m_1 + 1; --median 2 will be 7 or 8 
        IF (MOD(ln_m_1, 2) = 1) THEN --median 1 is odd number, say 7 
            ln_q_1_1 := (ln_m_1+1)/2; --first quartile is 4.
            ln_q_1_2 := ln_q_1_1; --first quartile is 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 11
            ln_q_2_2 := ln_q_2_1; --second quartile is 11.
        ELSE --median 1 is even, say 6.
            ln_q_1_1 := ln_m_1/2; --first quartile is 3 and 4.
            ln_q_1_2 := ln_q_1_1+1; --first quartile is 3 and 4.
            ln_q_2_1 := ln_m_1 + ln_q_1_1; --second quartile is 9 and 10.
            ln_q_2_2 := ln_q_2_1 + 1; --second quartile is 9 and 10.
        END IF; 
    END IF;
    
    CASE pv_index_required
    WHEN 'M1' THEN ln_output := ln_m_1;
    WHEN 'M2' THEN ln_output := ln_m_2;
    WHEN 'Q11' THEN ln_output := ln_q_1_1;
    WHEN 'Q12' THEN ln_output := ln_q_1_2;
    WHEN 'Q21' THEN ln_output := ln_q_2_1;
    WHEN 'Q22' THEN ln_output := ln_q_2_2;
    END CASE;
    
    RETURN ln_output;
END get_quartile_index;
/