Home > Back-end >  Calculate Exact Values Of Lower and Upper Quartiles
Calculate Exact Values Of Lower and Upper Quartiles

Time:11-09

I have been looking all around internet and it seems like there is no answer that match my case.

I am struggling with calculating the exact Lower and Upper Quartile in SQL Server. I am aware that SQL Server has a built in function that facilitates calculation of quartiles, the NTILE function. But that is not enough for my case.

Given the following table with values (note that the table containes more products and prices than in the table below):

AveragePrice ProductNumber Year
45.7820 2 2015
46.0142 2 2016
59.0133 2 2017
60.1707 2 2018
62.6600 2 2019

I am running the following query:

SELECT 
    AveragePrice
    ,NTILE(4) OVER (
        PARTITION BY ProductNumber ORDER BY AveragePrice
        ) AS Quartile
FROM products

Which gives the following result:

AveragePrice Quartile
45.7820 1
46.0142 1
59.0133 2
60.1707 3
62.6600 4

For full context the query in it's whole looks like this:

    SELECT ProductNumber
    ,MIN(AveragePrice) Minimum
    ,MAX(CASE 
            WHEN Quartile = 1
                THEN AveragePrice
            END) AS Quartile_1
    ,
    MAX(CASE 
            WHEN Quartile = 3
                THEN AveragePrice
            END) AS Quartile_3
    ,MAX(AveragePrice) Maximum
    ,COUNT(Quartile) AS 'Number of items'
FROM (
    SELECT ProductNumber
        ,AveragePrice   
        ,NTILE(4) OVER (
            PARTITION BY ProductNumber ORDER BY ProductNumber
            ) AS Quartile
    FROM #temp_products
    
    ) Vals
GROUP BY ProductNumber
ORDER BY ProductNumber

But when I manually calculate the quartile the first quartile should be: 45.8981 (average of the first and second row in this particular case) not 46.0142.

The third quartile should be 61.41535 (average of the third and second quartile in this particular case) not 60.1707 .

So to make it clear. This is a part of a stored procedure where multiple pricegroups is calculated and aggregated into groups containing average prices. I need to calculate the upper and lower quartiles from these average prices grouped by product number. The result set should contain the productnumber, lower quartile and upper quartile. Can someone help me or guide me in the correct direction?

CodePudding user response:

NTILE() is rounding up in some strange ways, in some cases. I'd rather use Integer division with ranks for grouping. This solution works with any number of values, with pondered average when needed.

LEAD is the the magic windowed function to catch next row's value

select *
    ,[Q] = case when [rank] in ((N 3)/4 ,(N 1)/2, (3*N 1)/4) then
                case [decimal] 
                when 0    then AveragePrice
                when 0.25 then /*pondered avg*/(3*AveragePrice    LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4
                when 0.5  then /*simple avg*/(    AveragePrice    LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 2
                when 0.75 then /*pondered avg*/(  AveragePrice  3*LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4 
                end
           end
from(
    select  *
        ,[rank]     = ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)
        ,[N]        = SUM(1)over()
        ,[group4]   = ((ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)-1 )*4 / SUM(1)over())
        ,[decimal]  = case /*rank*/ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice) 
                        when /*Q1*/  (SUM(1)over() 3)/4   then   (SUM(1)over() 3)/4.0 - FLOOR((SUM(1)over() 3)/4.0) 
                        when /*Q2*/  (SUM(1)over() 1)/2   then   (SUM(1)over() 1)/2.0 - FLOOR((SUM(1)over() 1)/2.0)
                        when /*Q3*/(3*SUM(1)over() 1)/4   then (3*SUM(1)over() 1)/4.0 - FLOOR((3*SUM(1)over() 1)/4.0)
                      end
    from
    (values(45.7820,2,2015),(46.0142,2,2016),(59.0133,2,2017),(60.1707,2,2018),(62.6600,2,2019))a(AveragePrice,ProductNumber,Year)
  )a
AveragePrice ProductNumber Year rank N group4 decimal Q
45.7820 2 2015 1 5 0 NULL NULL
46.0142 2 2016 2 5 0 0.000000 46.014200
59.0133 2 2017 3 5 1 0.000000 59.013300
60.1707 2 2018 4 5 2 0.000000 60.170700
62.6600 2 2019 5 5 3 NULL NULL

CodePudding user response:

Okay, inspired by this post I managed to build a query which actually calculates the exact quartiles:

  -- ; since it is being used in a sp
    ;WITH quartile_data AS (
    
        SELECT Price,
             ProductNumber
        FROM (values(29.4785,2,2015),(30.0000,2,2016),(33.4762,2,2017),(35.2917,2,2018),(35.8731,2,2018),(36.2475,2,2018),(37.9790,2,2018),(39.5846,2,2018),(67.4443,2,2018))sales(Price,ProductNumber)
     
    )
   --Aggregate into a single record for each group, using MAX to select the non-null 
   --detail value for each column
   -- ISNULL check to include groups with three values as well
    SELECT ProductNumber, 
        (Max(Q1NextVal) - MAX(Q1Val)) * Max(Q1Frac)   Max(Q1Val) as [Q1],
        (Max(MidVal1)   Max(MidVal2)) / 2 [Median],
        (ISNULL(Max(Q3NextVal),0) - MAX(Q3Val)) * Max(Q3Frac)   Max(Q3Val) as [Q3]
    
    -- save the result into a temp table
    INTO #my_temp_table
    FROM (
--Expose the detail values for only the records at the index values 
--generated by the summary subquery. All other values are left as NULL. som NULL.
        SELECT detail.ProductNumber, 
            CASE WHEN RowNum = Q1Idx THEN Price ELSE NULL END Q1Val,
            CASE WHEN RowNum = Q1Idx   1 THEN Price ELSE NULL END Q1NextVal,
            CASE WHEN RowNum = Q3Idx THEN Price ELSE NULL END Q3Val,
            CASE WHEN RowNum = Q3Idx   1 THEN Price ELSE NULL END Q3NextVal,
            Q1Frac,
            Q3Frac,
            CASE WHEN RowNum = MidPt1 THEN Price ELSE NULL END MidVal1,
            CASE WHEN RowNum = MidPt2 THEN Price ELSE NULL END MidVal2
        FROM
               --Calculate a row number sorted by measure for each group.
            (SELECT *,  ROW_NUMBER() OVER (PARTITION BY ProductNumber ORDER BY Price) RowNum
            FROM  quartile_data) AS detail
    
        INNER JOIN (
    --Summarize to find index numbers and fractions we need to use to locate 
    --the values at the quartile points.
    -- The modulus operator is used to sum the correct number if the number of rows in the group is even or uneven
            SELECT ProductNumber, 
                FLOOR((COUNT(*)   IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Idx,
                ((COUNT(*)   IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) - FLOOR((COUNT(*)   IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Frac,
                (COUNT(*)   1) / 2 AS MidPt1,
                (COUNT(*)   2) / 2 AS Midpt2,
                  FLOOR((COUNT(*) * 3   IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Idx,
                ((COUNT(*) * 3   IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) - FLOOR((COUNT(*) * 3   IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Frac
            FROM  quartile_data
            GROUP BY ProductNumber
            HAVING COUNT(*) > 1
        
        ) AS summary ON detail.ProductNumber  = summary.ProductNumber
    
    ) AS step_two
    GROUP BY ProductNumber
   -- Include only groups with more than 2 rows
    HAVING count(*) > 2

The following prices: 29.4785 30.0000 33.4762 35.2917 35.8731 36.2475 37.9790 39.5846 67.4443

Gives the correct value: Q1 = 31.7381000000 and Q3 = 38.7818000000

Verified using this online tool

  • Related