Problem statement is to calculate median from a table that has two columns. One specifying a number and the other column specifying the frequency of the number. For e.g. Table "Numbers":
Num | Freq |
---|---|
1 | 3 |
2 | 3 |
This median needs to be found for the flattened array with values: 1,1,1,2,2,2
Query:
with ct1 as
(select num,frequency, sum(frequency) over(order by num) as sf from numbers o)
select case when count(num) over(order by num) = 1 then num
when count(num) over (order by num) > 1 then sum(num)/2 end median
from ct1 b where sf <= (select max(sf)/2 from ct1) or (sf-frequency) <= (select max(sf)/2 from ct1)
Is it not possible to use count(num) over(order by num) as the condition in the case statement?
CodePudding user response:
Find the relevant row / 2 rows based of the accumulated frequencies, and take the average of num
.
The linked Fiddle will also show you the
computations leading to the result.
- If you already know that
num
is unique, you can eliminate the 1st CTE. - If you have an ID column, it can be added to the ORDER BY clauses, and once again the 1st CTE can be eliminated.
with
t1 as
(
select num
,sum(freq) as freq
from t
group by num
),
t2 as
(
select t1.*
,nvl(sum(freq) over (order by num, freq rows between unbounded preceding and 1 preceding),0) as freq_acc_sum_1
,sum(freq) over (order by num, freq) as freq_acc_sum_2
,sum(freq) over () as freq_sum
from t1
)
select avg(num)
from t2
where freq_sum/2 between freq_acc_sum_1 and freq_acc_sum_2
Example:
NUM | FREQ | FREQ_ACC_SUM_1 | FREQ_ACC_SUM_2 | FREQ_SUM | RELEVANT_RECORD |
---|---|---|---|---|---|
20 | 2 | 0 | 2 | 30 | |
24 | 2 | 2 | 4 | 30 | |
28 | 3 | 4 | 7 | 30 | |
31 | 3 | 7 | 10 | 30 | |
42 | 2 | 10 | 12 | 30 | |
48 | 2 | 12 | 14 | 30 | |
55 | 1 | 14 | 15 | 30 | V |
69 | 5 | 15 | 20 | 30 | V |
71 | 5 | 20 | 25 | 30 | |
73 | 5 | 25 | 30 | 30 |
MEDIAN |
---|
62 |
CodePudding user response:
You can find the one (or two) middle value(s) and then average:
SELECT AVG(num) AS median
FROM (
SELECT num,
freq,
SUM(freq) OVER (ORDER BY num) AS cum_freq,
(SUM(freq) OVER () 1)/2 AS median_freq
FROM table_name
)
WHERE cum_freq - freq < median_freq
AND median_freq < cum_freq 1
Or, expand the values using a LATERAL
join to a hierarchical query and then use the MEDIAN
function:
SELECT MEDIAN(num) AS median
FROM table_name t
CROSS JOIN LATERAL (
SELECT LEVEL
FROM DUAL
WHERE freq > 0
CONNECT BY LEVEL <= freq
)
Which, for the sample data:
CREATE TABLE table_name (Num, Freq) AS
SELECT 1, 3 FROM DUAL UNION ALL
SELECT 2, 3 FROM DUAL;
Outputs:
MEDIAN 1.5
(Note: For your sample data, there are 6 items, an even number, so the MEDIAN
will be half way between the value of 3rd and 4rd items; so half way between 1 and 2 = 1.5.)
db<>fiddle here