I would like to use Postgres SQL to assign an age category to a list of househoulds, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id | household_size |
---|---|
x1 | 5 |
x2 | 1 |
x3 | 8 |
... | ... |
I then have a set of percentages for each age group with that dataset looking like:
age_group | percentage |
---|---|
0-18 | 30 |
19-30 | 40 |
31-100 | 30 |
I want the query to calculate overall what will make the whole dataset as close to the percentages as possible and if possible similar at the household level(not as important). the dataset will end up looking like:
household_id | household_size | 0-18 | 19-30 | 31-100 |
---|---|---|---|---|
x1 | 5 | 2 | 2 | 1 |
x2 | 1 | 0 | 1 | 0 |
x3 | 8 | 3 | 3 | 2 |
... | ... | ... | .... | ... |
I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.
CodePudding user response:
A minimum start could be:
SELECT
household_id,
MIN(household_size) as size,
ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT
h.household_id,
h.household_size,
p.agegroup_from,
p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
output:
household_id | size | g1 | g2 | g3 |
---|---|---|---|---|
x1 | 5 | 1.5 | 2.0 | 1.5 |
x2 | 1 | 0.3 | 0.4 | 0.3 |
x3 | 8 | 2.4 | 3.2 | 2.4 |
see: DBFIDDLE
Notes:
- Of course you should round the columns
g
to whole numbers, taking into account the complete split (g1 g2 g3 = total) - Because g1,g2 and g3 are based on percentages, their values can change (as long as the total split is OK.... (see, for more info: Return all possible combinations of values on columns in SQL )
CodePudding user response:
Enjoy this slower, less readable, worse version of depesz weighted_random
, but in plain sql:
CREATE FUNCTION weighted_random(IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
with recursive
setup as (
select ( random() * (SELECT sum(x) AS v_weight_sum
FROM unnest(p_weights) as q(x) ) )
as v_random )
,cte as (
select v_random,
2::int4 as next_v_i
from setup
union all
select v_random-p_weights[next_v_i-1] as v_random,
next_v_i 1 as next_v_i
from cte
where (next_v_i)<=array_upper(p_weights, 1)
)
select p_choices[next_v_i-1]
from cte
where v_random < p_weights[next_v_i-1]
$$;
The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don't have to, nor do they have to sum up to any specific number):
update test_area t
set ("0-18",
"19-30",
"31-100")
= (with cte AS (
select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}')
as age_group
from generate_series(1,household_size,1))
select count(*) filter (where age_group='0-18') as "0-18",
count(*) filter (where age_group='19-30') as "19-30",
count(*) filter (where age_group='31-100') as "31-100"
from cte)
returning *;
Online demo showing that both his version and mine are statistically reliable.
I didn't want to post an answer with just a link so I figured I'll give it a shot and see if I can simplify his plpgsql version to plain sql. Clearly failed, had fun nonetheless, so there you go.