Assign an age to a person based on known population average but no Date of birth-CodePudding

I would like to use Postgres SQL to assign an age category to a list of househoulds, where we don't know the date of birth of any of the family members.

Dataset looks like:

household_id	household_size
x1	5
x2	1
x3	8
...	...

I then have a set of percentages for each age group with that dataset looking like:

age_group	percentage
0-18	30
19-30	40
31-100	30

I want the query to calculate overall what will make the whole dataset as close to the percentages as possible and if possible similar at the household level(not as important). the dataset will end up looking like:

household_id	household_size	0-18	19-30	31-100
x1	5	2	2	1
x2	1	0	1	0
x3	8	3	3	2
...	...	...	....	...

I have looked at the ntile function but any pointers as to how I could handle this with postgres would be really helpful.

CodePudding user response：

A minimum start could be:

SELECT
  household_id,
  MIN(household_size) as size,
  ROUND(SUM(CASE WHEN agegroup_from=0 THEN g ELSE 0 END),1) as g1,
  ROUND(SUM(CASE WHEN agegroup_from=19 THEN g ELSE 0 END),1) as g2,
  ROUND(SUM(CASE WHEN agegroup_from=31 THEN g ELSE 0 END),1) as g3
FROM (
SELECT 
  h.household_id,
  h.household_size,
  p.agegroup_from,
  p.percentage/100.0 * h.household_size as g
FROM households h
CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;

output:

household_id	size	g1	g2	g3
x1	5	1.5	2.0	1.5
x2	1	0.3	0.4	0.3
x3	8	2.4	3.2	2.4

see: DBFIDDLE

Notes:

Of course you should round the columns g to whole numbers, taking into account the complete split (g1 g2 g3 = total)
Because g1,g2 and g3 are based on percentages, their values can change (as long as the total split is OK.... (see, for more info: Return all possible combinations of values on columns in SQL )

CodePudding user response：

Enjoy this slower, less readable, worse version of depesz weighted_random, but in plain sql:

CREATE FUNCTION weighted_random(IN p_choices ANYARRAY, IN p_weights float8[] ) 
RETURNS ANYELEMENT language sql as $$
with recursive 
  setup as (
    select ( random() * (SELECT sum(x) AS v_weight_sum 
                         FROM unnest(p_weights) as q(x) ) ) 
           as v_random )
  ,cte as (
    select v_random,
           2::int4 as next_v_i
    from setup
    union all
    select v_random-p_weights[next_v_i-1] as v_random,
           next_v_i   1                   as next_v_i
    from   cte
    where  (next_v_i)<=array_upper(p_weights, 1)
)
select p_choices[next_v_i-1]
from cte
where v_random < p_weights[next_v_i-1]
$$;

The principle is that you provide any array of choices and an equal length array of weights (those can be percentages but don't have to, nor do they have to sum up to any specific number):

update test_area t
set ("0-18",
     "19-30",
     "31-100")
=   (with cte AS (
      select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}') 
                as age_group
      from generate_series(1,household_size,1))
     select count(*) filter (where age_group='0-18')   as "0-18",
            count(*) filter (where age_group='19-30')  as "19-30",
            count(*) filter (where age_group='31-100') as "31-100"
     from cte)
returning *;

Online demo showing that both his version and mine are statistically reliable.

I didn't want to post an answer with just a link so I figured I'll give it a shot and see if I can simplify his plpgsql version to plain sql. Clearly failed, had fun nonetheless, so there you go.