Home > database >  Postgres determine top 10 domains by age group (rank group by)
Postgres determine top 10 domains by age group (rank group by)

Time:10-28

Given a user_table with email addresses, we want a 'top 10' list of domains, by age group. So for each group, I should get the top 10 rankings. (i.e. 50 rows).

What I have so far (I'm using Postgres). This seems to get close, but I think tied rankings are getting eaten. I’m not getting back 50 rows. I get back 12, which basically seems to be ranked 1-10 with 2 ties. All the same agegroup. If I increase it to r<=30 then I get back more results (71), including different age groups, but more then 10 per group (10-15 per group).

with users as (
    select a.*, 
      extract(year from age(dob)) age,
      substr(email, position('@' in email) 1, 1000) domain
    from user_table a
   ),
   useragegroup as (
    select a.*,
     case when age between 0 and 18 then '0-18'
          when age between 19 and 29 then '19-29'
          when age between 30 and 49 then '30-49' 
          when age between 50 and 65 then '50-65'
          else '66-up'
     end agegroup
    from users a
   ),
   rank as (
     select agegroup, domain, 
       dense_rank() over (order by count(*) desc) r
     from useragegroup a
     group by agegroup, domain
   )
   select a.*
   from rank a
   where r<=10;

To generate some test date I have: (Change date by 10 years for each group)

insert into user_table (
    first, last, email, dob
)
select
    left(md5(i::text), 3),
    left(md5(random()::text), 3),
    'user_' || i || '@' || (
    CASE (RANDOM() * 14)::INT
      WHEN 0 THEN 'gmail'
      WHEN 1 THEN 'hotmail'
      WHEN 2 THEN 'apple'
      WHEN 3 THEN 'icloud'
      WHEN 4 THEN 'aol'
      WHEN 5 THEN 'usa'
      WHEN 6 THEN 'govt'
      WHEN 7 THEN 'stackoverflow'
      WHEN 8 THEN 'random'
      WHEN 9 THEN 'domain'
      WHEN 10 THEN 'subby'
      WHEN 11 THEN 'youtube'
      WHEN 12 THEN 'google'
      WHEN 13 THEN 'triple'
      WHEN 14 THEN 'pixar'
    END
  ) || '.com' AS email,
    '2005-01-01' as date
from generate_series(1, 500) s(i);

CodePudding user response:

I think because you use dense_rank, you have duplicate rank and the total record has been increasing like below table:

total records: 13 rows

| agegroup | domain             | r  |
| -------- | ------------------ | -- |
| 66-up    | youtube.com        | 1  |
| 66-up    | triple.com         | 2  | <-- duplicate
| 66-up    | google.com         | 2  | <-- duplicate
| 66-up    | random.com         | 3  |
| 66-up    | usa.com            | 4  |
| 66-up    | aol.com            | 5  | <-- duplicate
| 66-up    | subby.com          | 5  | <-- duplicate
| 66-up    | hotmail.com        | 5  | <-- duplicate
| 66-up    | stackoverflow.com  | 6  |
| 66-up    | apple.com          | 7  |
| 66-up    | domain.com         | 8  |
| 66-up    | icloud.com         | 9  |
| 66-up    | govt.com           | 10 |

Your query has two problems:

  1. You should use row_number because dense_rank add duplicate rank and when you use r <= 10 if duplicate r exist in the record, a total record for each group has been an increase

  2. The second problem in windows function, you have to use partition by agegroup for each group because need create rank for each group

with users as (
    select a.*, 
      extract(year from age(dob)) as age,
      substr(email, position('@' in email) 1, 1000) as domain
    from user_table a
   ),
   useragegroup as (
    select a.*,
     case when age between 0 and 18 then '0-18'
          when age between 19 and 29 then '19-29'
          when age between 30 and 49 then '30-49' 
          when age between 50 and 65 then '50-65'
          else '66-up'
     end agegroup
    from users a
   ),
   rank as (
     select agegroup, domain, 
       row_number() over (partition by agegroup order by count(*) desc) r
     from useragegroup a
     group by agegroup, domain
   )
   select a.*
   from rank a
   where r <= 10;

CodePudding user response:

Your query may be just fine. Looks questionable but nothing specific stands out. You do however have a problem. Your expectation to get 50 rows in the results. I would guess this would be very rare. The main things being that neither rank nor dense_rank do not generate unique values, if the value being ranked is the same in multiple rows then each of the rows gets the same RANK. The difference being rank will skip values while dense_rank does not. I.e. if the first 2 rows have the have the same value and the third row a different then the the following holds:

 ------------ ------------- ------ ------------ 
| Row_number | Count_Value | Rank | Dense_Rank |
 ------------ ------------- ------ ------------ 
|          1 |          12 |    1 |          1 |
|          2 |          12 |    1 |          1 |
|          3 |          14 |    3 |          2 |
 ------------ ------------- ------ ------------ 

See demo with ""your data here. It includes columns for rank (rnk) and dense_rank (drnk). Scan down the rnk and/or drnk for the age_group you are interested in then over to the to row_num . That is the number of rows returned for that age_group. Notice that the drnk column does not get to 10 for some age_group; those will return all 15. Provided the random domain selection generated a row for each of the domains. While highly likely there is no guarantee of that.

BTW: My query. I created a table for age_groups, it is also in the demo.

select domain, ag_name, dom_cnt, rnk, drnk
  from ( -- rank each group by iten count
         select domain, ag_name, dom_cnt
              , rank()       over (partition by ag_name order by dom_cnt desc) rnk
              , dense_rank() over (partition by ag_name order by dom_cnt desc) drnk
              , row_number() over (partition by ag_name order by dom_cnt desc) row_num
           from ( -- count #items for each edomain, ag_name 
                  select domain, ag_name ,count(*) dom_cnt 
                    from (-- extract email domain and group name 
                          select substr(email, position('@' in email) 1) as domain, ag.ag_name 
                            from age_groups ag 
                            join user_table ut
                              on (extract(year from age(ut.dob)))::int4  <@ ag.ag_range 
                         ) agdom
                    group by  ag_name, domain
               ) dom_cnt 
        ) dom_rank 
-- where rnk <= 10
;               
  • Related