Home > Blockchain >  SQL: Selecting top proportion of column and corresponding attributes in other columns
SQL: Selecting top proportion of column and corresponding attributes in other columns

Time:12-02

Say I have data like so

group,      city,  user,
nyc_frisbee, nyc,   joe,
nyc_frisbee, nyc,   ben,
nyc_frisbee,  sf,   tim,
sf_cooking,   sf,   tim,
sf_cooking,  atl,   jon, 

My goal is to get

group,       top_city, prob
nyc_firsbee, nyc,      .66
sf_cooking,  sf,       .5
sf_ccoking,  atl,      .5

Edge case: In this table below, nyc_frisbee shows up only once because there is a single top city. However, sf_cooking shows up twice because sf and atl are tied.

How can I accomplish this in SQL?

What I have so far:

SELECT group, city, top/total AS prob
FROM (
     SELECT 
         group, 
         city, 
         freq,
         MAX(freq) OVER(PARTITION BY group) AS top,
         SUM(freq) OVER(PARTITION BY group) AS total
     FROM
         (SELECT group, city, COUNT(city) AS freq
          FROM mytable
          GROUP BY group, city) inner_query
     ) outer_query
WHERE outer.freq = outer.top 

This doesn't behave the way that I intended...

group,       top_city, prob
nyc_firsbee, nyc,      .66
nyc_firsbee, sf,       .66
sf_cooking,  sf,       .5
sf_ccoking,  atl,      .5

So I'm getting one row per each unique group/city combination with the same prob value.

CodePudding user response:

How about something like:

WITH x as (
  SELECT group, city, CAST(COUNT(*) AS FLOAT)/SUM(COUNT(*)) OVER(PARTITION BY group) AS prob
  FROM mytable
  GROUP BY group, city
)

SELECT x.*
FROM 
  x 
  INNER JOIN 
  (SELECT group, MAX(prob) maxprob FROM x GROUP BY group) y 
  ON 
    x.group = y.group AND
    x.prob = y.maxprob 

We boil our group/cities down to

group,      city,  prob,
nyc_frisbee, nyc,   0.66,
nyc_frisbee,  sf,   0.33,
sf_cooking,   sf,   0.5,
sf_cooking,  atl,   0.5, 

and alias as X, then we join it to a grouping of itself on the group, max(prob). Because sf_cooking, 0.5 is the max, the join matches twice, preserving the tie, but nyc's max is 0.66 which matches once, excluding the 0.33

CodePudding user response:

See if the following works for you. First calculating the probablity, then rank the groups ordered by the probablity, finally getting distinct rows.

with x as (
select *,
    Count(*) over(partition by group, City) * 1.0 / Count(*) over(partition by group) prob
from t
), r as (
    select *, dense_rank() over(partition by group order by prob desc) rn
    from x
)
select distinct group, city, prob
from r
where rn=1
  • Related