SQL: Selecting top proportion of column and corresponding attributes in other columns-CodePudding

Say I have data like so

group,      city,  user,
nyc_frisbee, nyc,   joe,
nyc_frisbee, nyc,   ben,
nyc_frisbee,  sf,   tim,
sf_cooking,   sf,   tim,
sf_cooking,  atl,   jon,

My goal is to get

group,       top_city, prob
nyc_firsbee, nyc,      .66
sf_cooking,  sf,       .5
sf_ccoking,  atl,      .5

Edge case: In this table below, nyc_frisbee shows up only once because there is a single top city. However, sf_cooking shows up twice because sf and atl are tied.

How can I accomplish this in SQL?

What I have so far:

SELECT group, city, top/total AS prob
FROM (
     SELECT 
         group, 
         city, 
         freq,
         MAX(freq) OVER(PARTITION BY group) AS top,
         SUM(freq) OVER(PARTITION BY group) AS total
     FROM
         (SELECT group, city, COUNT(city) AS freq
          FROM mytable
          GROUP BY group, city) inner_query
     ) outer_query
WHERE outer.freq = outer.top

This doesn't behave the way that I intended...

group,       top_city, prob
nyc_firsbee, nyc,      .66
nyc_firsbee, sf,       .66
sf_cooking,  sf,       .5
sf_ccoking,  atl,      .5

So I'm getting one row per each unique group/city combination with the same prob value.

CodePudding user response：

How about something like:

WITH x as (
  SELECT group, city, CAST(COUNT(*) AS FLOAT)/SUM(COUNT(*)) OVER(PARTITION BY group) AS prob
  FROM mytable
  GROUP BY group, city
)

SELECT x.*
FROM 
  x 
  INNER JOIN 
  (SELECT group, MAX(prob) maxprob FROM x GROUP BY group) y 
  ON 
    x.group = y.group AND
    x.prob = y.maxprob

We boil our group/cities down to

group,      city,  prob,
nyc_frisbee, nyc,   0.66,
nyc_frisbee,  sf,   0.33,
sf_cooking,   sf,   0.5,
sf_cooking,  atl,   0.5,

and alias as X, then we join it to a grouping of itself on the group, max(prob). Because sf_cooking, 0.5 is the max, the join matches twice, preserving the tie, but nyc's max is 0.66 which matches once, excluding the 0.33

CodePudding user response：

See if the following works for you. First calculating the probablity, then rank the groups ordered by the probablity, finally getting distinct rows.

with x as (
select *,
    Count(*) over(partition by group, City) * 1.0 / Count(*) over(partition by group) prob
from t
), r as (
    select *, dense_rank() over(partition by group order by prob desc) rn
    from x
)
select distinct group, city, prob
from r
where rn=1