Selecting top 3 values (based on their count) from each category

I'm trying to get the top 3 col2 values for each letter in col1, ranked by how often each col2 value occurs.

My table looks like this:

col1	col2
A	1
B	2
A	2
B	2
B	1
B	1
B	1
A	3
A	2
B	4
A	2
B	2
A	3
A	4

So, for example, A=1 occurs once, A=2 three times, A=3 twice and A=4 once (and similarly for B).

I would like my output to look like this:

col1	col2
A	2
A	3
A	1
B	1
B	2
B	4

That is, it should show the top 3 col2 values for A, ranked by their count, and the same for B.

I tried doing:

SELECT col1, col2
FROM data
WHERE col2 = (SELECT COUNT(col2)
                    FROM data
                    WHERE col1 = data.col1
                    ORDER BY COUNT(col2) desc
                    LIMIT(3)
                    )
GROUP BY col1, col2

but I get an empty table
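
A likely reason for the empty result, sketched below with Python's built-in sqlite3 module (an assumption made only so the check is runnable; other engines may behave differently or reject the query): inside the subquery, data.col1 resolves to the subquery's own table, so the filter compares the column with itself, and COUNT(col2) without GROUP BY returns a single number, the total row count. The outer WHERE then compares col2 with that number, which no row matches.

```python
import sqlite3

# The sample table from the question.
ROWS = [("A", 1), ("B", 2), ("A", 2), ("B", 2), ("B", 1), ("B", 1), ("B", 1),
        ("A", 3), ("A", 2), ("B", 4), ("A", 2), ("B", 2), ("A", 3), ("A", 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (col1 TEXT, col2 INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)", ROWS)

# The core of the subquery: col1 = data.col1 is true for every row,
# so this counts the whole table instead of one category.
subquery_value = conn.execute(
    "SELECT COUNT(col2) FROM data WHERE col1 = data.col1"
).fetchone()[0]

# The largest col2 value, for comparison with the count above.
max_col2 = conn.execute("SELECT MAX(col2) FROM data").fetchone()[0]

# The outer filter, with the subquery replaced by its value.
matching = conn.execute(
    "SELECT col1, col2 FROM data WHERE col2 = ?", (subquery_value,)
).fetchall()

print(subquery_value, max_col2, matching)
```

The count is larger than any col2 value in the table, so the filter keeps nothing.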

I'm also trying:

SELECT col1, col2, rn
FROM (SELECT col1, col2,
      ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
    FROM data) tmp 
WHERE rn <= 3
ORDER BY col1, rn

but it doesn't work (I think the problem is in how I tell it to count).

Any ideas? I'm pretty new to SQL and can't figure it out. I could also do this with PySpark DataFrames, but I'm not sure what the DataFrame API equivalent of OVER (PARTITION BY ...) would be.

CodePudding user response：

UPDATE: I figured it out. This is the correct query:

SELECT col1, col2, x
FROM (SELECT col1, col2,
             COUNT(col2) AS x,
             ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
      FROM data
      GROUP BY col1, col2) tmp
WHERE rn <= 3
ORDER BY col1
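
To make the logic of that query easier to follow, here is a rough plain-Python sketch of the same three steps: group by (col1, col2), count, then keep the three most frequent col2 values per col1. The function name top_n_per_group is hypothetical, and the tie handling (first-seen value wins) comes from Counter.most_common, not from SQL, where ROW_NUMBER() breaks ties arbitrarily.

```python
from collections import Counter, defaultdict

# The sample table from the question.
ROWS = [("A", 1), ("B", 2), ("A", 2), ("B", 2), ("B", 1), ("B", 1), ("B", 1),
        ("A", 3), ("A", 2), ("B", 4), ("A", 2), ("B", 2), ("A", 3), ("A", 4)]


def top_n_per_group(rows, n=3):
    """Return {col1: [(col2, count), ...]} with the n most frequent col2 values."""
    counts = defaultdict(Counter)
    for col1, col2 in rows:
        counts[col1][col2] += 1  # same as GROUP BY col1, col2 with COUNT
    # most_common(n) plays the role of ROW_NUMBER() ... <= n per partition.
    return {col1: counter.most_common(n) for col1, counter in sorted(counts.items())}


top3 = top_n_per_group(ROWS)
for col1, ranked in top3.items():
    for col2, x in ranked:
        print(col1, col2, x)
```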

CodePudding user response：

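Try this. The inner query counts each (col1, col2) pair, and the outer one ranks those counts within each col1:

SELECT col1, col2
FROM (
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY cnt DESC) AS rn
    FROM (
        SELECT col1, col2, COUNT(1) AS cnt
        FROM data
        GROUP BY col1, col2
    ) t0
) t1
WHERE rn <= 3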
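
One thing to watch for with the sample data: some counts tie (for A, the values 1 and 4 each occur once; for B, the values 1 and 2 each occur three times), and ROW_NUMBER() may order tied rows arbitrarily. The sketch below is a variation, not the query above verbatim: it adds col2 as a second sort key and an outer ORDER BY col1, rn so the result is deterministic. It runs against Python's built-in sqlite3 module purely for illustration, assuming the bundled SQLite is 3.25 or newer (the first version with window functions), and uses the table name data from the question.

```python
import sqlite3

# The sample table from the question.
ROWS = [("A", 1), ("B", 2), ("A", 2), ("B", 2), ("B", 1), ("B", 1), ("B", 1),
        ("A", 3), ("A", 2), ("B", 4), ("A", 2), ("B", 2), ("A", 3), ("A", 4)]

QUERY = """
SELECT col1, col2
FROM (
    SELECT col1, col2,
           -- col2 is an added tie-breaker: smaller value first on equal counts
           ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY cnt DESC, col2) AS rn
    FROM (
        SELECT col1, col2, COUNT(1) AS cnt
        FROM data
        GROUP BY col1, col2
    ) t0
) t1
WHERE rn <= 3
ORDER BY col1, rn
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (col1 TEXT, col2 INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?)", ROWS)

result = conn.execute(QUERY).fetchall()
for col1, col2 in result:
    print(col1, col2)
```

With this tie-breaker the rows come back in the same order as the desired output in the question. If tied values should all be kept instead, RANK() or DENSE_RANK() in place of ROW_NUMBER() is the usual alternative, at the cost of possibly returning more than three rows per category.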