I'm trying to get the top 3 rows from col2 for each letter in col1 based on the count of col2
My table looks like this:
col1 | col2 |
---|---|
A | 1 |
B | 2 |
A | 2 |
B | 2 |
B | 1 |
B | 1 |
B | 1 |
A | 3 |
A | 2 |
B | 4 |
A | 2 |
B | 2 |
A | 3 |
A | 4 |
So for example, here I have that A=1 one time, A=2 three times and A=3 two times (similar for B).
I would like my output to look like this:
col1 | col2 |
---|---|
A | 2 |
A | 3 |
A | 1 |
B | 1 |
B | 2 |
B | 4 |
Where it shows me the top 3 values (col2) (based on the count) of A for col1. Same for B.
I tried doing:
SELECT col1, col2
FROM data
WHERE col2 = (SELECT COUNT(col2)
FROM data
WHERE col1 = data.col1
ORDER BY COUNT(col2) desc
LIMIT(3)
)
GROUP BY col1, col2
but I get an empty table
I'm also trying:
SELECT col1, col2, rn
FROM (SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY COUNT(col2) DESC) AS rn
FROM data) tmp
WHERE rn <= 3
ORDER BY col1, rn
but it doesn't work (I think the problem is when I tell it to count).
Any ideas? I'm pretty new with SQL and can't figure it out. I could also do it using dataframes (pyspark) but I'm not sure what the equivalent for OVER PARTITION would be in API dataframe..
CodePudding user response:
UPDATE: I figured it out. This is the correct query:
SELECT col1, col2, x FROM (SELECT col1, col2, count(col2) AS x, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY count(col2) DESC) AS rn FROM data GROUP BY col1, col2) tmp WHERE rn <= 3 ORDER BY col1
CodePudding user response:
try this
SELECT col1,col2
FROM (
SELECT col1,col2,
row_number() over(PARTITION BY col1 ORDER BY cnt DESC) as rn
FROM (
SELECT col1,col2,COUNT(1) as cnt
FROM `table`
group by col1,col2
) t0
) t1
WHERE rn <= 3