Let's say we have data as follows:
ID tag data timestamp
001 A walter 2021-06-04 09:46:25
005 F junior 2021-06-05 09:47:25
001 B junior 2021-06-04 09:47:25
002 C soprano 2021-06-04 09:48:25
002 C alto 2021-06-04 09:49:25
001 A brown 2021-06-04 09:50:25
003 A cleave 2021-06-04 09:51:25
003 B land 2021-06-04 09:52:25
004 C before 2021-06-04 09:53:25
005 H junior 2021-06-04 09:47:25
I need to know which tag
appears the most for each of the ID
values. In the instance of a tie, please use the most recent tag for the ID, indicated by timestamp.
Expected Result:
ID tag
001 A
002 C
003 B
004 C
005 F
CodePudding user response:
Using QUALIFY and RANK to filter the result of grouping:
SELECT ID, tag, COUNT(*) AS cnt, MAX(timestamp) AS max_t
FROM tab
GROUP BY ID, tag
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY cnt DESC, max_t DESC) = 1
Sample data:
CREATE OR REPLACE TABLE tab(ID STRING, tag STRING, data STRING, timestamp TIMESTAMP)
AS
SELECT '001', 'A' ,' walter','2021-06-04 09:46:25'
UNION ALL SELECT '005', 'F' ,' junior','2021-06-05 09:47:25'
UNION ALL SELECT '001', 'B' ,' junior','2021-06-04 09:47:25'
UNION ALL SELECT '002', 'C' ,'soprano','2021-06-04 09:48:25'
UNION ALL SELECT '002', 'C' ,' alto','2021-06-04 09:49:25'
UNION ALL SELECT '001', 'A' ,' brown','2021-06-04 09:50:25'
UNION ALL SELECT '003', 'A' ,' cleave','2021-06-04 09:51:25'
UNION ALL SELECT '003', 'B' ,' land','2021-06-04 09:52:25'
UNION ALL SELECT '004', 'C' ,' before','2021-06-04 09:53:25'
UNION ALL SELECT '005', 'H' ,' junior','2021-06-04 09:47:25';
Simplified query:
SELECT ID, tag
FROM tab
GROUP BY ID, tag
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY COUNT(*) DESC, MAX(timestamp) DESC) = 1
ORDER BY ID;
Output: