How can I select the most recent unique record based on another column for each ID in SQL?-CodePudding

So I'm having the following issue, using Snowflake SQL.

I have a table with many columns, including User_ID, Event, USD_Amount, Date.

There are many duplicate columns, and I am thus trying to write a query which will return all the columns in this table, but only only one (most recent) row for each Event of each User_ID.

For example:

User_ID	Event	Date	USD_Amount	....
432980	Closed Account	2021-01-30 15:30:10	...	....
434092	Logged In	2021-01-30 14:20:15	...	....
432980	Created Account	2021-01-20 11:45:25	...	....
432980	Created Account	2021-01-20 11:45:25	...	....
434092	Created Account	2021-01-19 10:30:05	...	....
434092	Created Account	2021-01-19 10:30:05	...	....

I'd want to return:

User_ID	Event	Date	USD_Amount	....
432980	Closed Account	2021-01-30 15:30:10	...	....
434092	Logged In	2021-01-30 14:20:15	...	....
432980	Created Account	2021-01-20 11:45:25	...	....
434092	Created Account	2021-01-19 10:30:05	...	....
491831	Logged Out	2021-01-19 10:10:00	...	....
491831	Created Account	2021-01-18 20:18:40	...	....

I haven't found a similar question on StackOverflow and have only figured out solutions for this, such as using a partition or a sub-select when selecting the most recent row for each User_ID, not taking into account Event or another column. I am not sure how to do this taking into account more than just the User_ID and Date.

Thanks in advance!

CodePudding user response：

You may want to keep these three fields apart, and use an aggregation function like MAX to retrieve the highest value of Date, by grouping on User_ID and Event.

Once you have selected these unique three values, you could join back this selection with your original table and get the rest of the fields.

I would lay down something like this:

SELECT 
    max_dates.*,
    tab.<col1>,
    tab.<col2>,
    ...
FROM 
    (
    SELECT 
        User_ID,
        Event,
        MAX(Date) AS Date
    FROM 
        tab
    GROUP BY
        User_ID,
        Event
    ) max_dates
INNER JOIN 
    tab
ON 
    max_dates.User_ID = tab.User_ID 
AND
    max_dates.Event = tab.Event
AND
    max_dates.Date = tab.Date

Another option that can avoid issues when there are two rows with the same date can be using the ROW_NUMBER window function to uniquely identify each Date value, yet exploiting the partitioning on User_ID and Event:

WITH ranked_dates AS (
    SELECT
        tab.*,
        (ROW_NUMBER() OVER(PARTITION BY (User_ID, Event) ORDER BY Date DESC)) rank
    FROM    
        tab
)
SELECT 
    *
FROM 
    ranked_dates
WHERE 
    rank = 1

CodePudding user response：

QUALIFY allows you to run a filter after all the other stages of the WHERE/GROUPBY have completed. So pairing that with ROW_NUMBER() partitioned over the partitions, and order in descending, so the newest date is the first, we can the "select to keep" only that row:

SELECT *
FROM table_name
QUALIFY row_number() over (partition by User_ID, Event order by date desc) = 1;

Which is the same as Lemon's second example thus WHERE rank = 1, but more compact.