Home > Blockchain >  How can I select the most recent unique record based on another column for each ID in SQL?
How can I select the most recent unique record based on another column for each ID in SQL?

Time:04-01

So I'm having the following issue, using Snowflake SQL.

I have a table with many columns, including User_ID, Event, USD_Amount, Date.

There are many duplicate columns, and I am thus trying to write a query which will return all the columns in this table, but only only one (most recent) row for each Event of each User_ID.

For example:

User_ID Event Date USD_Amount ....
432980 Closed Account 2021-01-30 15:30:10 ... ....
434092 Logged In 2021-01-30 14:20:15 ... ....
432980 Created Account 2021-01-20 11:45:25 ... ....
432980 Created Account 2021-01-20 11:45:25 ... ....
434092 Created Account 2021-01-19 10:30:05 ... ....
434092 Created Account 2021-01-19 10:30:05 ... ....

I'd want to return:

User_ID Event Date USD_Amount ....
432980 Closed Account 2021-01-30 15:30:10 ... ....
434092 Logged In 2021-01-30 14:20:15 ... ....
432980 Created Account 2021-01-20 11:45:25 ... ....
434092 Created Account 2021-01-19 10:30:05 ... ....
491831 Logged Out 2021-01-19 10:10:00 ... ....
491831 Created Account 2021-01-18 20:18:40 ... ....

I haven't found a similar question on StackOverflow and have only figured out solutions for this, such as using a partition or a sub-select when selecting the most recent row for each User_ID, not taking into account Event or another column. I am not sure how to do this taking into account more than just the User_ID and Date.

Thanks in advance!

CodePudding user response:

You may want to keep these three fields apart, and use an aggregation function like MAX to retrieve the highest value of Date, by grouping on User_ID and Event.

Once you have selected these unique three values, you could join back this selection with your original table and get the rest of the fields.

I would lay down something like this:

SELECT 
    max_dates.*,
    tab.<col1>,
    tab.<col2>,
    ...
FROM 
    (
    SELECT 
        User_ID,
        Event,
        MAX(Date) AS Date
    FROM 
        tab
    GROUP BY
        User_ID,
        Event
    ) max_dates
INNER JOIN 
    tab
ON 
    max_dates.User_ID = tab.User_ID 
AND
    max_dates.Event = tab.Event
AND
    max_dates.Date = tab.Date

Another option that can avoid issues when there are two rows with the same date can be using the ROW_NUMBER window function to uniquely identify each Date value, yet exploiting the partitioning on User_ID and Event:

WITH ranked_dates AS (
    SELECT
        tab.*,
        (ROW_NUMBER() OVER(PARTITION BY (User_ID, Event) ORDER BY Date DESC)) rank
    FROM    
        tab
)
SELECT 
    *
FROM 
    ranked_dates
WHERE 
    rank = 1

CodePudding user response:

QUALIFY allows you to run a filter after all the other stages of the WHERE/GROUPBY have completed. So pairing that with ROW_NUMBER() partitioned over the partitions, and order in descending, so the newest date is the first, we can the "select to keep" only that row:

SELECT *
FROM table_name
QUALIFY row_number() over (partition by User_ID, Event order by date desc) = 1;

Which is the same as Lemon's second example thus WHERE rank = 1, but more compact.

  • Related