So I'm having the following issue, using Snowflake SQL.
I have a table with many columns, including User_ID, Event, USD_Amount, Date.
There are many duplicate columns, and I am thus trying to write a query which will return all the columns in this table, but only only one (most recent) row for each Event of each User_ID.
For example:
User_ID | Event | Date | USD_Amount | .... |
---|---|---|---|---|
432980 | Closed Account | 2021-01-30 15:30:10 | ... | .... |
434092 | Logged In | 2021-01-30 14:20:15 | ... | .... |
432980 | Created Account | 2021-01-20 11:45:25 | ... | .... |
432980 | Created Account | 2021-01-20 11:45:25 | ... | .... |
434092 | Created Account | 2021-01-19 10:30:05 | ... | .... |
434092 | Created Account | 2021-01-19 10:30:05 | ... | .... |
I'd want to return:
User_ID | Event | Date | USD_Amount | .... |
---|---|---|---|---|
432980 | Closed Account | 2021-01-30 15:30:10 | ... | .... |
434092 | Logged In | 2021-01-30 14:20:15 | ... | .... |
432980 | Created Account | 2021-01-20 11:45:25 | ... | .... |
434092 | Created Account | 2021-01-19 10:30:05 | ... | .... |
491831 | Logged Out | 2021-01-19 10:10:00 | ... | .... |
491831 | Created Account | 2021-01-18 20:18:40 | ... | .... |
I haven't found a similar question on StackOverflow and have only figured out solutions for this, such as using a partition or a sub-select when selecting the most recent row for each User_ID, not taking into account Event or another column. I am not sure how to do this taking into account more than just the User_ID and Date.
Thanks in advance!
CodePudding user response:
You may want to keep these three fields apart, and use an aggregation function like MAX
to retrieve the highest value of Date
, by grouping on User_ID
and Event
.
Once you have selected these unique three values, you could join back this selection with your original table and get the rest of the fields.
I would lay down something like this:
SELECT
max_dates.*,
tab.<col1>,
tab.<col2>,
...
FROM
(
SELECT
User_ID,
Event,
MAX(Date) AS Date
FROM
tab
GROUP BY
User_ID,
Event
) max_dates
INNER JOIN
tab
ON
max_dates.User_ID = tab.User_ID
AND
max_dates.Event = tab.Event
AND
max_dates.Date = tab.Date
Another option that can avoid issues when there are two rows with the same date can be using the ROW_NUMBER
window function to uniquely identify each Date
value, yet exploiting the partitioning on User_ID
and Event
:
WITH ranked_dates AS (
SELECT
tab.*,
(ROW_NUMBER() OVER(PARTITION BY (User_ID, Event) ORDER BY Date DESC)) rank
FROM
tab
)
SELECT
*
FROM
ranked_dates
WHERE
rank = 1
CodePudding user response:
QUALIFY allows you to run a filter after all the other stages of the WHERE/GROUPBY have completed. So pairing that with ROW_NUMBER() partitioned over the partitions, and order in descending, so the newest date is the first, we can the "select to keep" only that row:
SELECT *
FROM table_name
QUALIFY row_number() over (partition by User_ID, Event order by date desc) = 1;
Which is the same as Lemon's second example thus WHERE rank = 1
, but more compact.