I have a table, response_data_v0, that contains several duplicates of the UUID values. I want to ignore those duplicates, keeping just the row that contains the oldest date (in other words, I only want the row with the first appearance of that specific UUID).
I built a temporary table using "with" and filtered it by selecting "min(uuid)" as a column just to get unique values; then I created a second query that counts those values.
Given that the UUID values must be unique (following my logic), I created a validation column, "excess_data", to test my hypothesis. All the values of "excess_data" should be 0 if the first table contains no duplicates, since in that case
count(uuid) = count(distinct uuid)
BUT that is not happening: "excess_data" > 0 in all my results.
What am I doing wrong??
with unique_values as (
    SELECT
        min(uuid) as uuid,  -- one uuid per group
        url,
        day,
        month,
        year
    FROM "data_lake"."response_data_v0"
    group by url, day, month, year
)
SELECT
    count(uuid) as count_uuids,
    count(distinct uuid) as count_unique_uuids,
    count(uuid) - count(distinct uuid) as excess_data,
    month,
    year
FROM unique_values
group by year, month
order by year, month
CodePudding user response:
I would argue that the problem here is the different GROUP BY clauses used to filter out duplicates and to verify that filtering. Unless the business logic of the app creating the data prevents the same uuid from appearing on different days, the observed behavior is pretty much bound to happen: the unique_values grouping clause includes day, while the final select's does not. Either add day to the GROUP BY clause of the result query or remove it from unique_values's.
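For example, keeping day on both sides (a minimal sketch that reuses your unique_values CTE unchanged), the verification query becomes:
SELECT
    count(uuid) as count_uuids,
    count(distinct uuid) as count_unique_uuids,
    -- should now be 0, unless the same uuid repeats across urls within a day
    count(uuid) - count(distinct uuid) as excess_data,
    day,
    month,
    year
FROM unique_values
group by year, month, day
order by year, month, day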
CodePudding user response:
PrestoDB/Trino have a min_by aggregate function that returns the value of a column associated with the minimum value of another column (of the group): min_by(uuid, my_date) would return the value of the uuid column of the row with the minimum value of my_date in the group. Just assemble a string from your year, month, day columns and use that in place of my_date.
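A minimal sketch of that idea, assuming year, month, and day are integer columns and that you want one surviving row per url (zero-padding the parts makes the assembled string sort chronologically):
SELECT
    url,
    -- uuid from the row with the earliest assembled date for this url
    min_by(
        uuid,
        lpad(cast(year as varchar), 4, '0') || '-' ||
        lpad(cast(month as varchar), 2, '0') || '-' ||
        lpad(cast(day as varchar), 2, '0')
    ) as uuid
FROM "data_lake"."response_data_v0"
group by url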