Creating a VIEW to remove duplicates from table based on max date-CodePudding

I have a table that appends data each day and records the imported date, however it appends duplicates.

My end goal here is to remove the duplicates based on the lowest imported column date.

Here would be the initial state of that table:

TABLE CLIENTS

Name	Surname	Imported
Bob	John	18-07-2022
Marta	White	18-07-2022
Ryan	Max	18-07-2022
Bob	John	20-07-2022
Marta	White	20-07-2022
Ryan	Max	20-07-2022
Brian	Red	20-07-2022

Desired state:

Name	Surname	Imported
Bob	John	18-07-2022
Marta	White	18-07-2022
Ryan	Max	18-07-2022
Brian	Red	20-07-2022

I figured if I could create a view and match on all rows, where I would remove the MAX(imported) columns? Although not sure if this is the correct approach.

Once I create the view with the MAX columns:

CREATE VIEW CLIENTS_VIEW AS
SELECT * FROM CLIENTS
WHERE "Imported" = ( SELECT MAX("Imported") FROM CLIENTS);

Although Not sure how I could now match the view on the original table and only keep the lowest values per Imported column?

CodePudding user response：

one way you could try is using the analytical function,

select * from (
select column1 as Name, column2 as Surname, column3 as Imported, row_number() over (partition by Name, surname order by imported desc) rnum from values
('Bob','John','18-07-2022'),
 ('Marta','White','18-07-2022'),
 ('Ryan','Max','18-07-2022'),  
('Bob','John','20-07-2022'),  
('Marta','White','20-07-2022'), 
('Ryan','Max','20-07-2022'),  
('Brian','Red','20-07-2022')
  ) where rnum = 1;

CodePudding user response：

As mentioned in my comment, ideally each client would have a unique identifier. Lacking that, I am going to use name||'_'||surname as a pseudo primary key.

There's a couple of approaches you could use here

The first is using a subquery to join on the key and imported date

CREATE VIEW CLIENTS_VIEW AS
SELECT C.* FROM CLIENTS C
JOIN
    (
SELECT 
name||'_'||surname as client_name
, MAX(imported) as latest
FROM CLIENTS
GROUP BY 1
     ) MI ON MI.client_name = C.name||'_'||surname AND MI.latest = C.imported

Another would be to use a row number function as per the other answer

CREATE VIEW CLIENTS_VIEW AS
SELECT C.* FROM CLIENTS C
QUALIFY row_number() over (partition by Name, surname order by imported desc)=1

In my experience, the subquery one is more performant if the amount of data is large

There is other alternatives, for example using NOT EXISTS, joining back onto the same table or using a CTE

The most performant option for larger tables would be to create another table with the latest data for each client (again a unique identifier would be needed) and periodically use MERGE to upsert new data.

Something like this

merge into clients_latest cl using (select * from clients) as c on 
cl.name||'_'||surname = c.name||'_'||surname
when matched then update set cl.imported = c.imported
when not matched then insert (name, surname, imported) values (c.name, 
c.surname, c.imported);

If this data is changed infrequently then a semi regular scheduled task could run this for you. If the table is constantly being appended then an append only table stream might be a quicker option as you would only then be upserting the new data since the last upsert