Postgresql how to remove duplicated rows by columns value from a table without index or id-CodePudding

I'm using Postgresql,and I have a table named stock,this table has no index or id:

open	high	low	close	volume	datetime
383.97	384.22	383.66	384.08	1298649	2022-12-16 14:25:00
383.59	384.065	383.45	383.98	991327	2022-12-16 14:20:00
383.59	384.065	383.45	383.98	991327	2022-12-16 14:20:00
383.59	384.065	383.45	383.98	991327	2022-12-16 14:20:00
383.64	384.2099	383.54	383.61	1439271	2022-12-16 14:15:00

How can I remove the rows that have duplicated datetime ,and only keep 1 row of it,only keep the latest row of it,by using Postgresql sql ?

The output should be:

open	high	low	close	volume	datetime
383.97	384.22	383.66	384.08	1298649	2022-12-16 14:25:00
383.59	384.065	383.45	383.98	991327	2022-12-16 14:20:00
383.64	384.2099	383.54	383.61	1439271	2022-12-16 14:15:00

Something like:

delete from stock where datetime duplicated > 1

CodePudding user response：

The issue with this operation is that you can't uniquely identify your rows. One possible option to solve this problem is to:

store one row of each of your duplicate records in a temporary table
delete all records having duplicates from the original table
adding duplicate records back from the temporary table to the original table

These steps can be condensed in the following three queries:

CREATE TEMPORARY TABLE deleted_values AS 
SELECT *
FROM tab
GROUP BY open, high, low, close, volume, datetime
HAVING COUNT(*) > 1;

DELETE FROM tab
USING deleted_values
WHERE tab.open = deleted_values.open
  AND tab.high = deleted_values.high
  AND tab.low = deleted_values.low
  AND tab.close = deleted_values.close
  AND tab.datetime = deleted_values.datetime;

INSERT INTO tab 
SELECT * FROM deleted_values;

Check the demo here.

CodePudding user response：

As your table does not contain the primary key you'll have to use ctid

This query reports the duplication "index" - all rows with rn > 1 are duplicates

select ctid, tab.*,
row_number() over (partition by datatime order by open) rn
from tab;

ctid |open|datatime  |rn|
----- ---- ---------- -- 
(0,1)| 300|2022-12-16| 1|
(0,2)| 310|2022-12-16| 2|
(0,3)| 400|2022-12-17| 1|
(0,4)| 500|2022-12-18| 1|
(0,5)| 500|2022-12-18| 2|

Note that you set the partition by to your unique key and with order by you can controll which row will be preserved.

Than you use the ctidof the duplicated row to get rid of them

with t as (
select ctid, tab.*,
row_number() over (partition by datatime order by open) rn
from tab
)
delete from tab where ctid in (
   select ctid from t where rn > 1
);
-- 2 rows deleted

Sample (simplified) data

create table tab as 
select * from (values
(300, DATE'2022-12-16'),
(310, DATE'2022-12-16'),
(400, DATE'2022-12-17'),
(500, DATE'2022-12-18'),
(500, DATE'2022-12-18')
) tab(open, datatime)
;