I'm using Postgresql,and I have a table named stock,this table has no index or id:
open | high | low | close | volume | datetime |
---|---|---|---|---|---|
383.97 | 384.22 | 383.66 | 384.08 | 1298649 | 2022-12-16 14:25:00 |
383.59 | 384.065 | 383.45 | 383.98 | 991327 | 2022-12-16 14:20:00 |
383.59 | 384.065 | 383.45 | 383.98 | 991327 | 2022-12-16 14:20:00 |
383.59 | 384.065 | 383.45 | 383.98 | 991327 | 2022-12-16 14:20:00 |
383.64 | 384.2099 | 383.54 | 383.61 | 1439271 | 2022-12-16 14:15:00 |
How can I remove the rows that have duplicated datetime ,and only keep 1 row of it,only keep the latest row of it,by using Postgresql sql ?
The output should be:
open | high | low | close | volume | datetime |
---|---|---|---|---|---|
383.97 | 384.22 | 383.66 | 384.08 | 1298649 | 2022-12-16 14:25:00 |
383.59 | 384.065 | 383.45 | 383.98 | 991327 | 2022-12-16 14:20:00 |
383.64 | 384.2099 | 383.54 | 383.61 | 1439271 | 2022-12-16 14:15:00 |
Something like:
delete from stock where datetime duplicated > 1
CodePudding user response:
The issue with this operation is that you can't uniquely identify your rows. One possible option to solve this problem is to:
- store one row of each of your duplicate records in a temporary table
- delete all records having duplicates from the original table
- adding duplicate records back from the temporary table to the original table
These steps can be condensed in the following three queries:
CREATE TEMPORARY TABLE deleted_values AS
SELECT *
FROM tab
GROUP BY open, high, low, close, volume, datetime
HAVING COUNT(*) > 1;
DELETE FROM tab
USING deleted_values
WHERE tab.open = deleted_values.open
AND tab.high = deleted_values.high
AND tab.low = deleted_values.low
AND tab.close = deleted_values.close
AND tab.datetime = deleted_values.datetime;
INSERT INTO tab
SELECT * FROM deleted_values;
Check the demo here.
CodePudding user response:
As your table does not contain the primary key you'll have to use ctid
This query reports the duplication "index" - all rows with rn > 1
are duplicates
select ctid, tab.*,
row_number() over (partition by datatime order by open) rn
from tab;
ctid |open|datatime |rn|
----- ---- ---------- --
(0,1)| 300|2022-12-16| 1|
(0,2)| 310|2022-12-16| 2|
(0,3)| 400|2022-12-17| 1|
(0,4)| 500|2022-12-18| 1|
(0,5)| 500|2022-12-18| 2|
Note that you set the partition by
to your unique key and with order by
you can controll which row will be preserved.
Than you use the ctid
of the duplicated row to get rid of them
with t as (
select ctid, tab.*,
row_number() over (partition by datatime order by open) rn
from tab
)
delete from tab where ctid in (
select ctid from t where rn > 1
);
-- 2 rows deleted
Sample (simplified) data
create table tab as
select * from (values
(300, DATE'2022-12-16'),
(310, DATE'2022-12-16'),
(400, DATE'2022-12-17'),
(500, DATE'2022-12-18'),
(500, DATE'2022-12-18')
) tab(open, datatime)
;