Home > Net >  Removing duplicatated rows in PostgreSQL
Removing duplicatated rows in PostgreSQL

Time:04-02

I have thousands of lines of duplicate data in PostgreSQL database. To find out which row are duplicated, I am using this code:

SELECT "Date" FROM stockdata
group by "Date"
having count("Date")>1

This has produced again thousands of lines of date column which have more then 1 entry. How can I remove the row with the date so that just 1 entry of the duplicated item remains.

P.S I cannot use a Primary Key when entering data.

Update

As per the comment. There is no primary key. Also the Date is unique thus there cannot be 2 or more of it. df look like this:

    Date    High    Low     Open    Close   Volume  Adj Close
0   2017-04-03  893.489990  885.419983  888.000000  891.510010  3422300     891.510010
1   2017-04-04  908.539978  890.280029  891.500000  906.830017  4984700     906.830017
2   2017-04-05  923.719971  905.619995  910.820007  909.280029  7508400     909.280029
3   2017-04-06  917.190002  894.489990  913.799988  898.280029  6344100     898.280029
4   2017-04-07  900.090027  889.309998  899.650024  894.880005  3710900     894.880005
...     ...     ...     ...     ...     ...     ...     ...
12595   2022-03-28  1097.880005     1053.599976     1065.099976     1091.839966     34168700    1091.839966
12596   2022-03-29  1114.770020     1073.109985     1107.989990     1099.569946     24538300    1099.569946
12597   2022-03-30  1113.949951     1084.000000     1091.170044     1093.989990     19955000    1093.989990
12598   2022-03-31  1103.140015     1076.640015     1094.569946     1077.599976     16265600    1077.599976
12599   2022-04-01  1094.750000     1066.640015     1081.150024     1076.352783     11449987    1076.352783

12600 rows × 7 columns
  1. The data is repeated a few times at places.
  2. However the rows with the same date with have the same data.
  3. This data is not a stock data (i am using it as a troubleshoot example) but from yokogawa datalogger. https://www.yokogawa.com/in/solutions/products-platforms/data-acquisition/data-logger/#Overview There are redundancies in the system and the earlier integrator had just dumped all the data on 1 database and thus if redundant logger comes online, the database has multiple entries. I need to remove it so we can actually use the data. I don't have access to their software.

Further Update:

Using this code as suggested in the comments:

delete from stockdata s
using 
(SELECT "Date" , max(ctid) as max_ctid from stockdata group by "Date") t
where s.ctid<>t.max_ctid
and s."Date"=t."Date";

It was able to do the job but going forward, is this dangerous solution for production?

CodePudding user response:

This should do the trick:

 DELETE FROM
     stockdata a
         USING stockdata b
 WHERE
     a.id < b.id
     AND a.Date = b.Date;

But be careful, this will immediately delete all duplicates. There is no way to restore them.

  • Related