Home > Software engineering >  Removing dublicate on PostgreSQL based on 2 columns
Removing dublicate on PostgreSQL based on 2 columns

Time:04-03

I want to drop entries based on 2 columns in Postgres. The code for to get the data is:

import pandas as pd
from sqlalchemy import create_engine
from pandas_datareader import data as web
import yfinance as yf
import datetime
from dateutil.relativedelta import relativedelta

engine = create_engine('postgresql://postgres:{}@localhost:5433/stockdata'.format(123))
tickers = ['amzn', 'fb']
for ticker in tickers:
    df = yf.download(tickers=ticker, start=datetime.datetime.now()-relativedelta(days=7), end= datetime.datetime.now(), interval="5m", progress = False)
    df['symbol'] = ticker
    df.to_sql('stockdata', con=engine, if_exists='append', index=True)

This will produce a table which looks like this but running the code multiple times will create duplicates:

    Datetime    Open    High    Low     Close   Adj Close   Volume  symbol
0   2022-03-28 13:30:00 00:00   3299.500000     3318.000000     3298.000000     3311.023438     3311.023438     133898  amzn
1   2022-03-28 13:35:00 00:00   3313.000000     3344.479980     3313.000000     3339.060059     3339.060059     104565  amzn
2   2022-03-28 13:40:00 00:00   3348.500000     3365.989990     3342.344482     3354.570068     3354.570068     126489  amzn
3   2022-03-28 13:45:00 00:00   3356.000000     3356.000000     3340.000000     3340.000000     3340.000000     70492   amzn
4   2022-03-28 13:50:00 00:00   3338.500000     3348.669922     3337.385010     3347.584961     3347.584961     64041   amzn
...     ...     ...     ...     ...     ...     ...     ...     ...
1559    2022-04-01 19:40:00 00:00   223.380005  223.660004  223.039993  223.110001  223.110001  258564  fb
1560    2022-04-01 19:45:00 00:00   223.080002  223.589996  223.059998  223.570007  223.570007  246820  fb
1561    2022-04-01 19:50:00 00:00   223.600006  224.610001  223.600006  224.485001  224.485001  469905  fb
1562    2022-04-01 19:55:00 00:00   224.460007  224.940002  224.460007  224.869995  224.869995  651629  fb
1563    2022-04-01 20:00:00 00:00   224.850006  224.850006  224.850006  224.850006  224.850006  0   fb

1564 rows × 8 columns

There is no primary key in the database(I don't have any control over that). Thus the objective is to make sure that Datetime is unique to the Symbol. For example FB can only have 1 instance of unique Datetime entry. The data is on a 5m resolution. How can I group by on Datetime and Symbol and remove duplicate entry based on column Datetime

I was trying to do this earlier:

query = """delete from stockdata s
using 
(SELECT "Datetime" , max(ctid) as max_ctid from stockdata group by "Datetime") t
where s.ctid<>t.max_ctid
and s."Datetime"=t."Datetime";"""

But this was looking at just Datetime duplicates and not based on grouping Symbol symbol column as well.

To get the a list of duplicate:

SELECT "symbol", "Datetime", COUNT(*) AS CNT
FROM stockdata
GROUP BY "symbol", "Datetime"
HAVING COUNT(*) > 1;

How can I remove all duplicates and keep 1. Any help would be appreciated!

CodePudding user response:

Revised query. Test before using in production:

query = """delete from stockdata s
using 
(SELECT "symbol", "Datetime" , max(ctid) as max_ctid from stockdata group by "symbol","Datetime") t
where s.ctid <> t.max_ctid
and s."Datetime" = t."Datetime"
and s."Symbol" = t."Symbol";"""

Also from my comments on previous question, it would be better to add a UNIQUE constraint on the receiving table to prevent duplicates from be entered in the first place.

  • Related