I want to drop duplicate entries based on 2 columns in Postgres. The code to get the data is:
from sqlalchemy import create_engine
import yfinance as yf
import datetime
from dateutil.relativedelta import relativedelta

engine = create_engine('postgresql://postgres:{}@localhost:5433/stockdata'.format(123))
tickers = ['amzn', 'fb']
for ticker in tickers:
    df = yf.download(tickers=ticker, start=datetime.datetime.now() - relativedelta(days=7),
                     end=datetime.datetime.now(), interval="5m", progress=False)
    df['symbol'] = ticker
    df.to_sql('stockdata', con=engine, if_exists='append', index=True)
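Because `if_exists='append'` blindly appends, re-running the loop inserts the same rows again. Within a single DataFrame the (Datetime, symbol) pair can be deduplicated with pandas before writing; this is only a sketch with made-up sample values, and it does not remove rows already sitting in the table from earlier runs:

```python
import pandas as pd

# Toy frame mimicking the yfinance output after df['symbol'] = ticker;
# the timestamps and prices here are invented for illustration.
df = pd.DataFrame({
    "Datetime": ["2022-03-28 13:30:00+00:00", "2022-03-28 13:30:00+00:00",
                 "2022-03-28 13:35:00+00:00"],
    "Close": [3311.02, 3311.02, 3339.06],
    "symbol": ["amzn", "amzn", "amzn"],
})

# Keep one row per (Datetime, symbol) pair
deduped = df.drop_duplicates(subset=["Datetime", "symbol"], keep="last")
print(len(deduped))
```

Rows duplicated across separate runs still have to be cleaned up on the database side.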
This will produce a table which looks like this but running the code multiple times will create duplicates:
Datetime Open High Low Close Adj Close Volume symbol
0 2022-03-28 13:30:00+00:00 3299.500000 3318.000000 3298.000000 3311.023438 3311.023438 133898 amzn
1 2022-03-28 13:35:00+00:00 3313.000000 3344.479980 3313.000000 3339.060059 3339.060059 104565 amzn
2 2022-03-28 13:40:00+00:00 3348.500000 3365.989990 3342.344482 3354.570068 3354.570068 126489 amzn
3 2022-03-28 13:45:00+00:00 3356.000000 3356.000000 3340.000000 3340.000000 3340.000000 70492 amzn
4 2022-03-28 13:50:00+00:00 3338.500000 3348.669922 3337.385010 3347.584961 3347.584961 64041 amzn
... ... ... ... ... ... ... ... ...
1559 2022-04-01 19:40:00+00:00 223.380005 223.660004 223.039993 223.110001 223.110001 258564 fb
1560 2022-04-01 19:45:00+00:00 223.080002 223.589996 223.059998 223.570007 223.570007 246820 fb
1561 2022-04-01 19:50:00+00:00 223.600006 224.610001 223.600006 224.485001 224.485001 469905 fb
1562 2022-04-01 19:55:00+00:00 224.460007 224.940002 224.460007 224.869995 224.869995 651629 fb
1563 2022-04-01 20:00:00+00:00 224.850006 224.850006 224.850006 224.850006 224.850006 0 fb
1564 rows × 8 columns
There is no primary key in the database (I don't have any control over that). Thus the objective is to make sure that Datetime is unique per symbol. For example, fb can only have one entry for each unique Datetime. The data is at a 5m resolution. How can I group by Datetime and symbol and remove duplicate entries based on the Datetime column?
I was trying to do this earlier:
query = """delete from stockdata s
using
(SELECT "Datetime" , max(ctid) as max_ctid from stockdata group by "Datetime") t
where s.ctid<>t.max_ctid
and s."Datetime"=t."Datetime";"""
But this only looked at Datetime duplicates and did not group on the symbol column as well.
To get a list of duplicates:
SELECT "symbol", "Datetime", COUNT(*) AS CNT
FROM stockdata
GROUP BY "symbol", "Datetime"
HAVING COUNT(*) > 1;
How can I remove all duplicates and keep one? Any help would be appreciated!
CodePudding user response:
Revised query. Test before using in production:
query = """delete from stockdata s
using
(SELECT "symbol", "Datetime", max(ctid) as max_ctid from stockdata group by "symbol", "Datetime") t
where s.ctid <> t.max_ctid
and s."Datetime" = t."Datetime"
and s."symbol" = t."symbol";"""
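The keep-the-last-physical-row logic can be sanity-checked in a sandbox. SQLite's rowid plays the role Postgres's ctid plays above (so this is an analogue, not the exact Postgres query), and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE stockdata ("Datetime" TEXT, "Close" REAL, symbol TEXT)')
rows = [
    ("2022-03-28 13:30:00+00:00", 3311.02, "amzn"),
    ("2022-03-28 13:30:00+00:00", 3311.02, "amzn"),  # duplicate (Datetime, symbol)
    ("2022-03-28 13:30:00+00:00", 223.11, "fb"),
]
conn.executemany("INSERT INTO stockdata VALUES (?, ?, ?)", rows)

# rowid stands in for Postgres's ctid: keep only the max rowid per (symbol, Datetime)
conn.execute("""
    DELETE FROM stockdata
    WHERE rowid NOT IN (
        SELECT max(rowid) FROM stockdata GROUP BY symbol, "Datetime"
    )
""")
remaining = conn.execute("SELECT count(*) FROM stockdata").fetchone()[0]
print(remaining)  # one amzn row deleted, one amzn + one fb row kept
```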
Also, from my comments on the previous question, it would be better to add a UNIQUE constraint on the receiving table to prevent duplicates from being entered in the first place.
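In Postgres that would be something like `ALTER TABLE stockdata ADD CONSTRAINT stockdata_uq UNIQUE ("symbol", "Datetime")`, after which duplicate inserts can be skipped with `INSERT ... ON CONFLICT DO NOTHING`. A small SQLite stand-in (constraint name, columns, and sample row are illustrative) shows the effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on (symbol, "Datetime") rejects repeat rows
conn.execute(
    'CREATE TABLE stockdata ("Datetime" TEXT, "Close" REAL, symbol TEXT, '
    'UNIQUE (symbol, "Datetime"))'
)
row = ("2022-03-28 13:30:00+00:00", 3311.02, "amzn")
conn.execute("INSERT INTO stockdata VALUES (?, ?, ?)", row)
# A second identical insert would raise; OR IGNORE skips it
# (Postgres analogue: INSERT ... ON CONFLICT DO NOTHING)
conn.execute("INSERT OR IGNORE INTO stockdata VALUES (?, ?, ?)", row)
count = conn.execute("SELECT count(*) FROM stockdata").fetchone()[0]
print(count)  # 1
```

Note that `to_sql(if_exists='append')` would raise on a constraint violation rather than skip it, so the insert path would need adjusting as well.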