Imagine I have a dataframe that contains minute data for different symbols:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00 00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00 00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00 00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00 00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00 00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-26 08:20:00 00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADD 2889145.1
101 2022-09-26 08:25:00 00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADD 2889145.1
102 2022-09-26 08:30:00 00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADD 2889145.1
103 2022-09-26 08:35:00 00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADD 2889145.1
104 2022-09-26 08:40:00 00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADD 2889145.1
I want to be able to filter the list, so that it only returns a single dataframe with multiple days, but that no days are repeated (like in the example above where ADA and ADD both appear for the date 2022-09-26).
How can I filter out duplicate days like this? I don't care how it's done - it could be just keeping whatever symbol appears first for a given date, like this for example:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-26 08:20:00 00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADA 2889145.1
1 2022-09-26 08:25:00 00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADA 2889145.1
2 2022-09-26 08:30:00 00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADA 2889145.1
3 2022-09-26 08:35:00 00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADA 2889145.1
4 2022-09-26 08:40:00 00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADA 2889145.1
--
100 2022-09-27 08:20:00 00:00 1.58 1.59 1.34 1.34 972 15 1.433220 ADB 2889145.1
101 2022-09-27 08:25:00 00:00 1.45 1.66 1.41 1.66 3778 25 1.551821 ADB 2889145.1
102 2022-09-27 08:30:00 00:00 1.70 1.70 1.39 1.47 13683 59 1.499826 ADB 2889145.1
103 2022-09-27 08:35:00 00:00 1.43 1.50 1.37 1.37 3627 10 1.406485 ADB 2889145.1
104 2022-09-27 08:40:00 00:00 1.40 1.44 1.40 1.44 1352 9 1.408365 ADB 2889145.1
How can I achieve this?
Update, tried drop_duplicates as suggested by Lukas, like so:
Read from db in a df:
df = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
Get the length (4769):
print(len(df))
And then:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.drop_duplicates(subset=['symbol', 'timestamp'])
print(len(df))
But it returns the same length.
How can I get my drop_duplicates to work with minute data?
CodePudding user response:
You can use pd.drop_duplicates:
df.drop_duplicates(subset=['timestamp', 'symbol'])
By default, it will take the first appearance of the combination of the values in the timestamp
and symbol
columns, but you can change this behavior.