Home > Blockchain >  How to remove duplicate days with multiple tickers in a single dataframe?
How to remove duplicate days with multiple tickers in a single dataframe?

Time:11-24

Imagine I have a dataframe that contains minute data for different symbols:

                   timestamp  open  high   low  close  volume  trade_count      vwap symbol  volume_10_day
0     2022-09-26 08:20:00 00:00  1.58  1.59  1.34   1.34     972           15  1.433220    ADA      2889145.1
1     2022-09-26 08:25:00 00:00  1.45  1.66  1.41   1.66    3778           25  1.551821    ADA      2889145.1
2     2022-09-26 08:30:00 00:00  1.70  1.70  1.39   1.47   13683           59  1.499826    ADA      2889145.1
3     2022-09-26 08:35:00 00:00  1.43  1.50  1.37   1.37    3627           10  1.406485    ADA      2889145.1
4     2022-09-26 08:40:00 00:00  1.40  1.44  1.40   1.44    1352            9  1.408365    ADA      2889145.1
--
100     2022-09-26 08:20:00 00:00  1.58  1.59  1.34   1.34     972           15  1.433220    ADD      2889145.1
101     2022-09-26 08:25:00 00:00  1.45  1.66  1.41   1.66    3778           25  1.551821    ADD      2889145.1
102     2022-09-26 08:30:00 00:00  1.70  1.70  1.39   1.47   13683           59  1.499826    ADD      2889145.1
103     2022-09-26 08:35:00 00:00  1.43  1.50  1.37   1.37    3627           10  1.406485    ADD      2889145.1
104     2022-09-26 08:40:00 00:00  1.40  1.44  1.40   1.44    1352            9  1.408365    ADD      2889145.1

I want to be able to filter the list, so that it only returns a single dataframe with multiple days, but that no days are repeated (like in the example above where ADA and ADD both appear for the date 2022-09-26).

How can I filter out duplicate days like this? I don't care how it's done - it could be just keeping whatever symbol appears first for a given date, like this for example:

                      timestamp  open  high   low  close  volume  trade_count      vwap symbol  volume_10_day
0     2022-09-26 08:20:00 00:00  1.58  1.59  1.34   1.34     972           15  1.433220    ADA      2889145.1
1     2022-09-26 08:25:00 00:00  1.45  1.66  1.41   1.66    3778           25  1.551821    ADA      2889145.1
2     2022-09-26 08:30:00 00:00  1.70  1.70  1.39   1.47   13683           59  1.499826    ADA      2889145.1
3     2022-09-26 08:35:00 00:00  1.43  1.50  1.37   1.37    3627           10  1.406485    ADA      2889145.1
4     2022-09-26 08:40:00 00:00  1.40  1.44  1.40   1.44    1352            9  1.408365    ADA      2889145.1
--
100     2022-09-27 08:20:00 00:00  1.58  1.59  1.34   1.34     972           15  1.433220    ADB      2889145.1
101     2022-09-27 08:25:00 00:00  1.45  1.66  1.41   1.66    3778           25  1.551821    ADB      2889145.1
102     2022-09-27 08:30:00 00:00  1.70  1.70  1.39   1.47   13683           59  1.499826    ADB      2889145.1
103     2022-09-27 08:35:00 00:00  1.43  1.50  1.37   1.37    3627           10  1.406485    ADB      2889145.1
104     2022-09-27 08:40:00 00:00  1.40  1.44  1.40   1.44    1352            9  1.408365    ADB      2889145.1

How can I achieve this?

Update, tried drop_duplicates as suggested by Lukas, like so:

Read from db in a df:

df = pd.read_sql_query("SELECT * from ohlc_minutes", conn)

Get the length (4769):

print(len(df))

And then:

df['timestamp'] = pd.to_datetime(df['timestamp']) 
df.drop_duplicates(subset=['symbol', 'timestamp'])
print(len(df))

But it returns the same length.

How can I get my drop_duplicates to work with minute data?

CodePudding user response:

You can use pd.drop_duplicates:

df.drop_duplicates(subset=['timestamp', 'symbol'])

By default, it will take the first appearance of the combination of the values in the timestamp and symbol columns, but you can change this behavior.

  • Related