I have a bunch of minute data like so:
index timestamp open high low ... trade_count vwap symbol cum_volume cum_volume_25k
0 2021-11-02 13:30:00 00:00 2021-11-02 13:30:00 00:00 12.000 12.055 12.00 ... 62 12.003693 AMCR 47208 0
1 2021-11-02 13:31:00 00:00 2021-11-02 13:31:00 00:00 12.040 12.070 12.04 ... 117 12.055675 AMCR 67656 0
2 2021-11-02 13:32:00 00:00 2021-11-02 13:32:00 00:00 12.050 12.050 12.03 ... 64 12.040067 AMCR 74411 0
3 2021-11-02 13:33:00 00:00 2021-11-02 13:33:00 00:00 12.035 12.035 12.01 ... 95 12.021537 AMCR 83597 0
4 2021-11-02 13:34:00 00:00 2021-11-02 13:34:00 00:00 12.035 12.035 12.02 ... 16 12.028696 AMCR 84855 0
... ... ... ... ... ... ... ... ... ... ... ...
429000 2022-10-21 19:59:00 00:00 2022-10-21 19:59:00 00:00 29.850 29.940 29.85 ... 298 29.894083 YETI 688656 0
429001 2022-10-21 20:00:00 00:00 2022-10-21 20:00:00 00:00 29.890 29.890 29.89 ... 15 29.890000 YETI 776882 0
429002 2022-10-21 20:08:00 00:00 2022-10-21 20:08:00 00:00 29.890 29.890 29.89 ... 1 29.890000 YETI 781148 0
429003 2022-10-21 22:06:00 00:00 2022-10-21 22:06:00 00:00 30.060 30.060 30.06 ... 3 30.060000 YETI 782348 0
429004 2022-10-21 22:44:00 00:00 2022-10-21 22:44:00 00:00 30.060 30.060 30.06 ... 1 30.060000 YETI 783148 0
[429005 rows x 12 columns]
I'm then groping the minutes by day:
df['timestamp'] = pd.to_datetime(df['timestamp'])
days = df.groupby(df['timestamp'].dt.normalize().unique())
I then loop over the days:
list_of_df = []
for index, name in days:
list_of_df.append(name)
final = pd.concat(list_of_df)
print(final[['symbol', 'timestamp']])
Which returns:
symbol timestamp
95 AMCR 2021-11-02 15:05:00 00:00
186 AMCR 2021-11-02 16:37:00 00:00
187 AMCR 2021-11-02 16:38:00 00:00
0 AMCR 2021-11-02 13:30:00 00:00
212 AMCR 2021-11-02 17:03:00 00:00
.. ... ...
210 AMCR 2021-11-02 17:01:00 00:00
211 AMCR 2021-11-02 17:02:00 00:00
138 AMCR 2021-11-02 15:49:00 00:00
192 AMCR 2021-11-02 16:43:00 00:00
58 AMCR 2021-11-02 14:28:00 00:00
[230 rows x 2 columns]
How can I make it so it only returns the unique days? e.g
symbol timestamp
95 AMCR 2021-11-02 00:00:00 00:00
97 APPL 2021-11-02 00:00:00 00:00
100 APPL 2021-11-03 00:00:00 00:00
102 APPL 2021-11-06 00:00:00 00:00
[230 rows x 2 columns]
CodePudding user response:
Given (I changed timestamps for more variety):
timestamp open high low symbol
0 2021-11-02 13:30:00 00:00 12.000 12.055 12.00 AMCR
1 2021-11-02 13:31:00 00:00 12.040 12.070 12.04 AMCR
2 2021-11-08 13:32:00 00:00 12.050 12.050 12.03 YETI
3 2021-11-03 13:33:00 00:00 12.035 12.035 12.01 AMCR
4 2022-10-21 19:59:00 00:00 12.035 12.035 12.02 YETI
Doing, assuming you want dates per symbol:
out = (df[['symbol', 'timestamp']]
.assign(timestamp=df.timestamp.dt.normalize())
.drop_duplicates())
print(out)
Output:
symbol timestamp
0 AMCR 2021-11-02 00:00:00 00:00
2 YETI 2021-11-08 00:00:00 00:00
3 AMCR 2021-11-03 00:00:00 00:00
4 YETI 2022-10-21 00:00:00 00:00