Home > Software engineering >  How to group ohlc minute data by day?
How to group ohlc minute data by day?

Time:10-29

I have a bunch of minute data like so:

                            index                  timestamp    open    high    low  ...  trade_count       vwap  symbol  cum_volume cum_volume_25k
0       2021-11-02 13:30:00 00:00  2021-11-02 13:30:00 00:00  12.000  12.055  12.00  ...           62  12.003693    AMCR       47208              0
1       2021-11-02 13:31:00 00:00  2021-11-02 13:31:00 00:00  12.040  12.070  12.04  ...          117  12.055675    AMCR       67656              0
2       2021-11-02 13:32:00 00:00  2021-11-02 13:32:00 00:00  12.050  12.050  12.03  ...           64  12.040067    AMCR       74411              0
3       2021-11-02 13:33:00 00:00  2021-11-02 13:33:00 00:00  12.035  12.035  12.01  ...           95  12.021537    AMCR       83597              0
4       2021-11-02 13:34:00 00:00  2021-11-02 13:34:00 00:00  12.035  12.035  12.02  ...           16  12.028696    AMCR       84855              0
...                           ...                        ...     ...     ...    ...  ...          ...        ...     ...         ...            ...
429000  2022-10-21 19:59:00 00:00  2022-10-21 19:59:00 00:00  29.850  29.940  29.85  ...          298  29.894083    YETI      688656              0
429001  2022-10-21 20:00:00 00:00  2022-10-21 20:00:00 00:00  29.890  29.890  29.89  ...           15  29.890000    YETI      776882              0
429002  2022-10-21 20:08:00 00:00  2022-10-21 20:08:00 00:00  29.890  29.890  29.89  ...            1  29.890000    YETI      781148              0
429003  2022-10-21 22:06:00 00:00  2022-10-21 22:06:00 00:00  30.060  30.060  30.06  ...            3  30.060000    YETI      782348              0
429004  2022-10-21 22:44:00 00:00  2022-10-21 22:44:00 00:00  30.060  30.060  30.06  ...            1  30.060000    YETI      783148              0

[429005 rows x 12 columns]

I'm then groping the minutes by day:

df['timestamp'] = pd.to_datetime(df['timestamp'])
days = df.groupby(df['timestamp'].dt.normalize().unique())

I then loop over the days:

list_of_df = []
for index, name in days:
    list_of_df.append(name)
final = pd.concat(list_of_df)
print(final[['symbol', 'timestamp']])

Which returns:

    symbol                 timestamp
95    AMCR 2021-11-02 15:05:00 00:00
186   AMCR 2021-11-02 16:37:00 00:00
187   AMCR 2021-11-02 16:38:00 00:00
0     AMCR 2021-11-02 13:30:00 00:00
212   AMCR 2021-11-02 17:03:00 00:00
..     ...                       ...
210   AMCR 2021-11-02 17:01:00 00:00
211   AMCR 2021-11-02 17:02:00 00:00
138   AMCR 2021-11-02 15:49:00 00:00
192   AMCR 2021-11-02 16:43:00 00:00
58    AMCR 2021-11-02 14:28:00 00:00

[230 rows x 2 columns]

How can I make it so it only returns the unique days? e.g

    symbol                 timestamp
95    AMCR 2021-11-02 00:00:00 00:00
97    APPL 2021-11-02 00:00:00 00:00
100   APPL 2021-11-03 00:00:00 00:00
102   APPL 2021-11-06 00:00:00 00:00

[230 rows x 2 columns]

CodePudding user response:

Given (I changed timestamps for more variety):

                   timestamp    open    high    low  symbol
0  2021-11-02 13:30:00 00:00  12.000  12.055  12.00    AMCR
1  2021-11-02 13:31:00 00:00  12.040  12.070  12.04    AMCR
2  2021-11-08 13:32:00 00:00  12.050  12.050  12.03    YETI
3  2021-11-03 13:33:00 00:00  12.035  12.035  12.01    AMCR
4  2022-10-21 19:59:00 00:00  12.035  12.035  12.02    YETI

Doing, assuming you want dates per symbol:

out = (df[['symbol', 'timestamp']]
         .assign(timestamp=df.timestamp.dt.normalize())
         .drop_duplicates())
print(out)

Output:

  symbol                 timestamp
0   AMCR 2021-11-02 00:00:00 00:00
2   YETI 2021-11-08 00:00:00 00:00
3   AMCR 2021-11-03 00:00:00 00:00
4   YETI 2022-10-21 00:00:00 00:00
  • Related