separate rows based on timestamps

My dataset looks like this:

      main_id            time_stamp
          aaa            2019-05-29 08:16:05+05
          aaa            2019-05-30 00:11:05+05
          aaa            2020-05-30 09:15:07+05
          bbb            2019-05-29 09:11:05+05

For each main_id, I want to:

a) sort the time_stamps in ascending order

b) create a new column day, which uses the time_stamp to derive a number that identifies the business day.

Business days are defined like this:

Monday 05:00 - Tuesday 01:00 => 1 business day (i.e. Monday)

Tuesday 05:00 - Wednesday 01:00 => 1 business day (i.e. Tuesday)

and so on...

The first and second rows with main_id = aaa are from the same business day, since the second row's time is before 1 am on the next day. So this is the very first business day and the day column would be 1.

However, in the third row, the timestamp is from another business day, so the day is 2.

The end result could look something like this:

      main_id        time_stamp                             day
          aaa            2019-05-29 08:16:05+05              1
          aaa            2019-05-30 00:11:05+05              1
          aaa            2020-05-30 09:15:07+05              2
          bbb            2019-05-29 09:11:05+05              1

Day 1 would be anywhere between the first 5:00 am and the next day's 1:00 am, while day 2 would be the next business day that occurs in the data (its own 5:00 am to 1:00 am window).

How can I achieve this?

CodePudding user response：

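To sort the timestamps in ascending order within each main_id, do this:

import pandas as pd

# Let's say the dataframe is df
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
# sort_values returns a new dataframe, so assign the result back
df = df.sort_values(by=['main_id', 'time_stamp'])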

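For the business days one, I would do this:

import numpy as np

# End of the first business day, e.g. 2019-05-30 01:00 (same time zone as time_stamp)
day1 = pd.Timestamp('2019-05-30 01:00', tz=df['time_stamp'].dt.tz)
# Whole days elapsed since the end of the first business day, plus 1
df['day'] = np.ceil((df['time_stamp'] - day1) / pd.Timedelta('1D')).astype(int) + 1

Note that this numbers days by elapsed time, so a gap between timestamps leaves a gap in the numbering.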

CodePudding user response：

A simple method would be to subtract 5 hours, then to group by sorted dates to get the group number:

df['time_stamp'] = pd.to_datetime(df['time_stamp'])
s = df['time_stamp'].sub(pd.Timedelta('5h'))
df['day'] = df.groupby(s.dt.date).ngroup().add(1)

NB. You don't actually need to sort the values first: groupby sorts the group keys by default, so ngroup numbers the dates in ascending order.
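To see why subtracting 5 hours works, here is a minimal sketch on the three aaa timestamps from the question (time zone offsets omitted for brevity; the names ts and business_date are only illustrative). After the shift, a business day that runs from 05:00 to 01:00 the next day falls entirely on one calendar date. Timestamps between 01:00 and 05:00, which the question leaves undefined, would land on the previous date with this shift.

```python
import pandas as pd

# Sample timestamps for main_id "aaa" from the question (offsets dropped)
ts = pd.to_datetime(pd.Series([
    "2019-05-29 08:16:05",
    "2019-05-30 00:11:05",
    "2020-05-30 09:15:07",
]))

# Shift back 5 hours so 05:00 becomes midnight, then keep only the date
business_date = (ts - pd.Timedelta(hours=5)).dt.date

# The first two rows share a date, the third gets its own
print(business_date.tolist())
```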

Variant to apply per "main_id":

df['day'] = (df.groupby('main_id')
               .apply(lambda d: d.groupby(s.dt.date).ngroup().add(1)).droplevel(0)
            )

Output:

  main_id                time_stamp  day
0     aaa 2019-05-29 08:16:05+05:00    1
1     aaa 2019-05-30 00:11:05+05:00    1
2     aaa 2020-05-30 09:15:07+05:00    2
3     bbb 2019-05-29 09:11:05+05:00    1
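
As a possible alternative to the nested groupby/apply, the per-main_id numbering could also be sketched with a dense rank. This is not the answer's method, just another way to get the same result, assuming the sample data from the question with a +05:00 offset. The names shifted and bday_key are illustrative.

```python
import pandas as pd

# Sample data from the question, assuming the offset is +05:00
df = pd.DataFrame({
    "main_id": ["aaa", "aaa", "aaa", "bbb"],
    "time_stamp": [
        "2019-05-29 08:16:05+05:00",
        "2019-05-30 00:11:05+05:00",
        "2020-05-30 09:15:07+05:00",
        "2019-05-29 09:11:05+05:00",
    ],
})
df["time_stamp"] = pd.to_datetime(df["time_stamp"])

# Shift back 5 hours so each business day maps to a single calendar date
shifted = df["time_stamp"] - pd.Timedelta(hours=5)

# Integer key such as 20190529, which sorts in chronological order
bday_key = shifted.dt.strftime("%Y%m%d").astype(int)

# Dense rank within each main_id: 1 for the earliest business day, 2 for the next, ...
df["day"] = bday_key.groupby(df["main_id"]).rank(method="dense").astype(int)

print(df)
```

The dense rank gives consecutive numbers per main_id regardless of how far apart the business days are, which matches the numbering the question asks for.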