My dataset looks like this:
main_id time_stamp
aaa 2019-05-29 08:16:05 05
aaa 2019-05-30 00:11:05 05
aaa 2020-05-30 09:15:07 05
bbb 2019-05-29 09:11:05 05
For each main_id, I want to:
a) sort the time_stamps in ascending order
b) create a new column, day, which uses the time_stamp to derive a number that describes the business day.
Business days are defined like this:
Monday 05:00 - Tuesday 01:00 => 1 business day (i.e. Monday)
Tuesday 05:00 - Wednesday 01:00 => 1 business day (i.e. Tuesday)
and so on...
The first and second rows with main_id = aaa are from the same business day, since the second row's time is before 1 am on the next day. So this is the very first business day, and the day column would have 1.
However, in the third row, the timestamp is from another business day, so we assign 2 as the day.
The end result could look something like this:
main_id time_stamp day
aaa 2019-05-29 08:16:05 05 1
aaa 2019-05-30 00:11:05 05 1
aaa 2020-05-30 09:15:07 05 2
bbb 2019-05-29 09:11:05 05 1
Day 1 would be anywhere between the first 5:00 am and the next day's 1:00 am, while day 2 would be the next business day that occurs (its 5:00 am to the following 1:00 am).
How can I achieve this?
CodePudding user response:
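- In order to sort the timestamps in ascending order, do this:
import numpy as np
import pandas as pd
# Let's say the dataframe is df
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
# sort_values returns a new frame, so assign the result back
df = df.sort_values(by=['main_id', 'time_stamp'])
- For the business days, I would do this:
# End of the first business day, like: 2019-05-30 01:00
day1 = pd.Timestamp('2019-05-30 01:00')
# Days elapsed past day1, rounded up, plus 1 so the first business day is 1.
# Note: this counts calendar gaps too, so a row a year later gets a large number, not 2.
df['day'] = np.ceil((df['time_stamp'] - day1) / pd.Timedelta(days=1)).astype(int) + 1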
CodePudding user response:
A simple method would be to subtract 5 hours, so that each business day falls on a single calendar date, then group by those dates to get the group number:
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
s = df['time_stamp'].sub(pd.Timedelta('5h'))
df['day'] = df.groupby(s.dt.date).ngroup().add(1)
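For anyone who wants to run the snippet above end to end, here is a minimal self-contained sketch on the rows from the question. It assumes the trailing " 05" in the sample timestamps can be left out (that is my assumption, not something the question states), and the names shifted and business_date are only illustrative.

```python
import pandas as pd

# Sample rows from the question; the trailing " 05" is omitted here (assumption).
df = pd.DataFrame({
    "main_id": ["aaa", "aaa", "aaa", "bbb"],
    "time_stamp": pd.to_datetime([
        "2019-05-29 08:16:05",
        "2019-05-30 00:11:05",
        "2020-05-30 09:15:07",
        "2019-05-29 09:11:05",
    ]),
})

# Shift by 5 hours so a business day (05:00 to 01:00 next day) maps to one calendar date.
shifted = df["time_stamp"] - pd.Timedelta(hours=5)
business_date = shifted.dt.date

# ngroup() numbers the groups from 0 in sorted key order; add 1 to start at 1.
df["day"] = df.groupby(business_date).ngroup().add(1)
print(df)
```

Note that this numbers business days across the whole frame, not per main_id; on this sample the two happen to coincide.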
NB. You don't actually need to sort the values beforehand; groupby sorts the group keys by default.
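A small sketch of why pre-sorting is unnecessary, using deliberately unsorted timestamps. The variable names are hypothetical, and the sort=False line is included only for contrast.

```python
import pandas as pd

# Deliberately unsorted timestamps (taken from the question's sample).
ts = pd.Series(pd.to_datetime([
    "2020-05-30 09:15:07",
    "2019-05-29 08:16:05",
    "2019-05-30 00:11:05",
]))
key = ts.sub(pd.Timedelta("5h")).dt.date

# Default sort=True: group numbers follow the sorted keys, whatever the row order.
sorted_groups = ts.groupby(key).ngroup().add(1)

# sort=False: group numbers follow the order in which keys first appear.
first_seen_groups = ts.groupby(key, sort=False).ngroup().add(1)

print(sorted_groups.tolist(), first_seen_groups.tolist())
```

With the default, the earliest business date gets 1 even though it is not the first row.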
Variant to apply per "main_id":
df['day'] = (
    df.groupby('main_id')
      .apply(lambda d: d.groupby(s.dt.date).ngroup().add(1))
      .droplevel(0)
)
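As a possible alternative to the nested groupby/apply, a dense rank of the shifted date within each main_id should give the same numbering. This is only a sketch under the same assumption as before (sample timestamps without the trailing " 05"), and business_date is an illustrative name.

```python
import pandas as pd

# Sample rows from the question; the trailing " 05" is omitted here (assumption).
df = pd.DataFrame({
    "main_id": ["aaa", "aaa", "aaa", "bbb"],
    "time_stamp": pd.to_datetime([
        "2019-05-29 08:16:05",
        "2019-05-30 00:11:05",
        "2020-05-30 09:15:07",
        "2019-05-29 09:11:05",
    ]),
})

# Shift by 5 hours and drop the time part to get one value per business day.
business_date = (df["time_stamp"] - pd.Timedelta(hours=5)).dt.normalize()

# Dense rank within each main_id: equal dates share a number, with no gaps.
df["day"] = (
    business_date.groupby(df["main_id"])
    .rank(method="dense")
    .astype(int)
)
print(df)
```

The dense rank avoids apply and keeps the original index, so no droplevel is needed.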
Output:
main_id time_stamp day
0 aaa 2019-05-29 08:16:05 05:00 1
1 aaa 2019-05-30 00:11:05 05:00 1
2 aaa 2020-05-30 09:15:07 05:00 2
3 bbb 2019-05-29 09:11:05 05:00 1