I have this dataframe with Start
and End
timestamps where Start
and End
occasionally land on different dates:
df = pd.DataFrame({'ID': [1, 2],
'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})
I want to split any row where Start
and End
crosses a date boundary into multiple records, like this:
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
'Start': ['2022-01-28 20:30:03', '2022-01-29 00:00:00', '2022-01-29 04:19:07', '2022-01-30 00:00:00', '2022-01-31 00:00:00'],
'End': ['2022-01-28 11:59:59', '2022-01-29 03:29:55', '2022-01-29 11:59:59', '2022-01-31 11:59:59', '2022-01-31 08:37:59']})
Basically, if the date for Start
and End
are not equal, I would like to add a record for each that cleanly ends at the end or beginning of the date.
I think the correct approach is to use df.apply
but I'm not sure how to add in the check condition for date being equal. I think a way to handle Start
and End
spanning multiple days would be to check the difference between them (in days) to know how many records to add.
CodePudding user response:
I think you could do something like this:
- first found the difference
- count the days of difference
- iterate over the difference and update values or append new rows to the data
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({'ID': [1, 2],
'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})
df["Start"]=pd.to_datetime(df["Start"])
df["End"]=pd.to_datetime(df["End"])
df["dif"] = df.End.dt.ceil('D') - df.Start
delta = timedelta(hours=23, minutes=59, seconds=59)
# iterate if there is more than a day difference
for indx, val in df[df["dif"].dt.days >=1].iterrows():
final_end = val.End
start = val.Start
for i in range(val.dif.days 1):
if i == 0:
step_end = start.replace(second=0, hour=0, minute=0) delta
df.loc[indx, 'End'] = step_end # update end for entry
else:
step_end = start.replace(second=0, hour=0, minute=0) delta
start = start.replace(second=1, hour=0, minute=0) delta
df2 = pd.DataFrame([[val.ID, start, step_end, delta]], columns=df.columns)
df = pd.concat([df, df2]).reset_index(drop=True)
df.loc[len(df)-1, 'End'] = final_end # update end for last appended row
df = df.sort_values('ID').drop("dif", axis=1)
output
ID Start End
0 1 2022-01-28 20:30:03 2022-01-28 23:59:59
2 1 2022-01-29 00:00:00 2022-01-29 03:29:55
1 2 2022-01-29 04:19:07 2022-01-29 23:59:59
3 2 2022-01-30 00:00:00 2022-01-29 23:59:59
4 2 2022-01-31 00:00:00 2022-01-31 08:37:59