Split single row in multiple based on date boundary-CodePudding

I have this dataframe with Start and End timestamps where Start and End occasionally land on different dates:

df = pd.DataFrame({'ID': [1, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
                  'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})

I want to split any row where Start and End crosses a date boundary into multiple records, like this:

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 00:00:00', '2022-01-29 04:19:07', '2022-01-30 00:00:00', '2022-01-31 00:00:00'],
                  'End': ['2022-01-28 11:59:59', '2022-01-29 03:29:55', '2022-01-29 11:59:59', '2022-01-31 11:59:59', '2022-01-31 08:37:59']})

Basically, if the date for Start and End are not equal, I would like to add a record for each that cleanly ends at the end or beginning of the date.

I think the correct approach is to use df.apply but I'm not sure how to add in the check condition for date being equal. I think a way to handle Start and End spanning multiple days would be to check the difference between them (in days) to know how many records to add.

CodePudding user response：

I think you could do something like this:

first found the difference
count the days of difference
iterate over the difference and update values or append new rows to the data

import pandas as pd
from datetime import timedelta
df = pd.DataFrame({'ID': [1, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
                  'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})
df["Start"]=pd.to_datetime(df["Start"])
df["End"]=pd.to_datetime(df["End"])

df["dif"] = df.End.dt.ceil('D') - df.Start
delta = timedelta(hours=23, minutes=59, seconds=59)

# iterate if there is more than a day difference
for indx, val in df[df["dif"].dt.days >=1].iterrows():
    final_end = val.End
    start = val.Start
    for i in range(val.dif.days 1):
        if i == 0:
            step_end = start.replace(second=0, hour=0, minute=0)   delta
            df.loc[indx, 'End'] = step_end # update end for entry
        else:
            step_end = start.replace(second=0, hour=0, minute=0)   delta
            start = start.replace(second=1, hour=0, minute=0)   delta
            df2 = pd.DataFrame([[val.ID, start, step_end, delta]], columns=df.columns)
            df = pd.concat([df, df2]).reset_index(drop=True)
    df.loc[len(df)-1, 'End'] = final_end  # update end for last appended row

df = df.sort_values('ID').drop("dif", axis=1)

output


    ID  Start   End
0   1   2022-01-28 20:30:03     2022-01-28 23:59:59
2   1   2022-01-29 00:00:00     2022-01-29 03:29:55
1   2   2022-01-29 04:19:07     2022-01-29 23:59:59
3   2   2022-01-30 00:00:00     2022-01-29 23:59:59
4   2   2022-01-31 00:00:00     2022-01-31 08:37:59