Home > Back-end >  Split single row in multiple based on date boundary
Split single row in multiple based on date boundary

Time:10-12

I have this dataframe with Start and End timestamps where Start and End occasionally land on different dates:

df = pd.DataFrame({'ID': [1, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
                  'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})

I want to split any row where Start and End crosses a date boundary into multiple records, like this:

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 00:00:00', '2022-01-29 04:19:07', '2022-01-30 00:00:00', '2022-01-31 00:00:00'],
                  'End': ['2022-01-28 11:59:59', '2022-01-29 03:29:55', '2022-01-29 11:59:59', '2022-01-31 11:59:59', '2022-01-31 08:37:59']})

Basically, if the date for Start and End are not equal, I would like to add a record for each that cleanly ends at the end or beginning of the date.

I think the correct approach is to use df.apply but I'm not sure how to add in the check condition for date being equal. I think a way to handle Start and End spanning multiple days would be to check the difference between them (in days) to know how many records to add.

CodePudding user response:

I think you could do something like this:

  • first found the difference
  • count the days of difference
  • iterate over the difference and update values or append new rows to the data
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({'ID': [1, 2],
                  'Start': ['2022-01-28 20:30:03', '2022-01-29 04:19:07'],
                  'End': ['2022-01-29 03:29:55', '2022-01-31 08:37:59']})
df["Start"]=pd.to_datetime(df["Start"])
df["End"]=pd.to_datetime(df["End"])

df["dif"] = df.End.dt.ceil('D') - df.Start
delta = timedelta(hours=23, minutes=59, seconds=59)

# iterate if there is more than a day difference
for indx, val in df[df["dif"].dt.days >=1].iterrows():
    final_end = val.End
    start = val.Start
    for i in range(val.dif.days 1):
        if i == 0:
            step_end = start.replace(second=0, hour=0, minute=0)   delta
            df.loc[indx, 'End'] = step_end # update end for entry
        else:
            step_end = start.replace(second=0, hour=0, minute=0)   delta
            start = start.replace(second=1, hour=0, minute=0)   delta
            df2 = pd.DataFrame([[val.ID, start, step_end, delta]], columns=df.columns)
            df = pd.concat([df, df2]).reset_index(drop=True)
    df.loc[len(df)-1, 'End'] = final_end  # update end for last appended row

df = df.sort_values('ID').drop("dif", axis=1)

output


    ID  Start   End
0   1   2022-01-28 20:30:03     2022-01-28 23:59:59
2   1   2022-01-29 00:00:00     2022-01-29 03:29:55
1   2   2022-01-29 04:19:07     2022-01-29 23:59:59
3   2   2022-01-30 00:00:00     2022-01-29 23:59:59
4   2   2022-01-31 00:00:00     2022-01-31 08:37:59
  • Related