How can I split the difference between two timestamps that contain more than one hour in a Pandas Da-CodePudding

I have a dataframe with 2 timestamp columns ('start' and 'end'). I want to explode the rows of the dataframe such that I can split the timestamp by hours.

Here is an example:

import pandas as pd

df = pd.DataFrame({'id_kanban': [244],'component': ['A'],'start': ['2021-02-02 11:03:18'], 'end': ['2021-02-02 13:33:28']}) 
print(df)

The output I´m getting is:

df1 = pd.DataFrame({
'id_kanban': [244, 244, 244, 244],
'component': ['A', 'A', 'A', 'A'],
'start': ['2021-02-02 11:03:18', '2021-02-02 01:00:00', 
          '2021-02-02 01:00:00', '2021-02-02 01:00:00'],
'end': ['2021-02-02 01:00:00', '2021-02-02 01:00:00', 
        '2021-02-02 01:00:00','2021-02-02 13:33:28']})

However, the output I want to get is:

df2 = pd.DataFrame({
'id_kanban': [244, 244, 244],
'component': ['A', 'A', 'A'],
'start': ['2021-02-02 11:03:18', '2021-02-02 12:00:00', 
          '2021-02-02 13:00:00'],
'end': ['2021-02-02 12:00:00', '2021-02-02 13:00:00', 
        '2021-02-02 13:33:28']})

The code I'm using is similar to the one found in a similar question: How can I split the difference between two timestamps that contain more than one month in a Pandas DataFrame

This is the code I´m trying:

def find_interval(sr):
    dti = pd.date_range(sr['start'], sr['end'], freq='H').normalize() \
            pd.Timedelta(hours=1)
    return list(zip([sr['start']]   dti.tolist(), dti.tolist()   [sr['end']]))

df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)

df1 = df.drop(columns=['start', 'end']) \
    .join(df1).rename(columns={0: 'start', 1: 'end'})

Any ideas on how to output the correct dataframe? Thanks

CodePudding user response：

Try as follows:

# floor start time, ceil end time, and set `inclusive` to "neither"
def find_interval(sr):
    dti = pd.date_range(sr['start'].floor('H'), sr['end'].ceil('H'), freq='H', 
                        inclusive='neither')
    return list(zip([sr['start']]   dti.tolist(), dti.tolist()   [sr['end']]))

# turn your strings into timestamps
df[['start','end']] = df[['start','end']].apply(pd.to_datetime)

df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)

# add reset_index here
df1 = df.drop(columns=['start', 'end']) \
    .join(df1).rename(columns={0: 'start', 1: 'end'}).reset_index(drop=True)

print(df1)

   id_kanban component               start                 end
0        244         A 2021-02-02 11:03:18 2021-02-02 12:00:00
1        244         A 2021-02-02 12:00:00 2021-02-02 13:00:00
2        244         A 2021-02-02 13:00:00 2021-02-02 13:33:28

CodePudding user response：

No reason not to use @ouroboros1 answer but as an alternative approach, which may be a bit more intuitive, here is a solution using piso - a library for working with pandas intervals which has a function to split intervals at given time points.

setup

import pandas as pd
import piso

df = pd.DataFrame({'id_kanban': [244],'component': ['A'],'start': ['2021-02-02 11:03:18'], 'end': ['2021-02-02 13:33:28']})
df[["start", "end"]] =df[["start", "end"]].apply(pd.to_datetime)

solution

Given the addition of id_kanban and component columns I'm assuming you have a larger dataset and may be wanting to group on these columns and split on the hour, so this is the approach given below.

def break_hourly(df):
    
    # create the hourly breakpoints
    hour_timestamps = pd.date_range(
        df["start"].min().ceil("H"),
        df["end"].max().floor("H"),
        freq="H",
    )

    # create pandas.IntervalIndex from time columns and split at breakpoints, resulting in another interval index
    ii = piso.split(
        pd.IntervalIndex.from_arrays(df["start"], df["end"]),
        hour_timestamps
    )

    # create dataframe from new interval index
    return pd.DataFrame({
        "start":ii.left,
        "end":ii.right,
    })


result = (
    df.groupby(["id_kanban", "component"])
    .apply(break_hourly)
    .droplevel(-1)  # remove redundant index level introduced by groupby
    .reset_index()  # index -> columns
)

Disclaimer: I am the author of piso.