Measure different between timestamps using conditions

I'm trying to measure the difference between timestamps using certain conditions. Using below, for each unique ID, I'm hoping to subtract the End Time where Item == B and the Start Time where Item == C.

So the timestamps are actually located on separate rows.

At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust as each unique ID will have different combinations. For ex, A,B,C - A,B - B,C etc.

df = pd.DataFrame({'ID': [10,10,10,20,20,30],
                   'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-04 15:00:00','2019-08-04 08:00:00','2019-08-05 10:30:00','2019-08-06 11:00:00'], 
                   'End Time': ['2019-08-02 09:30:00','2019-08-03 11:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'], 
                   'Item': ['A','B','C','B','C','A'],
                    })

df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                .fillna('-')
                .reset_index(drop=True))

Error:

TypeError: value should be a 'Timedelta', 'NaT', or array of those. Got 'str' instead.

Intended Output:

   ID          Start Time            End Time Item diff
0  10 2019-08-02 09:00:00 2019-08-02 09:30:00    A  NaN
1  10 2019-08-03 10:50:00 2019-08-03 11:00:00    B  
2  10 2019-08-04 15:00:00 2019-08-05 16:00:00    C  '16hrs'
3  20 2019-08-04 08:00:00 2019-08-04 14:00:00    B
4  20 2019-08-05 10:30:00 2019-08-05 20:30:00    C  '20.5hrs'
5  30 2019-08-06 11:00:00 2019-08-07 10:00:00    A

CodePudding user response：

I think you can avoid apply with use DataFrameGroupBy.shift:

df1 = df.groupby('ID')[['End Time', 'Start Time']].shift()
df['diff'] = df1['End Time'].sub(df1['Start Time'])

print (df)
   ID          Start Time            End Time Item            diff
0  10 2019-08-02 09:00:00 2019-08-02 09:30:00    A             NaT
1  10 2019-08-03 10:50:00 2019-08-03 11:00:00    B 0 days 00:30:00
2  10 2019-08-04 15:00:00 2019-08-05 16:00:00    C 0 days 00:10:00
3  20 2019-08-04 08:00:00 2019-08-04 14:00:00    B             NaT
4  20 2019-08-05 10:30:00 2019-08-05 20:30:00    C 0 days 06:00:00
5  30 2019-08-06 11:00:00 2019-08-07 10:00:00    A             NaT

CodePudding user response：

The issue is your fillna, you can't have strings in a timedelta column:

df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                #.fillna('-')  # the issue is here
                .reset_index(drop=True))

output:

   ID          Start Time            End Time Item            diff
0  10 2019-08-02 09:00:00 2019-08-02 09:30:00    A             NaT
1  10 2019-08-03 10:50:00 2019-08-03 11:00:00    B 0 days 00:30:00
2  10 2019-08-04 15:00:00 2019-08-05 16:00:00    C 0 days 00:10:00
3  20 2019-08-04 08:00:00 2019-08-04 14:00:00    B             NaT
4  20 2019-08-05 10:30:00 2019-08-05 20:30:00    C 0 days 06:00:00
5  30 2019-08-06 11:00:00 2019-08-07 10:00:00    A             NaT