I'm trying to measure the difference between timestamps using certain conditions. Using below, for each unique ID
, I'm hoping to subtract the End Time
where Item == B
and the Start Time
where Item == C
.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift()
for something more robust as each unique ID
will have different combinations. For ex, A,B,C
- A,B
- B,C
etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-04 15:00:00','2019-08-04 08:00:00','2019-08-05 10:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-02 09:30:00','2019-08-03 11:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','C','B','C','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.fillna('-')
.reset_index(drop=True))
Error:
TypeError: value should be a 'Timedelta', 'NaT', or array of those. Got 'str' instead.
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaN
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C '16hrs'
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C '20.5hrs'
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A
CodePudding user response:
I think you can avoid apply
with use DataFrameGroupBy.shift
:
df1 = df.groupby('ID')[['End Time', 'Start Time']].shift()
df['diff'] = df1['End Time'].sub(df1['Start Time'])
print (df)
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
CodePudding user response:
The issue is your fillna
, you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT