I have the following sample DataFrame:
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                   'TimeStamp': ['2022-08-02T17:33:44.358Z',
                                 '2022-08-02T17:33:44.600Z',
                                 '2022-08-02T17:33:44.814Z',
                                 '2022-08-02T17:33:45.028Z']})
I want to group by ID and get the timedelta between consecutive timestamps. I managed to get something close to the wanted series with the code below, but it takes quite a long time. Is there a more efficient way to do it?
df.assign(post_data=df['TimeStamp'].shift(1)).groupby(['ID'])[['TimeStamp', 'post_data']].apply(lambda x: (x.iloc[:, 0] - x.iloc[:, 1]).to_frame('diff'))
Wanted series:
{'diff': {0: NaT,
          1: Timedelta('0 days 00:00:00.242000'),
          2: Timedelta('0 days 00:00:00.214000'),
          3: Timedelta('0 days 00:00:00.214000')}}
CodePudding user response:
Here is one way to do it. By the way, if you group by ID, the desired result you shared is incorrect: the third row should be NaT rather than a Timedelta, since it is the first row of a different ID and has no previous timestamp to subtract.
# convert the TimeStamp strings to datetime values
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
# build post_data with a vectorized per-group shift instead of a lambda; this avoids calling Python code once per group
df['post_data'] = df.groupby('ID')['TimeStamp'].shift(1)
# finally, take the difference between each timestamp and the previous one in the same group
df['diff'] = df['TimeStamp'].sub(df['post_data'])
df
  ID                        TimeStamp                        post_data                   diff
0  A 2022-08-02 17:33:44.358000+00:00                              NaT                    NaT
1  A 2022-08-02 17:33:44.600000+00:00 2022-08-02 17:33:44.358000+00:00 0 days 00:00:00.242000
2  B 2022-08-02 17:33:44.814000+00:00                              NaT                    NaT
3  B 2022-08-02 17:33:45.028000+00:00 2022-08-02 17:33:44.814000+00:00 0 days 00:00:00.214000
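If the intermediate post_data column is not needed, the same per-group difference can probably be written more compactly with the groupby `diff()` method. The sketch below rebuilds the sample frame from the question so it runs on its own; the names `diff_ungrouped` and `wanted` are only illustrative, and it assumes a reasonably recent pandas version in which `diff()` on a grouped datetime column is supported.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B'],
    'TimeStamp': ['2022-08-02T17:33:44.358Z',
                  '2022-08-02T17:33:44.600Z',
                  '2022-08-02T17:33:44.814Z',
                  '2022-08-02T17:33:45.028Z'],
})

# Parse the ISO 8601 strings into timezone-aware datetimes
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Per-group difference: each row minus the previous row with the same ID.
# The first row of every group has no predecessor, so it becomes NaT.
df['diff'] = df.groupby('ID')['TimeStamp'].diff()

# Illustrative only: difference that ignores the ID boundaries
diff_ungrouped = df['TimeStamp'].diff()

print(df)
print(diff_ungrouped)
```

The ungrouped `df['TimeStamp'].diff()` is what reproduces the series shown in the question, where row 2 is computed across the boundary between A and B. The grouped version matches the diff column in the output above.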