I have a dataset similar to this.
df = pd.DataFrame({'CODE':["PNR1", "PNR1", "PNR1", "PNR1","PNR2","PNR2","PNR3" ],
'FROM': ["RDM", "DEN", "FLL", "IAH", "AEP", "COR", "DEL"],
'TO' : [ "DEN", "FLL", "IAH","SFO", "COR", "AEP", "BOM"],
'DEPART':["13-07-2021 13:00:00","13-07-2021 17:50:00","15-07-2021 17:15:00","15-07-2021 19:40:00", "22-09-2021 09:05:00", "22-09-2021 17:05:00", "22-09-2021 06:05:00"],
'ARRIVAL':["13-07-2021 16:26:00","13-07-2021 23:39:00","15-07-2021 18:52:00", "15-07-2021 21:27:00","22-09-2021 10:30:00","22-09-2021 18:20:00","22-09-2021 08:05:00"]})
df['NUMSTOPS'] = df.groupby('CODE')['CODE'].transform('count')
df['DEPART'] = pd.to_datetime(df['DEPART'], format='%d-%m-%Y %H:%M:%S')
df['ARRIVAL'] = pd.to_datetime(df['ARRIVAL'], format='%d-%m-%Y %H:%M:%S')
I want to find the time difference for each group to find out the time difference of 'DEPART' from the previous arrival and show it in hours. I have applied the following:
df['LAYOVER']=df.groupby('CODE').apply(lambda x:x['DEPART']- x['ARRIVAL']).shift(1).fillna('0').reset_index(drop=True)
But that gives me the following output:
CODE FROM TO DEPART ARRIVAL NUMSTOPS LAYOVER
0 PNR1 RDM DEN 2021-07-13 13:00:00 2021-07-13 16:26:00 4 00:00:00
1 PNR1 DEN FLL 2021-07-13 17:50:00 2021-07-13 23:39:00 4 -1 days 20:34:00
2 PNR1 FLL IAH 2021-07-15 17:15:00 2021-07-15 18:52:00 4 -1 days 18:11:00
3 PNR1 IAH SFO 2021-07-15 19:40:00 2021-07-15 21:27:00 4 -1 days 22:23:00
4 PNR2 AEP COR 2021-09-22 09:05:00 2021-09-22 10:30:00 2 -1 days 22:13:00
5 PNR2 COR AEP 2021-09-22 17:05:00 2021-09-22 18:20:00 2 -1 days 22:35:00
6 PNR3 DEL BOM 2021-09-22 06:05:00 2021-09-22 08:05:00 1 -1 days 22:45:00
I don't understand what am I doing wrong and how to proceed with calculating the departure time and the previous arrival for each group.
Expected output:
CodePudding user response:
You have a parentheses out of place. .shift() was outside your parenthesis. fixed here.
df['LAYOVER']=df.groupby('CODE').apply(lambda x:x['DEPART']- x['ARRIVAL'].shift(1)).fillna('0').reset_index(drop=True)
df['LAYOVER'].apply(lambda x: pd.Timedelta(x).seconds)
0 0
1 5040
2 63360
3 2880
4 0
5 23700
6 0
Name: LAYOVER, dtype: int64
this is more concise:
df.groupby('CODE').apply(lambda x:x['DEPART']- x['ARRIVAL'].shift(1)).fillna('0').reset_index(drop=True).dt.seconds