I have a dataframe that looks like this:
commitdates api_spec_id/
0 2021-04-07 84
1 2021-05-31 84
2 2021-06-21 84
3 2021-06-18 84
4 2020-12-06 124
commits commitDate
0 32 2021-04-07 12:52:56
1 32 2021-05-31 03:12:37
2 32 2021-06-21 06:50:33
3 32 2021-06-18 05:11:23
4 37 2020-12-06 20:35:45
I want to calculate the time interval elapsed between the first commit and last commit, api_spec_id
corresponds to the relevant API, each API has different commits, hence I want to find the first and last and calculate the interval between them.
My desired output is :
api_spec_id Age (in days)
0 84 89
1 84 89
2 84 89
3 84 67
4 124 56
I tried doing the following after scanning through similar posts here on stack:
gb = final_api.groupby('api_spec_id')['commitDate']
(gb.max() - gb.min()) / pd.Timedelta(days=1)
and got the following output:
api_spec_id
84 74.748345
124 22.486979
164 124.080359
184 921.732488
214 11.994167
...
224530 1.987951
224606 8.221690
224613 67.541366
224627 151.838333
224665 657.721481
And another method:
s = final_api.groupby(['api_spec_id','commitdates'])['timestamp'].agg(['min','max']); s['max']-s['min']
which returned me 0 days.
I am not sure if this is correct, and also I would like to append this result to a new dataframe column, but not sure how to do this. Any help would be appreciated.
CodePudding user response:
Use groupby.transform
with min
/max
(or first
/last
if you really want the order, not values to matter)):
# pre-requisite
df[['commitdates', 'commitDate']] = df[['commitdates', 'commitDate']].apply(pd.to_datetime)
g = df.groupby('api_spec_id')['commitdates']
df['Age (in days)'] = g.transform('max').sub(g.transform('min'))
Output:
commitdates api_spec_id commits commitDate Age (in days)
0 2021-04-07 84 32 2021-04-07 12:52:56 75 days
1 2021-05-31 84 32 2021-05-31 03:12:37 75 days
2 2021-06-21 84 32 2021-06-21 06:50:33 75 days
3 2021-06-18 84 32 2021-06-18 05:11:23 75 days
4 2020-12-06 124 37 2020-12-06 20:35:45 0 days