I have a df with several columns and can't figure out in Python how to get only the rows with the first and last date/time (columns is in pandas datetime64[ns] Dtype) entry for each id. And also how to create a new df with the datetime difference in days?
Here's a df slice:
user_id | length | datetime |
---|---|---|
d8366132199 | 20 | 2021-03-31 17:55:44 |
d8366132199 | 40 | 2021-05-12 13:55:48 |
d8366132199 | 10 | 2021-04-09 16:55:47 |
d8366132199 | 70 | 2021-06-25 17:58:47 |
d8366132199 | 80 | 2021-06-28 18:55:49 |
e8172920273 | 10 | 2021-03-28 18:55:44 |
e8172920273 | 50 | 2021-03-30 18:55:45 |
e8172920273 | 60 | 2021-05-26 18:55:46 |
e8172920273 | 90 | 2021-05-26 18:55:47 |
Expected output:
user_id | length | datetime |
---|---|---|
d8366132199 | 20 | 2021-03-31 17:55:44 |
d8366132199 | 80 | 2021-06-28 18:55:49 |
e8172920273 | 10 | 2021-03-28 18:55:44 |
e8172920273 | 90 | 2021-05-26 18:55:47 |
New df with datetime difference, expected output:
user_id | date_diff |
---|---|
d8366132199 | 276 |
e8172920273 | 60 |
Many thanks!
CodePudding user response:
You can get the idxmin
/idxmax
per group, then use the values to slice the original dataset:
# get the indices of min/max per group, then flatten as array
idx = df.groupby('user_id')['datetime'].agg(['idxmin', 'idxmax']).values.ravel()
# array([0, 4, 5, 8])
# slice
df.loc[idx]
output:
user_id length datetime
0 d8366132199 20 2021-03-31 17:55:44
4 d8366132199 80 2021-06-28 18:55:49
5 e8172920273 10 2021-03-28 18:55:44
8 e8172920273 90 2021-05-26 18:55:47
CodePudding user response:
For the first question, you can use .nth()
ndf = df.sort_values('datetime').groupby('user_id',as_index = False).nth([0,-1])
Output:
user_id length datetime
5 e8172920273 10 2021-03-28 18:55:44
0 d8366132199 20 2021-03-31 17:55:44
8 e8172920273 90 2021-05-26 18:55:47
4 d8366132199 80 2021-06-28 18:55:49
From there, to answer the second question, you can subtract the last row from the first row.
(ndf.groupby('user_id',as_index = False)
.agg(date_diff = ('datetime',lambda x: (x.iloc[-1] - x.iloc[0]).days)))
Output:
user_id date_diff
0 d8366132199 89
1 e8172920273 59