Get rows of first and last datetime records based on a column-CodePudding

I have a df with several columns and can't figure out in Python how to get only the rows with the first and last date/time (columns is in pandas datetime64[ns] Dtype) entry for each id. And also how to create a new df with the datetime difference in days?

Here's a df slice:

user_id	length	datetime
d8366132199	20	2021-03-31 17:55:44
d8366132199	40	2021-05-12 13:55:48
d8366132199	10	2021-04-09 16:55:47
d8366132199	70	2021-06-25 17:58:47
d8366132199	80	2021-06-28 18:55:49
e8172920273	10	2021-03-28 18:55:44
e8172920273	50	2021-03-30 18:55:45
e8172920273	60	2021-05-26 18:55:46
e8172920273	90	2021-05-26 18:55:47

Expected output:

user_id	length	datetime
d8366132199	20	2021-03-31 17:55:44
d8366132199	80	2021-06-28 18:55:49
e8172920273	10	2021-03-28 18:55:44
e8172920273	90	2021-05-26 18:55:47

New df with datetime difference, expected output:

user_id	date_diff
d8366132199	276
e8172920273	60

Many thanks!

CodePudding user response：

You can get the idxmin/idxmax per group, then use the values to slice the original dataset:

# get the indices of min/max per group, then flatten as array
idx = df.groupby('user_id')['datetime'].agg(['idxmin', 'idxmax']).values.ravel()
# array([0, 4, 5, 8])

# slice
df.loc[idx]

output:

        user_id  length            datetime
0  d8366132199       20 2021-03-31 17:55:44
4  d8366132199       80 2021-06-28 18:55:49
5  e8172920273       10 2021-03-28 18:55:44
8  e8172920273       90 2021-05-26 18:55:47

CodePudding user response：

For the first question, you can use .nth()

ndf = df.sort_values('datetime').groupby('user_id',as_index = False).nth([0,-1])

Output:

       user_id  length            datetime
5  e8172920273      10 2021-03-28 18:55:44
0  d8366132199      20 2021-03-31 17:55:44
8  e8172920273      90 2021-05-26 18:55:47
4  d8366132199      80 2021-06-28 18:55:49

From there, to answer the second question, you can subtract the last row from the first row.

(ndf.groupby('user_id',as_index = False)
 .agg(date_diff = ('datetime',lambda x: (x.iloc[-1] - x.iloc[0]).days)))

Output:

       user_id  date_diff
0  d8366132199         89
1  e8172920273         59