I want to filter the rows with the latest and earliest dates for each item_id using pandas. Thanks in advance for your kind help :)
item_id | favorite | date |
---|---|---|
1049977 | 8 | 2022-06-09 |
1049977 | 8 | 2022-06-10 |
1049977 | 8 | 2022-06-11 |
1049977 | 8 | 2022-06-12 |
1611924 | 168 | 2022-06-09 |
1611924 | 168 | 2022-06-10 |
1611924 | 168 | 2022-06-11 |
1611924 | 168 | 2022-06-12 |
1611924 | 168 | 2022-06-13 |
The ideal output is:
item_id | favorite | date |
---|---|---|
1049977 | 8 | 2022-06-09 |
1049977 | 8 | 2022-06-12 |
1611924 | 168 | 2022-06-09 |
1611924 | 168 | 2022-06-13 |
Raw data is here:
item_id,favorite,date
1049977,8,2022-06-09
1049977,8,2022-06-10
1049977,8,2022-06-11
1049977,8,2022-06-12
1611924,168,2022-06-09
1611924,168,2022-06-10
1611924,168,2022-06-11
1611924,168,2022-06-12
1611924,168,2022-06-13
CodePudding user response:
This is best done using groupby operations. When starting out it might be tempting to do this with a loop, but that will seriously hurt performance as the number of rows grows.
The groupby methods .first() and .last() give you the first and last entry of each group. Note that you need to ensure your dataframe is sorted by whichever column designates your date for this to work.
pd.concat is just an easy way to put the two results together into one dataframe; there may be nicer options.
This snippet:
pd.concat([
    df.groupby('item_id').first(),
    df.groupby('item_id').last()
]).sort_index()
Gives me this output:
favorite date
item_id
1049977 8 2022-06-09
1049977 8 2022-06-12
1611924 168 2022-06-09
1611924 168 2022-06-13
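For reference, here is a complete, runnable version of the snippet above (my sketch, assuming the raw CSV from the question; it sorts by date first, as the note above requires):

```python
import io
import pandas as pd

csv = """item_id,favorite,date
1049977,8,2022-06-09
1049977,8,2022-06-10
1049977,8,2022-06-11
1049977,8,2022-06-12
1611924,168,2022-06-09
1611924,168,2022-06-10
1611924,168,2022-06-11
1611924,168,2022-06-12
1611924,168,2022-06-13
"""
df = pd.read_csv(io.StringIO(csv), parse_dates=['date'])

# Sort by date so .first()/.last() really pick the earliest/latest row per group
df = df.sort_values('date')

out = pd.concat([
    df.groupby('item_id').first(),
    df.groupby('item_id').last()
]).sort_index()
print(out)
```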
CodePudding user response:
You can filter each group for the rows whose date equals the group's min or max, either inside groupby.apply or with transform('min') and transform('max'):
out = (df.groupby('item_id')
         .apply(lambda g: g[g['date'].isin([g['date'].min(), g['date'].max()])])
         .reset_index(drop=True))
# or
groups = df.groupby('item_id')['date']
out = df[(df['date'] == groups.transform('min')) | (df['date'] == groups.transform('max'))]
print(out)
print(out)
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
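A side note (my addition, not part of the answer): the min/max comparison works here even if date is left as a string, because ISO-formatted dates sort lexicographically in the same order as chronologically; converting with pd.to_datetime is still the safer choice for other formats. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['2022-06-09', '2022-06-13', '2022-06-10'])
# Lexicographic min/max coincide with chronological min/max for ISO date strings
assert s.min() == '2022-06-09' and s.max() == '2022-06-13'

# Converting to datetime gives the same result and is robust to other formats
d = pd.to_datetime(s)
assert d.min() == pd.Timestamp('2022-06-09') and d.max() == pd.Timestamp('2022-06-13')
```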
CodePudding user response:
Here is a solution that does not require the input dataframe to be sorted. For each item_id, it obtains the index values of the rows where the date is earliest (minimum) or latest (maximum), and then filters the original dataframe using those index values.
df.iloc[df.groupby('item_id').date.agg(['idxmin','idxmax']).values.ravel()]
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
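One caveat (my addition): idxmin/idxmax return index *labels*, so .iloc only happens to work above because the index is the default RangeIndex. With any other index, .loc is the safe choice. A sketch with an illustrative shifted index:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [1049977] * 4 + [1611924] * 5,
    'favorite': [8] * 4 + [168] * 5,
    'date': pd.to_datetime(['2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-13']),
})

# Give the frame a non-default index; .iloc would now select the wrong rows
df.index = df.index + 100

idx = df.groupby('item_id')['date'].agg(['idxmin', 'idxmax']).values.ravel()
out = df.loc[idx]  # label-based lookup still selects the intended rows
print(out)
```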
CodePudding user response:
df[df.groupby('item_id')['date'].transform(lambda x: (x.max() == x) | (x.min() == x))]
Output:
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
CodePudding user response:
Alternative method:
df = pd.DataFrame({'item_id': [1049977, 1049977, 1049977, 1049977, 1611924, 1611924, 1611924, 1611924, 1611924],
                   'favorite': [8, 8, 8, 8, 168, 168, 168, 168, 168],
                   'date': ['2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12', '2022-06-13']})
df['date'] = pd.to_datetime(df['date'])
df['nrows'] = df.groupby(['item_id', 'favorite']).cumcount() + 1
df['max_row'] = df.groupby(['item_id', 'favorite'])['nrows'].transform('max')
df['min_row'] = df.groupby(['item_id', 'favorite'])['nrows'].transform('min')
df['flag_min'] = df.apply(lambda x: 1 if (x['nrows']==x['min_row']) else 0, axis = 1)
df['flag_max'] = df.apply(lambda x: 1 if (x['nrows']==x['max_row']) else 0, axis = 1)
df1 = df[(df['flag_min'] == 1) | (df['flag_max'] == 1)].drop(['nrows', 'max_row', 'min_row', 'flag_min', 'flag_max'], axis=1)
print(df1)