I want to filter the rows with the latest and earliest dates for each item_id using pandas. Thanks in advance for your kind help :)
item_id | favorite | date |
---|---|---|
1049977 | 8 | 2022-06-09 |
1049977 | 8 | 2022-06-10 |
1049977 | 8 | 2022-06-11 |
1049977 | 8 | 2022-06-12 |
1611924 | 168 | 2022-06-09 |
1611924 | 168 | 2022-06-10 |
1611924 | 168 | 2022-06-11 |
1611924 | 168 | 2022-06-12 |
1611924 | 168 | 2022-06-13 |
The ideal output is:
item_id | favorite | date |
---|---|---|
1049977 | 8 | 2022-06-09 |
1049977 | 8 | 2022-06-12 |
1611924 | 168 | 2022-06-09 |
1611924 | 168 | 2022-06-13 |
Raw data is here:
item_id,favorite,date
1049977,8,2022-06-09
1049977,8,2022-06-10
1049977,8,2022-06-11
1049977,8,2022-06-12
1611924,168,2022-06-09
1611924,168,2022-06-10
1611924,168,2022-06-11
1611924,168,2022-06-12
1611924,168,2022-06-13
CodePudding user response:
This is best done using groupby operations. When starting out it might be tempting to do this with a loop, but that will seriously hurt performance as the number of rows grows.
The groupby methods .first() and .last() give you the first and last entry of each group. Note that you need to ensure your dataframe is sorted by whichever column designates your date for this to work.
pd.concat is just an easy way to put the two results together into one dataframe; there may be nicer options.
This snippet:
pd.concat([
    df.groupby('item_id').first(),
    df.groupby('item_id').last()
]).sort_index()
Gives me this output:
favorite date
item_id
1049977 8 2022-06-09
1049977 8 2022-06-12
1611924 168 2022-06-09
1611924 168 2022-06-13
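For reference, here is a complete, runnable version of the snippet above (my sketch, assuming the raw CSV from the question; it sorts by date first, as the note above requires):

```python
import io
import pandas as pd

csv = """item_id,favorite,date
1049977,8,2022-06-09
1049977,8,2022-06-10
1049977,8,2022-06-11
1049977,8,2022-06-12
1611924,168,2022-06-09
1611924,168,2022-06-10
1611924,168,2022-06-11
1611924,168,2022-06-12
1611924,168,2022-06-13
"""
df = pd.read_csv(io.StringIO(csv), parse_dates=['date'])

# Sort by date so .first()/.last() really pick the earliest/latest row per group
df = df.sort_values('date')

out = pd.concat([
    df.groupby('item_id').first(),
    df.groupby('item_id').last()
]).sort_index()
print(out)
```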
CodePudding user response:
You can filter each group for the rows whose date equals the group's min or max, either inside groupby.apply or with transform('min') and transform('max'):
out = (df.groupby('item_id')
         .apply(lambda g: g[g['date'].isin([g['date'].min(), g['date'].max()])])
         .reset_index(drop=True))
# or
groups = df.groupby('item_id')['date']
out = df[(df['date'] == groups.transform('min')) | (df['date'] == groups.transform('max'))]
print(out)
print(out)
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
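A side note (my addition, not part of the answer): the min/max comparison works here even if date is left as a string, because ISO-formatted dates sort lexicographically in the same order as chronologically; converting with pd.to_datetime is still the safer choice for other formats. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['2022-06-09', '2022-06-13', '2022-06-10'])
# Lexicographic min/max coincide with chronological min/max for ISO date strings
assert s.min() == '2022-06-09' and s.max() == '2022-06-13'

# Converting to datetime gives the same result and is robust to other formats
d = pd.to_datetime(s)
assert d.min() == pd.Timestamp('2022-06-09') and d.max() == pd.Timestamp('2022-06-13')
```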
CodePudding user response:
Here is a solution that does not require the input dataframe to be sorted. For each item_id, it obtains the index values of the rows where the date is earliest (minimum) or latest (maximum), and then filters the original dataframe using those index values.
df.iloc[df.groupby('item_id').date.agg(['idxmin','idxmax']).values.ravel()]
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
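One caveat (my addition): idxmin/idxmax return index *labels*, so .iloc only happens to work above because the index is the default RangeIndex. With any other index, .loc is the safe choice. A sketch with an illustrative shifted index:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [1049977] * 4 + [1611924] * 5,
    'favorite': [8] * 4 + [168] * 5,
    'date': pd.to_datetime(['2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-13']),
})

# Give the frame a non-default index; .iloc would now select the wrong rows
df.index = df.index + 100

idx = df.groupby('item_id')['date'].agg(['idxmin', 'idxmax']).values.ravel()
out = df.loc[idx]  # label-based lookup still selects the intended rows
print(out)
```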
CodePudding user response:
df[df.groupby('item_id')['date'].transform(lambda x: (x.max() == x) | (x.min() == x))]
Output:
item_id favorite date
0 1049977 8 2022-06-09
3 1049977 8 2022-06-12
4 1611924 168 2022-06-09
8 1611924 168 2022-06-13
CodePudding user response:
Alternative method:
df = pd.DataFrame({'item_id': [1049977, 1049977, 1049977, 1049977, 1611924, 1611924, 1611924, 1611924, 1611924],
                   'favorite': [8, 8, 8, 8, 168, 168, 168, 168, 168],
                   'date': ['2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
                            '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12', '2022-06-13']})
df['date'] = pd.to_datetime(df['date'])
df['nrows'] = df.groupby(['item_id', 'favorite']).cumcount() + 1
df['max_row'] = df.groupby(['item_id', 'favorite'])['nrows'].transform('max')
df['min_row'] = df.groupby(['item_id', 'favorite'])['nrows'].transform('min')
df['flag_min'] = df.apply(lambda x: 1 if (x['nrows']==x['min_row']) else 0, axis = 1)
df['flag_max'] = df.apply(lambda x: 1 if (x['nrows']==x['max_row']) else 0, axis = 1)
df1 = df[(df['flag_min'] == 1) | (df['flag_max'] == 1)].drop(['nrows', 'max_row', 'min_row', 'flag_min', 'flag_max'], axis=1)
print(df1)