I have a pandas dataframe with dates and values.
import pandas as pd
df = pd.DataFrame({'date':['11-10','11-10','11-10','12-10','12-10','12-10'],
'id': [1, 1, 2, 1, 1, 2],
'val':[20, 30, 40, 50, 25, 35] })
I'd like the filter the DataFrame to only include top and bottom N rows for each date. Let's say N = 2. Then, the dataframe would discard the 1st row 11-10 20
and 5th row 12-10 25
.
Hoping for a solution that can scale for different values of N
.
CodePudding user response:
You can group the dataframe by date
then call nlargest
for val
column, passing the value of N
:
>>> df.groupby('date')['val'].nlargest(2)
date
11-10 2 40
1 30
12-10 3 50
5 35
Name: val, dtype: int64
If needed, you can call to_frame()
on the resulting Series to convert it to a dataframe.
For the updated question, you can still implement the above code with some additional works to get the ids as well, then inner merge with the original dataframe:
out= (df.set_index('id')
.groupby(['date'])['val']
.nlargest(2)
.to_frame('val')
.reset_index()
.merge(df, how='inner')
)
OUTPUT:
date id val
0 11-10 2 40
1 11-10 1 30
2 12-10 1 50
3 12-10 2 35
CodePudding user response:
You can do:
index_of_N_greatest_vals = df.groupby('date')['val'].nlargest(N).reset_index()['level_1']
df1 = df[df.index.isin(index_of_N_greatest_vals)]
Output:
date id val
1 11-10 1 30
2 11-10 2 40
3 12-10 1 50
5 12-10 2 35
CodePudding user response:
One option is to sort the columns and take the tail on the groupby:
(df.sort_values(['date', 'val'])
.groupby('date', sort = False)
.tail(n=2)
)
date id val
1 11-10 1 30
2 11-10 2 40
5 12-10 2 35
3 12-10 1 50