I have a data frame df
df =
Code Bikes Year
12 356 2020
4 378 2020
2 389 2020
35 378 2021
40 370 2021
32 350 2021
I would like to group the data frame based on Year using df.groupby('Year') and check the close values in column df['Çode'] to find values that are close by at least 3 and retain the row with a maximum value in the df['Bikes'] column out of them.
For instance, in the first group of 2020, values 4 and 2 are close by at least 3 since 4-2=2 ≤ 3 and since 389 (df['Bikes']) corresponding to df['Code'] = 2 is the highest among the two, retain that and drop row where df['code']=4.
The expected out for the given example:
Code Bikes Year
12 356 2020
2 389 2020
35 378 2021
40 370 2021
CodePudding user response:
new_df = df.sort_values(['Year', 'Code'])
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)
CodePudding user response:
You can first sort values per both columns by DataFrame.sort_values
, then create groups by compare differences with treshold 3
and cumulative sum, last use DataFrameGroupBy.idxmax
for get maximal indices by Bikes
per Year
and helper Series:
df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()
df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print (df2)
Code Bikes Year
0 12 356 2020
2 2 389 2020
3 35 378 2021
4 40 370 2021