Home > Software engineering >  Pandas Groupby and remove rows with numbers that are close to each other
Pandas Groupby and remove rows with numbers that are close to each other

Time:10-31

I have a data frame df

df = 

Code  Bikes  Year
12    356    2020 
4     378    2020 
2     389    2020
35    378    2021
40    370    2021
32    350    2021

I would like to group the data frame based on Year using df.groupby('Year') and check the close values in column df['Çode'] to find values that are close by at least 3 and retain the row with a maximum value in the df['Bikes'] column out of them.

For instance, in the first group of 2020, values 4 and 2 are close by at least 3 since 4-2=2 ≤ 3 and since 389 (df['Bikes']) corresponding to df['Code'] = 2 is the highest among the two, retain that and drop row where df['code']=4.

The expected out for the given example:

Code  Bikes  Year
12    356    2020  
2     389    2020
35    378    2021
40    370    2021

CodePudding user response:

new_df = df.sort_values(['Year', 'Code'])
new_df['diff'] = new_df['Code'] - new_df.groupby('Year')['Code'].shift()
new_df['cumsum'] = ((new_df['diff'] > 3) | (new_df['diff'].isna())).cumsum()
new_df = new_df.sort_values('Bikes', ascending=False).drop_duplicates(['cumsum']).sort_index()
new_df.drop(columns=['diff', 'cumsum'], inplace=True)

CodePudding user response:

You can first sort values per both columns by DataFrame.sort_values, then create groups by compare differences with treshold 3 and cumulative sum, last use DataFrameGroupBy.idxmax for get maximal indices by Bikes per Year and helper Series:

df1 = df.sort_values(['Year','Code'])
g = df1.groupby('Year')['Code'].diff().gt(3).cumsum()

df2 = df.loc[df1.groupby(['Year', g])['Bikes'].idxmax()].sort_index()
print (df2)
   Code  Bikes  Year
0    12    356  2020
2     2    389  2020
3    35    378  2021
4    40    370  2021
  • Related