Pandas - Keeping rows with determined values and dropping the rest

So, I have a data frame like this (the important column is the third one):

   |  ABC  |  DEF  |  fruit |
----------------------------
1  |  12   |  LO   | banana
2  |  45   |  KA   | orange
3  |  65   |  JU   | banana
4  |  25   |  UY   | grape
5  |  23   |  TE   | apple
6  |  28   |  YT   | orange
7  |  78   |  TR   | melon

I want to keep the rows that have the 5 most occurring fruits and drop the rest, so I made a variable to hold those fruits to keep in a list, like this:

fruits = df['fruit'].value_counts()
fruits_to_keep = fruits[:5].reset_index()
fruits_to_keep.drop(['fruit'], inplace=True, axis=1)
fruits_to_keep = fruits_to_keep.to_numpy()
fruits_to_keep = fruits_to_keep.tolist()
fruits_to_keep

[['banana'], ['orange'], ['apple'], ['melon'], ['grape']]

I have the feeling that I took unnecessary steps, but anyway, the problem arises when I try to select the rows containing those fruits_to_keep:

df = df.set_index('fruit')
df = df.loc[fruits_to_keep,:]

Then I get a KeyError saying that "None of [Index([('banana',), \n ('orange',), \n ('apple',)...... dtype='object', name='fruit')] are in the [index]"

I also tried:

df[df.fruit in fruits_to_keep]

But then I get the following error: ('Lengths must match to compare', (43987,), (1,))

Note: I actually have 43k rows, many 'fruits' that I don't want in the dataframe, and 30k rows with the 5 most occurring 'fruits'.

Thanks in advance!

CodePudding user response:

To keep only the rows whose value is among the N most frequent, you can combine value_counts and isin.

By default, value_counts returns the elements in descending order of frequency, so the first N entries of its index are the N most common values.

N = 5
df[df['col'].isin(df['col'].value_counts().index[:N])]
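For illustration, here is a minimal sketch of the same idea applied to the sample table from the question. The DataFrame is rebuilt by hand from that table, and the names top_fruits and filtered are only illustrative. N is set to 2 here as an assumption for the demo, because the sample contains just five distinct fruits and N = 5 would keep every row.

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame(
    {
        "ABC": [12, 45, 65, 25, 23, 28, 78],
        "DEF": ["LO", "KA", "JU", "UY", "TE", "YT", "TR"],
        "fruit": ["banana", "orange", "banana", "grape", "apple", "orange", "melon"],
    },
    index=range(1, 8),
)

N = 2  # demo value; the question itself uses the top 5

# value_counts() sorts by frequency (descending); its index holds the fruit names
top_fruits = df["fruit"].value_counts().index[:N]

# isin() returns a boolean mask with one entry per row
filtered = df[df["fruit"].isin(top_fruits)]
print(filtered)
```

This also sidesteps both errors from the question: isin takes a flat collection of values rather than the list of one-element lists held in fruits_to_keep, and it tests membership row by row, which Python's `in` operator does not do for a Series.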