So, I have a data frame like this (the important column is the third one):
  | ABC | DEF | fruit
-----------------------
1 | 12  | LO  | banana
2 | 45  | KA  | orange
3 | 65  | JU  | banana
4 | 25  | UY  | grape
5 | 23  | TE  | apple
6 | 28  | YT  | orange
7 | 78  | TR  | melon
I want to keep the rows that contain the 5 most frequent fruits and drop the rest, so I built a list of the fruits to keep, like this:
fruits = df['fruit'].value_counts()
fruits_to_keep = fruits[:5].reset_index()
fruits_to_keep.drop(['fruit'], inplace=True, axis=1)
fruits_to_keep = fruits_to_keep.to_numpy()
fruits_to_keep = fruits_to_keep.tolist()
fruits_to_keep
[['banana'], ['orange'], ['apple'], ['melon'], ['grape']]
I have the feeling that I took some unnecessary steps, but in any case, the problem arises when I try to select the rows containing those fruits_to_keep:
df = df.set_index('fruit')
df = df.loc[fruits_to_keep,:]
Then I get a KeyError saying that "None of [Index([('banana',), \n ('orange',), \n ('apple',)...... dtype='object', name='fruit')] are in the [index]"
I also tried:
df[df.fruit in fruits_to_keep]
But then I get the following error: ('Lengths must match to compare', (43987,), (1,))
Note: I actually have 43k rows, with many 'fruits' that I don't want in the dataframe, and 30k rows containing the 5 most frequent 'fruits'.
Thanks in advance!
CodePudding user response:
To keep the rows with the top N values you can use value_counts and isin.
By default, value_counts returns the elements in descending order of frequency.
N = 5
df[df['col'].isin(df['col'].value_counts().index[:N])]
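As a minimal sketch, here is the same idea applied to the sample frame from the question. The frame is rebuilt by hand from the table, and N = 2 is an assumption chosen only so that the filter visibly drops rows on such a small sample (with N = 5 every row here would be kept). The names top_fruits and filtered are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "ABC": [12, 45, 65, 25, 23, 28, 78],
        "DEF": ["LO", "KA", "JU", "UY", "TE", "YT", "TR"],
        "fruit": ["banana", "orange", "banana", "grape", "apple", "orange", "melon"],
    },
    index=range(1, 8),
)

N = 2  # assumption: small N for the demo; the question uses 5

# value_counts() sorts by frequency (descending) by default, so the first
# N index labels are the N most frequent fruits, as a flat list of strings.
top_fruits = df["fruit"].value_counts().index[:N].tolist()

# isin() compares element-wise and returns a boolean mask, one entry per row.
filtered = df[df["fruit"].isin(top_fruits)]

print(top_fruits)
print(filtered)
```

This also suggests why the two attempts in the question fail. fruits_to_keep there is a list of one-element lists ([['banana'], ...]) rather than a flat list of strings, so .loc looks for keys like ('banana',) that are not in the index. And Python's `in` operator does not test each row of a Series separately, which is what isin is for. If several values tie at the cutoff, which of them end up in the top N depends on the order value_counts returns them in.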