Retaining pandas DataFrame rows that have the maximum number of item occurrences in a column

Time:05-30

I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

df['cluster'].value_counts() gives the number of occurrences of each item in the cluster column.

Is it possible to retain only the rows whose cluster value has the maximum number of occurrences?

The expected output was posted as an image, which is not preserved here.

Clusters 1 and 2 have the same number of occurrences, so all the rows for clusters 1 and 2 need to be retained.

CodePudding user response:

Group by 'cluster' and use transform('count') to get a Series of per-cluster occurrence counts that is aligned with the original rows. Then use it as a mask to keep only the rows whose cluster has the maximum count.

cluster_counts = df.groupby('cluster')['name'].transform('count')
res = df[cluster_counts == cluster_counts.max()]

Output:

>>> res

      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2

Setup:

import pandas as pd

df = pd.DataFrame({'name':['john','joe','bill','richard','sam'],
                   'cluster':['1','2','3','1','2']})
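
To make the masking step easier to follow, here is a minimal sketch that prints the intermediate values for the sample data. The variable name mask is introduced here for illustration and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# transform('count') broadcasts each group's count back onto the original rows,
# so the result shares its index with df.
cluster_counts = df.groupby('cluster')['name'].transform('count')
print(cluster_counts.tolist())  # [2, 2, 1, 2, 2]

# Boolean mask (illustrative name): True where the row's cluster is tied
# for the largest count.
mask = cluster_counts == cluster_counts.max()
print(mask.tolist())  # [True, True, False, True, True]

res = df[mask]
print(res)
```

Note that 'count' counts non-null values of name within each cluster. If name could contain missing values, transform('size') counts rows instead; that swap is a suggestion, not something the answer above uses.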

CodePudding user response:

Use Series.mode(), which returns every value tied for the highest frequency:

# find the most common clusters then filter those clusters
df[df.cluster.isin(df.cluster.mode())]
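
As a rough illustration of why this handles ties, the sketch below shows what mode() returns for the sample data before filtering. The name top_clusters is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# Series.mode() returns all values that share the highest frequency,
# so ties between clusters are kept.
top_clusters = df['cluster'].mode()
print(top_clusters.tolist())  # ['1', '2']

# Keep only the rows whose cluster is among the most common ones.
res = df[df['cluster'].isin(top_clusters)]
print(res)
```

Bracket access (df['cluster']) is used here instead of df.cluster only because it also works when a column name clashes with a DataFrame attribute; both forms behave the same for this data.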


CodePudding user response:

You can get the count of each cluster value with df['cluster'].value_counts(), select the values whose count equals the maximum, and then use isin to filter the cluster column:

c = df['cluster'].value_counts()

out = df[df['cluster'].isin(c[c.eq(c.max())].index)]
print(out)

      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2
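
If this filter is needed in more than one place, the same steps could be wrapped in a small helper. This is only a sketch: the function name keep_most_frequent and its parameters are hypothetical, not part of pandas or of the answer above.

```python
import pandas as pd


def keep_most_frequent(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows whose value in `column` is tied for the highest count.

    Hypothetical helper for illustration; it follows the value_counts + isin
    approach shown above.
    """
    counts = frame[column].value_counts()
    # Index of the values whose count equals the maximum (ties included).
    top_values = counts[counts.eq(counts.max())].index
    return frame[frame[column].isin(top_values)]


df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

out = keep_most_frequent(df, 'cluster')
print(out['name'].tolist())  # ['john', 'joe', 'richard', 'sam']
```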