Retaining pandas DataFrame rows that have the maximum number of item occurrences in a column

I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'name':['john','joe','bill','richard','sam'],
                   'cluster':['1','2','3','1','2']})

df['cluster'].value_counts() gives the number of occurrences of each value in the cluster column.
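For reference, here is a small sketch of what that call produces on this data. The name counts is only illustrative, and the order of tied entries in the printed Series may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# one entry per distinct cluster value, holding how many rows carry it
counts = df['cluster'].value_counts()
print(counts)

# the highest count; more than one cluster may share it
print(counts.max())
```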

Is it possible to retain only the rows whose cluster value has the maximum number of occurrences?

The expected output keeps only the rows for clusters 1 and 2: both have the same (maximum) number of occurrences, so all of their rows need to be retained.

CodePudding user response:

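Use mode() to find the most frequent cluster values (all of them, if there is a tie), then filter with isin():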

# find the most common clusters then filter those clusters
df[df.cluster.isin(df.cluster.mode())]
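A self-contained sketch of the one-liner above, assuming the df from the question. Series.mode() returns every value tied for the highest frequency, which is why both clusters survive the filter. The names top_clusters and res are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# mode() returns all values tied for the highest frequency
top_clusters = df['cluster'].mode()

# keep only the rows whose cluster is one of those values
res = df[df['cluster'].isin(top_clusters)]
print(res)
```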

CodePudding user response:

Group by 'cluster' and use transform('count') to get a Series of per-cluster occurrence counts aligned with the original rows. Then use it as a mask to keep only the rows whose count equals the maximum.

cluster_counts = df.groupby('cluster')['name'].transform('count')
res = df[cluster_counts == cluster_counts.max()]

Output:

>>> res

      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2

Setup:

import pandas as pd

df = pd.DataFrame({'name':['john','joe','bill','richard','sam'],
                   'cluster':['1','2','3','1','2']})
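To make the transform step in this answer more concrete, the sketch below prints the intermediate values for the same df. The name mask is illustrative only. Note that 'count' skips missing values in the name column; transform('size') is an alternative if rows with a missing name should be counted as well.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# transform('count') broadcasts each group's count back onto every row of that group
cluster_counts = df.groupby('cluster')['name'].transform('count')
print(cluster_counts.tolist())  # [2, 2, 1, 2, 2]

# boolean mask: True where the row's cluster has the maximum count
mask = cluster_counts == cluster_counts.max()
print(mask.tolist())  # [True, True, False, True, True]

res = df[mask]
```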

CodePudding user response:

Get the count of each cluster value with df['cluster'].value_counts(), select the values whose count equals the maximum, then use isin to filter the cluster column.

c = df['cluster'].value_counts()

out = df[df['cluster'].isin(c[c.eq(c.max())].index)]

print(out)

      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2
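If this filter is needed in more than one place, the same value_counts approach could be wrapped in a small helper. This is only a sketch: keep_most_frequent is a hypothetical name, not a pandas function, and it assumes the column holds hashable values.

```python
import pandas as pd

def keep_most_frequent(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows whose value in `column` is tied for the highest count."""
    counts = frame[column].value_counts()
    # index labels of every value whose count equals the maximum
    top_values = counts[counts == counts.max()].index
    return frame[frame[column].isin(top_values)]

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

out = keep_most_frequent(df, 'cluster')
print(out)
```

On the example data this keeps the rows with index 0, 1, 3 and 4, matching the output shown above.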