I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})
df['cluster'].value_counts() gives the number of occurrences of each value in the column cluster.
Is it possible to retain only the rows whose cluster value has the maximum number of occurrences?
The expected output keeps the rows for john, joe, richard and sam: clusters 1 and 2 are tied for the most occurrences, so all the rows for both clusters need to be retained.
CodePudding user response:
Use Series.mode() together with isin():
# find the most common clusters then filter those clusters
df[df.cluster.isin(df.cluster.mode())]
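
The one-liner above handles ties because Series.mode() returns every value that shares the highest count, not just one of them. A minimal sketch that makes the intermediate result visible, reusing the DataFrame from the question (the variable names modes and res are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# mode() returns a Series holding all values tied for the highest count
modes = df['cluster'].mode()
print(modes.tolist())  # ['1', '2']

# keep only the rows whose cluster is one of those modes
res = df[df['cluster'].isin(modes)]
print(res)
```

Here both '1' and '2' occur twice, so both are returned by mode() and the rows for john, joe, richard and sam are kept.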
CodePudding user response:
Group by 'cluster' and use transform('count') to get a Series of per-cluster occurrence counts that has the same shape and index as the original rows. Then use it as a mask to keep only the rows whose count equals the maximum.
cluster_counts = df.groupby('cluster')['name'].transform('count')
res = df[cluster_counts == cluster_counts.max()]
Output:
>>> res
      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2
Setup:
import pandas as pd
df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})
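
To see why the groupby answer above works, it can help to print the intermediate Series. This is only an illustrative sketch on the same data. The variable mask is a name introduced here. Note that transform('count') counts non-null values of 'name', so if that column could contain NaN, transform('size') may be the safer choice.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

# one count per row, aligned with df's index: each row gets the size of its cluster
cluster_counts = df.groupby('cluster')['name'].transform('count')
print(cluster_counts.tolist())  # [2, 2, 1, 2, 2]

# boolean mask that is True where the row's cluster has the maximum count
mask = cluster_counts == cluster_counts.max()
print(mask.tolist())  # [True, True, False, True, True]

res = df[mask]
```

Because the counts are broadcast back to every row, no isin lookup is needed: the comparison with the maximum directly yields a row mask.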
CodePudding user response:
You can get the maximum count of the cluster values through df['cluster'].value_counts(), then use isin to filter the cluster column:
c = df['cluster'].value_counts()
out = df[df['cluster'].isin(c[c.eq(c.max())].index)]
print(out)
      name cluster
0     john       1
1      joe       2
3  richard       1
4      sam       2
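
If this filter is needed in more than one place, the value_counts approach above could be wrapped in a small function. The sketch below is one possible way to do that. rows_with_most_common is a hypothetical helper name, not a pandas API.

```python
import pandas as pd

def rows_with_most_common(frame, column):
    """Hypothetical helper: keep rows whose `column` value is tied for the highest count."""
    counts = frame[column].value_counts()
    # index labels of every value whose count equals the maximum
    top_values = counts[counts.eq(counts.max())].index
    return frame[frame[column].isin(top_values)]

df = pd.DataFrame({'name': ['john', 'joe', 'bill', 'richard', 'sam'],
                   'cluster': ['1', '2', '3', '1', '2']})

counts = df['cluster'].value_counts()
print(counts.to_dict())

out = rows_with_most_common(df, 'cluster')
print(out['name'].tolist())  # ['john', 'joe', 'richard', 'sam']
```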