Pandas: get value from column based on max condition to get proper cluster names

Time:12-04

I've successfully clustered my data and am presented with the following dataframe:

     cluster_group  name value
  0              1     A    20 
  1              1     B    30 
  2              1     C    10 
  3              1     D    50 
  4              2     E    20 
  5              2     F    10 
...

For easier exporting, I want to give each cluster_group a name instead of an integer. The name should be taken from the name column of the row with the highest value in that group. So the result should look like this:

     cluster_name  name value
  0             D     A    20 
  1             D     B    30 
  2             D     C    10 
  3             D     D    50 
  4             E     E    20 
  5             E     F    10 
...

How would I do this in the most efficient way?

CodePudding user response:

If each name occurs in only one group, the resulting cluster names are always unique. Set name as the index and take SeriesGroupBy.idxmax per group inside GroupBy.transform:

df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print (df)
  cluster_group name  value
0             D    A     20
1             D    B     30
2             D    C     10
3             D    D     50
4             E    E     20
5             E    F     10
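
For a self-contained check, the same result can also be reached by a different route: take the row label of each group's maximum with idxmax, look up the winning name, and map it back onto every row. This is only a sketch of an alternative, not the code from the answer above; the column name cluster_name follows the question's desired output, and the variable names idx, top and out are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "B", "C", "D", "E", "F"],
    "value": [20, 30, 10, 50, 20, 10],
})

# Row label of the maximum value within each cluster_group
idx = df.groupby("cluster_group")["value"].idxmax()

# Series mapping cluster_group -> name of the row holding the maximum
top = df.loc[idx].set_index("cluster_group")["name"]

# Broadcast the winning name to every row of its group
out = df.assign(cluster_name=df["cluster_group"].map(top))
out = out[["cluster_name", "name", "value"]]
print(out)
```

Unlike the set_index approach, this version keeps the original cluster_group column in df untouched, which can be handy if the integer ids are still needed later.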

If the same name can appear in several groups, different groups may end up with the same cluster name, so those groups are effectively joined together:

print (df)
   cluster_group name  value
0              1    A     20
1              1    E    300 <- max per group 1 is E
2              1    C     10
3              1    D     50
4              2    E     20  <- max per group 2 is E
5              2    F     10

df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print (df)
  cluster_group name  value
0             E    A     20
1             E    E    300
2             E    C     10
3             E    D     50
4             E    E     20
5             E    F     10
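
If joining groups like this is not wanted, one option is to detect which groups share a winning name and make their labels distinct. The following is a minimal sketch under that assumption; the "_<group id>" suffix scheme and the variable names (top, dup, collisions, unique_top) are hypothetical and not part of the answer above.

```python
import pandas as pd

# Sample data where both groups have "E" as the name with the maximum value
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "E", "C", "D", "E", "F"],
    "value": [20, 300, 10, 50, 20, 10],
})

# cluster_group -> name of the row holding the group's maximum value
idx = df.groupby("cluster_group")["value"].idxmax()
top = df.loc[idx].set_index("cluster_group")["name"]

# Groups whose winning name is shared with at least one other group
dup = top.duplicated(keep=False)
collisions = top[dup]
print(collisions)

# Hypothetical disambiguation: append the group id only where names collide
suffix = pd.Series(top.index.astype(str), index=top.index)
unique_top = top.where(~dup, top + "_" + suffix)

df["cluster_name"] = df["cluster_group"].map(unique_top)
print(df)
```

Groups with a unique winning name keep the plain name; only colliding groups get the suffix.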