I've successfully clustered my data and am presented with the following dataframe:
cluster_group name value
0 1 A 20
1 1 B 30
2 1 C 10
3 1 D 50
4 2 E 20
5 2 F 10
...
To make exporting easier, I want to give each cluster_group a name instead of an integer. The name should be taken from the name column of the row with the highest value in that cluster. So the result should look like this:
cluster_name name value
0 D A 20
1 D B 30
2 D C 10
3 D D 50
4 E E 20
5 E F 10
...
How would I do this in the most efficient way?
CodePudding user response:
If each name occurs in only one group, every group gets a distinct label. Set name as the index and call DataFrameGroupBy.idxmax per group through GroupBy.transform:
df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print(df)
cluster_group name value
0 D A 20
1 D B 30
2 D C 10
3 D D 50
4 E E 20
5 E F 10
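For a version that can be run as-is, here is a minimal sketch that rebuilds the sample frame from the question and writes the label to a separate cluster_name column, as the question asks. It swaps transform('idxmax') for a plain idxmax followed by Series.map, which is an alternative formulation and not the code above; the variable names top_rows and top_names are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "B", "C", "D", "E", "F"],
    "value": [20, 30, 10, 50, 20, 10],
})

# Row label of the highest value within each cluster
top_rows = df.groupby("cluster_group")["value"].idxmax()

# Lookup table: cluster id -> name of the row with the highest value
top_names = df.loc[top_rows].set_index("cluster_group")["name"]

# Broadcast the label back to every row of its cluster
df["cluster_name"] = df["cluster_group"].map(top_names)

print(df["cluster_name"].tolist())  # ['D', 'D', 'D', 'D', 'E', 'E']
```

Keeping the integer cluster_group column next to the new label means the original clustering is still available after the export.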
If the same name can occur in more than one group, different groups can end up with the same label, so those groups are effectively joined together:
print (df)
cluster_group name value
0 1 A 20
1 1 E 300 <- max per group 1 is E
2 1 C 10
3 1 D 50
4 2 E 20 <- max per group 2 is E
5 2 F 10
df['cluster_group'] = (df.set_index('name')
                         .groupby('cluster_group')['value']
                         .transform('idxmax')
                         .to_numpy())
print(df)
cluster_group name value
0 E A 20
1 E E 300
2 E C 10
3 E D 50
4 E E 20
5 E F 10
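If joined groups would be a problem for the export, one way to spot them beforehand is to check whether two clusters resolve to the same label. The sketch below assumes the second sample frame from this answer; top_names and shared are illustrative names, and the check is an addition, not part of the approach above.

```python
import pandas as pd

# Second sample frame: the name E has the highest value in both clusters
df = pd.DataFrame({
    "cluster_group": [1, 1, 1, 1, 2, 2],
    "name": ["A", "E", "C", "D", "E", "F"],
    "value": [20, 300, 10, 50, 20, 10],
})

# Lookup table: cluster id -> name of the row with the highest value
top_rows = df.groupby("cluster_group")["value"].idxmax()
top_names = df.loc[top_rows].set_index("cluster_group")["name"]

# Clusters whose label is also used by at least one other cluster
shared = top_names[top_names.duplicated(keep=False)]

print(shared.index.tolist())  # [1, 2]
```

An empty result means every cluster keeps a distinct label. Otherwise the listed clusters would be merged under one name.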