Let it be the following Python Panda DataFrame. I want for each num_clusters
group, to keep the row with the highest calinski_harabasz score.
params num_clusters calinski_harabasz
0 {'max_iter': 200, 'n_clusters': 2} 2 4209.697651
1 {'max_iter': 700, 'n_clusters': 2} 2 4209.697651
2 {'max_iter': 200, 'n_clusters': 3} 3 5334.854274
3 {'max_iter': 700, 'n_clusters': 3} 3 5436.854274
4 {'max_iter': 200, 'n_clusters': 4} 4 4702.497651
5 {'max_iter': 700, 'n_clusters': 4} 4 4709.697651
6 {'max_iter': 200, 'n_clusters': 5} 5 5636.855534
7 {'max_iter': 700, 'n_clusters': 5} 5 5736.854274
First resulting DataFrame: In case two rows have the same num_clusters
and calinski value, delete any of them.
params num_clusters calinski_harabasz
0 {'max_iter': 200, 'n_clusters': 2} 2 4209.697651
3 {'max_iter': 700, 'n_clusters': 3} 3 5436.854274
5 {'max_iter': 700, 'n_clusters': 4} 4 4709.697651
7 {'max_iter': 700, 'n_clusters': 5} 5 5736.854274
Then, assuming that the num_cluster
values are sorted in ascending order. I will return the first row such that the next value is lower than its value.
params num_clusters calinski_harabasz
3 {'max_iter': 700, 'n_clusters': 3} 3 5436.854274
If I had the same problem but instead of the maximum you wanted to take into account the minimum, how would you do it? That is, in the first part, to keep the smallest of each group. In the second part, keep the value such that the next one is greater than the current one.
CodePudding user response:
IIUC, you can use:
out = (df
# min of each group
.loc[df.groupby('num_clusters')['calinski_harabasz'].idxmin()]
# score > to next one
.loc[lambda d: d['calinski_harabasz'].gt(d['calinski_harabasz'].shift())]
.head(1) # first row
)
output:
params num_clusters calinski_harabasz
2 {'max_iter': 200, 'n_clusters': 3} 3 5334.854274
CodePudding user response:
Probably not as good as @mozway (since he is literally a super computer in Silicon Valley haha), but this will also give you the results you want
df['Max'] = df.groupby('num_clusters')['calinski_harabasz'].transform(max)
df['Check'] = np.where(df['calinski_harabasz'] == df['Max'], True, False)
df = df.loc[df['Check'] == True]
df = df.groupby('num_clusters').first().reset_index()
df['Max_Shift'] = df['Max'].shift(-1)
df['Check'] = np.where(df['calinski_harabasz'] > df['Max_Shift'], True, False)
df = df.loc[df['Check'] == True]
df = df.groupby('Check').first().reset_index()
df