Get the maximum value per group and elbow point of a sequence of values in a DataFrame-CodePudding

Let it be the following Python Panda DataFrame. I want for each num_clusters group, to keep the row with the highest calinski_harabasz score.

                              params num_clusters  calinski_harabasz 
0  {'max_iter': 200, 'n_clusters': 2}            2        4209.697651  
1  {'max_iter': 700, 'n_clusters': 2}            2        4209.697651    
2  {'max_iter': 200, 'n_clusters': 3}            3        5334.854274  
3  {'max_iter': 700, 'n_clusters': 3}            3        5436.854274 
4  {'max_iter': 200, 'n_clusters': 4}            4        4702.497651   
5  {'max_iter': 700, 'n_clusters': 4}            4        4709.697651   
6  {'max_iter': 200, 'n_clusters': 5}            5        5636.855534    
7  {'max_iter': 700, 'n_clusters': 5}            5        5736.854274

First resulting DataFrame: In case two rows have the same num_clusters and calinski value, delete any of them.

                              params num_clusters  calinski_harabasz 
0  {'max_iter': 200, 'n_clusters': 2}            2        4209.697651    
3  {'max_iter': 700, 'n_clusters': 3}            3        5436.854274  
5  {'max_iter': 700, 'n_clusters': 4}            4        4709.697651    
7  {'max_iter': 700, 'n_clusters': 5}            5        5736.854274

Then, assuming that the num_cluster values are sorted in ascending order. I will return the first row such that the next value is lower than its value.

                              params num_clusters  calinski_harabasz    
3  {'max_iter': 700, 'n_clusters': 3}            3        5436.854274

If I had the same problem but instead of the maximum you wanted to take into account the minimum, how would you do it? That is, in the first part, to keep the smallest of each group. In the second part, keep the value such that the next one is greater than the current one.

CodePudding user response：

IIUC, you can use:

out = (df
   # min of each group
   .loc[df.groupby('num_clusters')['calinski_harabasz'].idxmin()]
   # score > to next one
   .loc[lambda d: d['calinski_harabasz'].gt(d['calinski_harabasz'].shift())]
   .head(1) # first row
)

output:

                               params  num_clusters  calinski_harabasz
2  {'max_iter': 200, 'n_clusters': 3}             3        5334.854274

CodePudding user response：

Probably not as good as @mozway (since he is literally a super computer in Silicon Valley haha), but this will also give you the results you want

df['Max'] = df.groupby('num_clusters')['calinski_harabasz'].transform(max)
df['Check'] = np.where(df['calinski_harabasz'] == df['Max'], True, False)
df = df.loc[df['Check'] == True]
df = df.groupby('num_clusters').first().reset_index()
df['Max_Shift'] = df['Max'].shift(-1)
df['Check'] = np.where(df['calinski_harabasz'] > df['Max_Shift'], True, False)
df = df.loc[df['Check'] == True]
df = df.groupby('Check').first().reset_index()
df