I have a Python function, myGridSearch, that creates a DataFrame with every combination of parameters for a specified clustering algorithm.
The function is called as follows:
fixed_params = {"random_state": 1234}
param_grid = {"n_clusters": range(2,4), "max_iter": [200, 300]}
dataset = myGridSearch(df, fixed_params, param_grid, "KMeans")
print(dataset)
The function returns the following pandas DataFrame:
| params | num_cluster | silhouette |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | ---------- |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 2 | 0.854996 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 2 | 0.854996 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 3 | 0.742472 |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0} | 3 | 0.742472 |
Once this DataFrame is obtained, I would like the 'params' column to contain only the parameters that are changing, that is, the ones stored in param_grid. The resulting DataFrame would look something like this:
| params | num_cluster | silhouette |
| ------------------------------------- | ------------ | ---------- |
| {'max_iter': 200, 'n_clusters': 2} | 2 | 0.854996 |
| {'max_iter': 300, 'n_clusters': 2} | 2 | 0.854996 |
| {'max_iter': 200, 'n_clusters': 3} | 3 | 0.742472 |
| {'max_iter': 300, 'n_clusters': 3} | 3 | 0.742472 |
If you need me to post the code for the myGridSearch function, let me know in the comments.
CodePudding user response:
IIUC, you can use pandas.json_normalize to create one column per key in "params", keep only the columns that have more than one unique value using nunique and boolean indexing, and finally convert back to dictionaries with to_dict:
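import pandas as pd
# Expand each params dict into one column per parameter
df2 = pd.json_normalize(dataset['params'])
# Keep only the columns with more than one unique value, then rebuild one dict per row
dataset['params'] = pd.Series(df2.loc[:, df2.nunique().gt(1)]
                                 .to_dict(orient='index'))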
output:
params num_cluster silhouette
0 {'max_iter': 200, 'n_clusters': 2} 2 0.854996
1 {'max_iter': 300, 'n_clusters': 2} 2 0.854996
2 {'max_iter': 200, 'n_clusters': 3} 3 0.742472
3 {'max_iter': 300, 'n_clusters': 3} 3 0.742472
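To try this without the myGridSearch function, here is a minimal self-contained sketch of the same idea. The sample DataFrame is a hypothetical stand-in with a shortened set of parameters, not the real output of myGridSearch, and it uses orient="records" with a plain list instead of a Series so that the result does not depend on the index of dataset:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by myGridSearch
# (only a few of the KMeans parameters are included to keep it short).
dataset = pd.DataFrame({
    "params": [
        {"init": "k-means++", "max_iter": m, "n_clusters": k,
         "random_state": 1234, "tol": 0.0001}
        for k in (2, 3) for m in (200, 300)
    ],
    "num_cluster": [2, 2, 3, 3],
    "silhouette": [0.854996, 0.854996, 0.742472, 0.742472],
})

# One column per parameter; tolist() avoids relying on the Series index.
df2 = pd.json_normalize(dataset["params"].tolist())

# Boolean mask over columns: True where a parameter takes more than one value.
varying = df2.loc[:, df2.nunique().gt(1)]

# Rebuild one dict per row, in row order, and write it back.
dataset["params"] = varying.to_dict(orient="records")

print(dataset)
```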
intermediate:
df2.nunique()
algorithm 1
copy_x 1
init 1
max_iter 2
n_clusters 2
n_init 1
random_state 1
tol 1
verbose 1
dtype: int64
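Since the question asks specifically for the parameters stored in param_grid, another option is to pick those keys directly instead of inferring them from the data. This is only a sketch, assuming param_grid is still in scope and every one of its keys is present in each params dict. One difference from the nunique approach: a grid parameter with a single candidate value would be kept here, whereas nunique would drop it. The sample DataFrame is again a hypothetical stand-in for the output of myGridSearch:

```python
import pandas as pd

param_grid = {"n_clusters": range(2, 4), "max_iter": [200, 300]}

# Hypothetical stand-in for the DataFrame returned by myGridSearch.
dataset = pd.DataFrame({
    "params": [
        {"init": "k-means++", "max_iter": m, "n_clusters": k,
         "random_state": 1234, "tol": 0.0001}
        for k in (2, 3) for m in (200, 300)
    ],
    "num_cluster": [2, 2, 3, 3],
    "silhouette": [0.854996, 0.854996, 0.742472, 0.742472],
})

# Keep only the keys that appear in param_grid; sorting the keys gives a
# stable, alphabetical order inside each dict.
keys = sorted(param_grid)
dataset["params"] = [{k: p[k] for k in keys} for p in dataset["params"]]

print(dataset)
```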