Reduce the values of a dictionary included as a column in a pandas DataFrame

I have a Python function, myGridSearch, that creates a DataFrame with one row per combination of parameters for a specified clustering algorithm.

The function is called as follows:

fixed_params = {"random_state": 1234} 
param_grid = {"n_clusters": range(2,4), "max_iter": [200, 300]}

dataset = myGridSearch(df, fixed_params, param_grid, "KMeans")
print(dataset)

The function returns the following pandas DataFrame:

| params                                                                                                                                                           | num_cluster  | silhouette |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | ---------- |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   |

Once this DataFrame is obtained, I would like the 'params' column to contain only the parameters that vary, that is, the ones stored in param_grid. The resulting DataFrame would look like this:

| params                                | num_cluster  | silhouette |
| ------------------------------------- | ------------ | ---------- |
| {'max_iter': 200, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 300, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 200, 'n_clusters': 3}    | 3            | 0.742472   | 
| {'max_iter': 300, 'n_clusters': 3}    | 3            | 0.742472   |

If you need me to share the code for the myGridSearch function, let me know in the comments.

CodePudding user response：

IIUC, you can use pandas.json_normalize to expand "params" into one column per parameter, keep only the columns that have more than one distinct value using nunique and boolean indexing, and finally convert each row back to a dictionary with to_dict:

import pandas as pd

# one column per parameter; keep only those with more than one distinct value
df2 = pd.json_normalize(dataset['params'])
dataset['params'] = pd.Series(df2.loc[:, df2.nunique().gt(1)]
                                 .to_dict(orient='index'))

output:

                               params  num_cluster  silhouette
0  {'max_iter': 200, 'n_clusters': 2}            2    0.854996
1  {'max_iter': 300, 'n_clusters': 2}            2    0.854996
2  {'max_iter': 200, 'n_clusters': 3}            3    0.742472
3  {'max_iter': 300, 'n_clusters': 3}            3    0.742472
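
Since the code for myGridSearch is not shown, the snippet above can be tried on a hand-built stand-in. The sketch below is only an illustration: the `base` dictionary and the `rows` list are hypothetical and carry just a few of the parameters from the question. It also copies the index of `dataset` onto `df2`, which is an extra precaution (not part of the snippet above) so that the final assignment still lines up if `dataset` does not use the default 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-in for the output of myGridSearch (not shown in the question).
base = {"algorithm": "auto", "init": "k-means++", "n_init": 10, "random_state": 1234}
rows = [
    ({**base, "max_iter": 200, "n_clusters": 2}, 2, 0.854996),
    ({**base, "max_iter": 300, "n_clusters": 2}, 2, 0.854996),
    ({**base, "max_iter": 200, "n_clusters": 3}, 3, 0.742472),
    ({**base, "max_iter": 300, "n_clusters": 3}, 3, 0.742472),
]
dataset = pd.DataFrame(rows, columns=["params", "num_cluster", "silhouette"])

# One column per parameter, one row per parameter combination.
df2 = pd.json_normalize(dataset["params"].tolist())
df2.index = dataset.index  # reuse the row labels so the assignment below aligns

# Keep only the parameters that take more than one distinct value.
varying = df2.loc[:, df2.nunique().gt(1)]

# Turn each remaining row back into a dictionary, keyed by row label.
dataset["params"] = pd.Series(varying.to_dict(orient="index"))

print(dataset)
```

Assigning a Series to a column aligns on the index, which is why the row labels matter here.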

intermediate:

df2.nunique()

algorithm       1
copy_x          1
init            1
max_iter        2
n_clusters      2
n_init          1
random_state    1
tol             1
verbose         1
dtype: int64
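
If the goal is specifically to keep the parameters listed in param_grid, as the question states, a possible alternative is to select those keys directly instead of inferring them from nunique. The following is a minimal sketch under that assumption; the two-row `dataset` is a hypothetical stand-in for the myGridSearch output, trimmed to a few parameters.

```python
import pandas as pd

param_grid = {"n_clusters": range(2, 4), "max_iter": [200, 300]}

# Hypothetical stand-in for two rows of the myGridSearch output.
dataset = pd.DataFrame({
    "params": [
        {"init": "k-means++", "max_iter": 200, "n_clusters": 2, "random_state": 1234},
        {"init": "k-means++", "max_iter": 300, "n_clusters": 2, "random_state": 1234},
    ],
    "num_cluster": [2, 2],
    "silhouette": [0.854996, 0.854996],
})

# Keep only the keys that were searched over, regardless of how many
# distinct values they end up taking in the results.
dataset["params"] = dataset["params"].apply(
    lambda d: {k: d[k] for k in param_grid if k in d}
)

print(dataset["params"].tolist())
```

In this stand-in, n_clusters has a single distinct value, so the nunique-based filter would drop it, while the key-based selection keeps it. Which behaviour is preferable depends on whether "changing" means "listed in param_grid" or "actually varies in the results".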