How to create a for loop that checks appended models

I have a list of models that I iterate through in a for loop to get their performances. I've added catboost to my model list, but when I try to add its best estimator to a dictionary it gives me an error that no other model gives me (TypeError: unhashable type: 'CatBoostRegressor'). I've googled this and can't see a clear way around the error, so I've been trying to add an if statement to my for loop so that, if the model is catboost, its best estimator is not put into a dictionary.

An example of the code I'm running is this:

lgbm = LGBMRegressor(random_state=seed)
lgbm_params = {
    "max_depth": (1, 4),
    "learning_rate": (0.01, 0.2, "log-uniform"),
    "n_estimators": (10, 50),
    "reg_alpha": (1, 10, "log-uniform"),
    "reg_lambda": (1, 10, "log-uniform"),
}

catboost = CatBoostRegressor(random_seed=seed, verbose=False)
cat_params = {
     "iterations": (10, 50),
     'learning_rate': (0.01, 0.2, 'log-uniform'), 
     'depth':  (1, 4), 
}

inner_cv = KFold(n_splits=2, random_state=seed)
outer_cv = KFold(n_splits=2, random_state=seed)

models = []

models.append(("CB", BayesSearchCV(catboost, cat_params, cv=inner_cv, iid=False, n_jobs=1)))
models.append(("LGBM", BayesSearchCV(lgbm, lgbm_params, cv=inner_cv, iid=False, n_jobs=1)))


results = []
names = []
medians =[]
scoring = ['r2', 'neg_mean_squared_error', 'max_error', 'neg_mean_absolute_error',
          'explained_variance','neg_root_mean_squared_error',
           'neg_median_absolute_error'] 

models_dictionary_r2 = {}
models_dictionary_mse = {}

for name, model in models:

    #run nested cross-validation

    nested_cv_results = model_selection.cross_validate(model, X , Y, cv=outer_cv, scoring=scoring, error_score="raise")
    nested_cv_results2 = model_selection.cross_val_score(model, X , Y, cv=outer_cv, scoring='r2', error_score="raise")
    results.append(nested_cv_results2)
    names.append(name)
    medians.append(np.median(nested_cv_results['test_r2']))
    print(name, 'Nested CV results for all scores:', '\n', nested_cv_results, '\n')
    print(name, 'r2 Nested CV Median', np.median(nested_cv_results['test_r2']))
    print(name, 'MSE Nested CV Median', np.median(nested_cv_results['test_neg_mean_squared_error'] ))

    #view best tuned model

    model.fit(X_train, Y_train)
    print("Best Parameters: \n{}\n".format(model.best_params_))
    y_pred_train = model.best_estimator_.predict(X_train)
    y_pred = model.best_estimator_.predict(X_test)
 
    #view shap interpretation of best tuned model

    explainer = shap.TreeExplainer(model.best_estimator_)
    X_importance = pd.DataFrame(data=X_test, columns=df3.columns)
    shap_values = explainer.shap_values(X_importance)
    print(name,'ALL FEATURES Ranked SHAP Importance:', X.columns[np.argsort(np.abs(shap_values).mean(0))[::-1]])
    fig, ax = plt.subplots()
    shap.summary_plot(shap_values, X_importance)
    fig.savefig("shap_summary" + name + ".svg", format='svg', dpi=1200, bbox_inches = "tight")

    #add model's best estimator's best metrics to a dictionary, but ignore this for catboost 

    if model is models[0]: 
        print('catboost best estimator not compatible with entering a dictionary')
    else:
        models_dictionary_r2[model.best_estimator_] = np.median(nested_cv_results['test_r2'])
        models_dictionary_mse[model.best_estimator_] = np.median(nested_cv_results['test_neg_mean_squared_error'])

It's the if statement at the end that I am trying to get working, but I'm not experienced with conditional statements in Python. At the moment this runs but still tries to put the catboost results in the dictionary, and I get the same TypeError: unhashable type: 'CatBoostRegressor'. Is there a way I can code 'if the model is catboost then move on to test the next model, else store the best estimator results in the dictionaries'?

Unfortunately I can't provide my data but it's just 8 features of continuous variables with regression models scoring rows between 0-1.

Edit: I am doing this to then get the top performing model's best estimator so that I can then fit that specific/tuned model to new data.

I pull it out singularly from the dictionary to fit it to new data like this:

top_model = max(models_dictionary_r2, key=models_dictionary_r2.get)

Following the current answer, I now get a list that looks like this:

[(<catboost.core.CatBoostRegressor at 0x7f8d50860400>, 0.8110325480633154),
 (LGBMRegressor(learning_rate=0.14567200981008144, max_depth=3, n_estimators=50,
                random_state=0, reg_alpha=1, reg_lambda=1),
  0.7632660705322947)]

Catboost has the best median r2 in this list, but I'm not sure whether catboost is in the right format for its best estimator to be fit to new data. I tried:

top_model = models_list_predr2[0]
top_model.fit(X_train, Y_train)
AttributeError: 'tuple' object has no attribute 'fit'

How can I pull out the best_estimator_ of the top performing model from this list and be sure this works for catboost?

I'm not experienced in Python, and trying max(models_list_predr2) with the above list also gives the error TypeError: '>' not supported between instances of 'LGBMRegressor' and 'CatBoostRegressor'.

CodePudding user response:

This error happens because every dictionary key must be hashable, which means its type must implement __hash__() for hashing and __eq__() for comparison.
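To make the hashability rule concrete, here is a minimal sketch that does not use catboost at all. The classes WithEqOnly and PlainObject and the helper is_hashable are hypothetical names made up for this illustration. It relies on the fact that, in Python 3, a class that defines __eq__() without also defining __hash__() cannot be used as a dictionary key.

```python
class PlainObject:
    """Inherits the default identity-based __eq__ and __hash__ from object."""


class WithEqOnly:
    """Defines __eq__ but not __hash__, so Python 3 sets __hash__ to None."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, WithEqOnly) and self.value == other.value


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


scores = {}
scores[PlainObject()] = 0.76  # fine: the default hash is available

try:
    scores[WithEqOnly(1)] = 0.81  # fails: the instance is unhashable
    error_message = None
except TypeError as exc:
    # The message mentions "unhashable type", as in the question's traceback.
    error_message = str(exc)

print(is_hashable(PlainObject()), is_hashable(WithEqOnly(1)))
print(error_message)
```

Calling is_hashable(model.best_estimator_) before using an estimator as a key is one way to find out in advance which models would trigger the error.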

CatBoostRegressor objects are not hashable, so you get this exception when you try to add one as a dictionary key. In Python 3, a class that defines __eq__() without also defining __hash__() becomes unhashable.
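As for why the if statement in the question never skips catboost: each element of models is a (name, search) tuple, so inside the loop model is the second item of the tuple, and model is models[0] compares a search object with a tuple, which is always False. A minimal sketch of a check that does work is below. FakeSearch is a hypothetical stand-in for a fitted BayesSearchCV object, and keying the dictionary by the model name is an assumption made for this example, not something from the original code.

```python
class FakeSearch:
    """Hypothetical stand-in for a fitted BayesSearchCV object."""

    def __init__(self, label):
        self.best_estimator_ = "best estimator of " + label


models = [("CB", FakeSearch("CB")), ("LGBM", FakeSearch("LGBM"))]

# The original check compares the search object with the whole tuple.
tuple_matches = [model is models[0] for name, model in models]

# Comparing with the second item of the tuple does identify the first model.
element_matches = [model is models[0][1] for name, model in models]

stored = {}
skipped = []
for name, model in models:
    if name == "CB":
        # Move on to the next model without touching the dictionary.
        skipped.append(name)
        continue
    # Strings are hashable, so the name is a safe dictionary key.
    stored[name] = model.best_estimator_

print(tuple_matches, element_matches)
print(skipped, list(stored))
```

Checking the name is usually easier to read than an identity check, and it keeps working if the order of the models list changes.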

I'd suggest using lists instead of dictionaries for models_dictionary_r2 and models_dictionary_mse.

models_list_r2 = []
models_list_mse = []

and then you can add values to these lists like this:

best_estimator = model.best_estimator_
median_r2 = np.median(nested_cv_results['test_r2'])
models_list_r2.append((best_estimator,  median_r2))

median_mse = np.median(nested_cv_results['test_neg_mean_squared_error'])
models_list_mse.append((model.best_estimator_, median_mse))

To select the model with the highest R-squared, you can add the following code:

best_model, best_r2 = sorted(models_list_r2, key = lambda x: x[1], reverse=True)[0]
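The two errors from the edit to the question both come from treating a (estimator, score) tuple as if it were the estimator. Here is a small sketch of how the tuple can be unpacked before fitting, using max() with a key function as an alternative to sorting. FakeEstimator is a hypothetical stand-in so the example runs without catboost or lightgbm, and the scores are rounded illustrative values.

```python
class FakeEstimator:
    """Hypothetical stand-in for a tuned regressor; it only records fit calls."""

    def __init__(self, label):
        self.label = label
        self.fitted = False

    def fit(self, X, y):
        self.fitted = True
        return self


# Same shape as the list built above: (best_estimator, median_r2) tuples.
models_list_r2 = [
    (FakeEstimator("CB"), 0.81),
    (FakeEstimator("LGBM"), 0.76),
]

# key= makes max() compare only the scores. Without it, Python compares the
# tuples element by element, starting with the estimators, which is what
# raises "'>' not supported" in the question.
best_model, best_r2 = max(models_list_r2, key=lambda pair: pair[1])

# best_model is the estimator itself, not the tuple, so fit() is available.
best_model.fit([[0.0], [1.0]], [0.0, 1.0])

print(best_model.label, best_r2)
```

Unpacking into best_model, best_r2 is the step that avoids the 'tuple' object has no attribute 'fit' error; models_list_r2[0][0] would reach the first estimator in the same way.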