LassoCV getting "axis -1 is out of bounds for array of dimension 0" and other questions

Good evening to all,

I am trying to use LassoCV from sklearn for the first time.

My code is as follows:

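import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']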

numeric_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='median'))
      ,('scaler', MinMaxScaler()) # Rescale the data to the [0, 1] range
])

categorical_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='constant',fill_value='missing'))
      ,('encoder', OneHotEncoder(handle_unknown='ignore')) # Create binary indicator variables for the categorical features
])

preprocessor = ColumnTransformer(
   transformers=[
    ('numeric', numeric_transformer, numeric_features)
   ,('categorical', categorical_transformer, categorical_features)
])

# Creation of the pipeline 

lassocv_piped = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LassoCV())
    ])

# Creation of the grid of parameters

dt_params = {'model__alphas': np.array([0.5])
             }

cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)

lassocv_grid_piped = GridSearchCV(lassocv_piped, dt_params, cv=cv_folds, n_jobs=-1, scoring=['neg_mean_squared_error', 'r2'], refit='r2')  
# Fitting our model

lassocv_grid_piped.fit(df_X_train,df_Y_train.values.ravel())

# Getting our metrics and predictions

Y_pred_lassocv = lassocv_grid_piped.predict(df_X_test)

metrics_lassocv = lassocv_grid_piped.cv_results_
best_lassocv_parameters = lassocv_grid_piped.best_params_


print('Best test negative MSE of the base model : ', max(metrics_lassocv['mean_test_neg_mean_squared_error']))
print('Best test R^2 of the base model : ', max(metrics_lassocv['mean_test_r2']))
print('Best parameters of the base model : ', best_lassocv_parameters)

# Graphical representation

results = pd.DataFrame(dt_params)
for k in range(5):
    results = pd.concat([results,
                         pd.DataFrame(lassocv_grid_piped.cv_results_['split' + str(k) + '_test_neg_mean_squared_error'])],axis=1)
sns.relplot(data=results.melt('model__alphas',value_name='neg_mean_squared_error'),x='model__alphas',y='neg_mean_squared_error',kind='line')

I am still a novice when it comes to using this model. So, I have some questions about the use of this estimator:

Is it useful to use a cv_fold outside the estimator, as I do?
Is it useful to set up a GridSearchCV to test the different alpha values?
How is it possible to extract the R^2 from our model?

Also, I encounter this error:

AxisError: axis -1 is out of bounds for array of dimension 0

Do you have any idea how to solve it?

I wish you a good evening!

Answer:

After a good night's sleep, I was able to overcome some of my problems.

Is it useful to use a cv_fold outside the estimator, as I do?

After studying the documentation of LassoCV a bit, it seems not. So I could remove cv_fold from my code. Instead, I could use the cv argument of LassoCV.
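For illustration, here is a minimal sketch of passing the splitter straight to LassoCV through its cv argument. The synthetic data from make_regression is only an assumption standing in for the real training set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# The same kind of splitter as in the question, handed to LassoCV directly
cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)
model = LassoCV(cv=cv_folds).fit(X, y)

# mse_path_ has one column per fold, so this shows how many folds were used
n_folds_used = model.mse_path_.shape[1]
print(n_folds_used)
```

Passing cv=5 instead would use five folds without shuffling.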

Is it useful to set up a GridSearchCV to test the different alpha values?

I haven't really been able to answer that question yet. It seems that LassoCV does it by itself.
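As far as I can tell from the fitted attributes, LassoCV evaluates every candidate alpha on every fold and keeps the one with the lowest mean squared error. The sketch below checks this on synthetic data; the data and the list of candidate alphas are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the real data (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Hypothetical candidate values; LassoCV tries all of them internally
candidate_alphas = [0.01, 0.1, 0.5, 1.0, 5.0]
model = LassoCV(alphas=candidate_alphas, cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); rows follow the order of alphas_
mean_mse = model.mse_path_.mean(axis=1)
best_index = int(np.argmin(mean_mse))

# The alpha with the lowest mean MSE is the one stored in alpha_
print(model.alphas_[best_index], model.alpha_)
```

So a GridSearchCV over alphas on top of LassoCV would repeat a search the estimator already performs.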

How is it possible to extract the R^2 from our model?

This can be done simply with the .score(X, y) method, which returns R^2 for regression estimators.
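A small sketch to confirm that .score gives the same value as r2_score applied to the predictions. It uses synthetic data and a held-out split, both of which are assumptions for the example; scoring on held-out data usually gives a more honest R^2 than scoring on the training set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LassoCV(cv=5).fit(X_train, y_train)

# R^2 from the estimator's own score method
r2_from_score = model.score(X_test, y_test)

# R^2 computed explicitly from the predictions
r2_from_metric = r2_score(y_test, model.predict(X_test))

print(r2_from_score, r2_from_metric)
```

A Pipeline whose last step is a regressor exposes the same .score method, so this should carry over to the pipeline used here.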

As for my error message, I was able to get rid of it once I removed GridSearchCV.
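A likely explanation for the AxisError, assuming a scikit-learn version in which LassoCV sorts the alphas it receives: GridSearchCV treats each element of the 'model__alphas' array as one candidate, so LassoCV is given the scalar 0.5 instead of a list of alphas. The sketch below reproduces the message with numpy alone and shows a grid whose single candidate is a whole list. Newer scikit-learn versions may raise a different validation error instead.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# The grid from the question: each element of the array is one candidate
grid = {'model__alphas': np.array([0.5])}
candidate = list(ParameterGrid(grid))[0]['model__alphas']
print(np.ndim(candidate))  # dimension of the value LassoCV would receive

# Sorting a 0-dimensional value fails with the message from the question
try:
    np.sort(candidate)
except ValueError as exc:  # numpy's AxisError is a subclass of ValueError
    error_message = str(exc)
    print(error_message)

# Wrapping the list once more makes the candidate a full list of alphas
fixed_grid = {'model__alphas': [[0.5]]}
fixed_candidate = list(ParameterGrid(fixed_grid))[0]['model__alphas']
print(fixed_candidate)
```

With the second form, GridSearchCV would pass alphas=[0.5] to LassoCV, which is the array-like input it expects.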

Here's my final code:

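from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']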
    
numeric_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='median'))
      ,('scaler', MinMaxScaler()) # Rescale the data to the [0, 1] range
])

categorical_transformer = Pipeline(steps=[
       ('imputer', SimpleImputer(strategy='constant',fill_value='missing'))       
      ,('encoder', OneHotEncoder(handle_unknown='ignore')) # Create binary indicator variables for the categorical features
])

preprocessor = ColumnTransformer(
   transformers=[
    ('numeric', numeric_transformer, numeric_features)
   ,('categorical', categorical_transformer, categorical_features)
]) 

# Creation of the pipeline
list_metrics_lassocv = []
list_best_lassocv_parameters = []

for i in range(1, 12):
    lassocv_piped = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LassoCV(cv=5, n_alphas=i, random_state=0))
    ])

    # Fitting our model
    lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())

    # Getting our metrics and predictions
    Y_pred_lassocv = lassocv_piped.predict(df_X_test)

    # R^2 on the training data and the alpha selected by cross-validation
    metrics_lassocv = lassocv_piped.score(df_X_train, df_Y_train.values.ravel())
    best_lassocv_parameters = lassocv_piped['model'].alpha_

    list_metrics_lassocv.append(metrics_lassocv)
    list_best_lassocv_parameters.append(best_lassocv_parameters)

Do not hesitate to correct me if you see an imprecision or an error.