Good evening to all,
I am trying to implement LassoCV with sklearn for the first time.
My code is as follows:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
    ,('scaler', MinMaxScaler()) # Scaling the data
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant',fill_value='missing'))
    ,('encoder', OneHotEncoder(handle_unknown='ignore')) # Creating binary variables for the categorical features
])
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features)
        ,('categorical', categorical_transformer, categorical_features)
    ])
# Creation of the pipeline
lassocv_piped = Pipeline([
('preprocessor', preprocessor),
('model', LassoCV())
])
# Creation of the grid of parameters
dt_params = {'model__alphas': np.array([0.5])
}
cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)
lassocv_grid_piped = GridSearchCV(lassocv_piped, dt_params, cv=cv_folds, n_jobs=-1, scoring=['neg_mean_squared_error', 'r2'], refit='r2')
# Fitting our model
lassocv_grid_piped.fit(df_X_train,df_Y_train.values.ravel())
# Getting our metrics and predictions
Y_pred_lassocv = lassocv_grid_piped.predict(df_X_test)
metrics_lassocv = lassocv_grid_piped.cv_results_
best_lassocv_parameters = lassocv_grid_piped.best_params_
print('Best test negative MSE of the base model : ', max(metrics_lassocv['mean_test_neg_mean_squared_error']))
print('Best test R^2 of the base model : ', max(metrics_lassocv['mean_test_r2']))
print('Best parameters of the base model : ', best_lassocv_parameters)
# Graphical representation
results = pd.DataFrame(dt_params)
for k in range(5):
    results = pd.concat([results,
                         pd.DataFrame(lassocv_grid_piped.cv_results_['split' + str(k) + '_test_neg_mean_squared_error'])], axis=1)
sns.relplot(data=results.melt('model__alphas', value_name='neg_mean_squared_error'), x='model__alphas', y='neg_mean_squared_error', kind='line')
I am still a novice when it comes to using this model. So, I have some questions about the use of this estimator:
Is it useful to use a cv_fold outside the estimator, as I do?
Is it useful to set up a GridSearchCV to test the different alpha values?
How is it possible to extract the R^2 from our model?
Also, I encounter this error:
AxisError: axis -1 is out of bounds for array of dimension 0
Do you have any idea how to solve it?
I wish you a good evening!
Answer:
After a good night's sleep, I was able to overcome some of my problems.
Is it useful to use a cv_fold outside the estimator, as I do?
After studying the documentation of LassoCV a bit, it seems not. So I could remove cv_fold from my code and use the cv argument of LassoCV instead.
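A minimal sketch of that idea: the same KFold splitter can be handed to LassoCV directly through its cv argument. The synthetic data from make_regression and the alpha grid are assumptions standing in for the real dataframe; they are not values from the original post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Synthetic stand-in for the real training data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# The splitter that was previously given to GridSearchCV.
cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative grid of candidate alphas (assumption).
alphas = np.logspace(-3, 1, 20)

# LassoCV runs the cross-validation itself, using the splitter passed as cv.
model = LassoCV(alphas=alphas, cv=cv_folds).fit(X, y)

print('Selected alpha:', model.alpha_)
print('Shape of the per-fold MSE path:', model.mse_path_.shape)
```

Passing a splitter object rather than cv=5 keeps the shuffling and the random_state of the folds under control.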
Is it useful to set up a GridSearchCV to test the different alpha values?
I haven't really been able to answer that question yet. It seems that LassoCV does it by itself.
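One way to check this, sketched here on synthetic data (an assumption, as is the alpha grid), is to look at the attributes LassoCV exposes after fitting: alphas_ holds the candidate values, mse_path_ the test MSE for each alpha and fold, and alpha_ the value it kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the real training data (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Illustrative grid of candidate alphas (assumption).
alphas = np.logspace(-3, 1, 20)
model = LassoCV(alphas=alphas, cv=5).fit(X, y)

# Average the test MSE over the folds for every candidate alpha.
mean_mse = model.mse_path_.mean(axis=1)

# The alpha with the lowest mean MSE should be the one LassoCV selected.
best_from_path = model.alphas_[np.argmin(mean_mse)]

print('alpha_ reported by LassoCV:', model.alpha_)
print('alpha with the lowest mean CV MSE:', best_from_path)
```

If the two printed values agree, the search over alpha has already been done inside LassoCV, which would make an outer GridSearchCV over alphas redundant for this purpose.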
How is it possible to extract the R^2 from our model?
This can be done simply with the method .score(X, y).
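As a small sketch on synthetic data (an assumption, not the real dataframe): for regressors, .score returns the coefficient of determination, so it should match r2_score applied to the predictions. Scoring on a held-out test set gives an out-of-sample R^2, whereas scoring on the training data tends to be optimistic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LassoCV(cv=5).fit(X_train, y_train)

# R^2 on the data used for fitting and on held-out data.
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)

# The same test-set quantity computed explicitly from the predictions.
r2_test_explicit = r2_score(y_test, model.predict(X_test))

print('Train R^2:', r2_train)
print('Test R^2:', r2_test)
```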
As for my error message, I was able to get rid of it once I removed GridSearchCV.
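A likely explanation, offered as an assumption rather than a confirmed diagnosis: a parameter grid treats each value as a list of candidates, so np.array([0.5]) is unpacked and LassoCV receives the scalar 0.5 as alphas instead of an array. Sorting a 0-dimensional array then fails with the same message. The sketch below shows the unpacking with ParameterGrid and the NumPy error in isolation; the exact exception raised by LassoCV may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Grid as written in the question: the array itself is the list of candidates.
bad_grid = {'model__alphas': np.array([0.5])}
bad_candidate = list(ParameterGrid(bad_grid))[0]['model__alphas']

# Wrapping the array in a list makes the whole array a single candidate.
good_grid = {'model__alphas': [np.array([0.5])]}
good_candidate = list(ParameterGrid(good_grid))[0]['model__alphas']

print('Dimensions of the candidate, grid from the question:', np.ndim(bad_candidate))
print('Dimensions of the candidate, wrapped grid:', np.ndim(good_candidate))

# Sorting a 0-dimensional array reproduces the reported message.
try:
    np.sort(np.asarray(bad_candidate))
    raised = False
except ValueError as exc:  # AxisError is a subclass of ValueError
    raised = True
    print('Error:', exc)
```

Under that assumption, keeping GridSearchCV would require wrapping each array of alphas in a list, although LassoCV already searches over alpha on its own.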
Here's my final code:
numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
    ,('scaler', MinMaxScaler()) # Scaling the data
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant',fill_value='missing'))
    ,('encoder', OneHotEncoder(handle_unknown='ignore')) # Creating binary variables for the categorical features
])
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features)
        ,('categorical', categorical_transformer, categorical_features)
    ])
# Creation of the pipeline
list_metrics_lassocv = []
list_best_lassocv_parameters = []
for i in range(1, 12):
    lassocv_piped = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LassoCV(cv=5, n_alphas=i, random_state=0))
    ])
    # Fitting our model
    lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())
    # Getting our metrics and predictions
    Y_pred_lassocv = lassocv_piped.predict(df_X_test)
    # R^2 computed on the training data
    metrics_lassocv = lassocv_piped.score(df_X_train, df_Y_train.values.ravel())
    best_lassocv_parameters = lassocv_piped['model'].alpha_
    list_metrics_lassocv.append(metrics_lassocv)
    list_best_lassocv_parameters.append(best_lassocv_parameters)
Do not hesitate to correct me if you see an imprecision or an error.