Getting different values for alpha using RidgeCV with normalize=True and using Pipeline


I am trying to determine which alpha is the best in a Ridge Regression with scoring = 'neg_mean_squared_error'.

I have an array of 100 candidate values for alpha, ranging from 5e+09 down to 5e-03:

array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
       1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
       5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
       1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
       5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
       1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
       6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
       2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
       6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
       2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
       7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
       2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
       7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
       2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
       8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
       2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
       8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
       2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
       9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
       3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
       1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
       3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
       1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
       3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
       1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
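For reference, a grid like this can be generated with NumPy; something like the following reproduces the 100 log-spaced values above:

import numpy as np

# 100 log-spaced candidate alphas from 5e+09 down to 5e-03
alphas = 5 * np.logspace(9, -3, 100)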

Then, I used RidgeCV to try and determine which of these values would be best:

ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', 
                  normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_

and I got ridgecv.alpha_ = 0.006609705742330144

However, I received a warning that normalize=True is deprecated and will be removed in version 1.2. The warning advised me to use a Pipeline with a StandardScaler instead. So, following the instructions for building a Pipeline, I did:

steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model',RidgeCV(alphas=alphas, scoring = 'neg_mean_squared_error', cv=KFold(10)))
]

ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)

y_pred = ridge_pipe2.predict(X_test)

ridge_pipe2.named_steps.model.alpha_

Doing this way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342

As a final check, I also used GridSearchCV as follows:

steps = [
    ('scalar', StandardScaler()),
    ('model',Ridge())
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)

parameters = [{'model__alpha':alphas}]


grid_search = GridSearchCV(estimator = ridge_pipe,
                          param_grid = parameters,
                          scoring = 'neg_mean_squared_error',
                          cv = 10,
                          n_jobs = -1)

grid_search = grid_search.fit(X_train, y_train) 
grid_search.best_params_['model__alpha']

where I got grid_search.best_params_['model__alpha'] = 1.328043891473342 (the same value as with the Pipeline approach).

Hence my question: why does normalizing my dataset with normalize=True versus scaling it with StandardScaler() yield different best alpha values?

CodePudding user response:

You need to ensure that the same cross-validation scheme is used and that the data is scaled without centering.

When you run with normalize=True, you get this as part of the warning:

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Ridge())

Regarding the cv, if you check the documentation, RidgeCV by default performs leave-one-out cross-validation:

Ridge regression with built-in cross-validation.
See glossary entry for cross-validation estimator.
By default, it performs efficient Leave-One-Out Cross-Validation.

So to get the same result, we can explicitly define the cross-validation scheme to use:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold

# Fix the cross-validation scheme so both approaches use the same folds
kf = KFold(10)

X_train, y_train = datasets.make_regression()

alphas = [0.001, 0.005, 0.01, 0.05, 0.1]

ridgecv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error',
                  normalize=True, cv=kf)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.001

And use it in the pipeline:

steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=kf))
]

ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)

ridge_pipe2.named_steps.model.alpha_
0.001
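Conversely, if cv is left at its default in either estimator, RidgeCV falls back to efficient leave-one-out cross-validation and may select a different alpha. A minimal sketch, reusing alphas and the data from above:

# Default cv=None: RidgeCV uses efficient leave-one-out CV,
# which may pick a different alpha than the 10-fold grid above
ridgecv_loo = RidgeCV(alphas=alphas)
ridgecv_loo.fit(X_train, y_train)
ridgecv_loo.alpha_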

CodePudding user response:

The corresponding warning message for ordinary Ridge makes an additional mention:

Set parameter alpha to: original_alpha * n_samples.

(I don't entirely understand why this is, but for now I'm willing to leave it at that. A similar note should probably be added to the warning for RidgeCV.) Changing your alphas parameter in the second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but after rescaling it back with ridge_pipe2.named_steps.model.alpha_ / X.shape[0], I retrieve the same value as in the first approach (as well as the same rescaled coefficients), as the sketch below shows.
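A minimal sketch of that rescaling (assuming X_train, y_train, and the alphas grid from the question are in scope):

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold

n_samples = X_train.shape[0]

# Scale each candidate alpha by n_samples, as the Ridge warning suggests
scaled_alphas = [alph * n_samples for alph in alphas]

pipe = Pipeline([
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=scaled_alphas,
                      scoring='neg_mean_squared_error',
                      cv=KFold(10)))
])
pipe.fit(X_train, y_train)

# Dividing the selected alpha back by n_samples should recover the
# value chosen by the deprecated normalize=True approach
pipe.named_steps.model.alpha_ / n_samples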

(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
