sklearn cross_val_score() returning same MSE for different degree polynomials


The title says it all; below is code that fully reproduces my example:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

np.random.seed(1)

url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()

X, y = data['horsepower'], data.mpg

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

lm = LinearRegression()

for i in range(1, 6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train.values.reshape(-1, 1))
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test.values.reshape(-1, 1), y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)

    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')

Output:

Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 25.214173196098535
Degree-3 polynomial, MSE: 25.214173196098535
Degree-4 polynomial, MSE: 25.214173196098535
Degree-5 polynomial, MSE: 25.214173196098535

I would expect to get a different MSE for each polynomial degree.

----------EDIT-----------

I added the following line inside the for loop above:

X_test_poly = poly.fit_transform(X_test)

And changed the cross_val_score call to use X_test_poly instead of X_test. Now I get a different MSE for each degree.

(Below is the reworked, fully reproducible code.)

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

np.random.seed(1)

url = 'https://raw.githubusercontent.com/selva86/datasets/master/Auto.csv'
data = pd.read_csv(url, na_values='?').dropna()

X, y = data['horsepower'].values.reshape(-1, 1), data.mpg.values.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

lm = LinearRegression()

for i in range(1, 6):
    poly = PolynomialFeatures(degree=i)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.fit_transform(X_test)
    model = lm.fit(X_train_poly, y_train)
    scores_poly = cross_val_score(model, X_test_poly, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)

    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')

Output:

Degree-1 polynomial, MSE: 25.214173196098535
Degree-2 polynomial, MSE: 18.678068281482577
Degree-3 polynomial, MSE: 18.85993623145088
Degree-4 polynomial, MSE: 19.085567664927655
Degree-5 polynomial, MSE: 18.79973878463155

I assume this is the correct approach, but I don't understand why. Could anyone clarify?

CodePudding user response:

On every fold, cross_val_score clones the estimator you pass in and refits it on that fold's training portion; the coefficients from your earlier fit are discarded. So in your first code, even though you fitted the model with polynomial features (model = lm.fit(X_train_poly, y_train)), every iteration inside cross_val_score refits a plain linear model on X_test, which contains only the degree-one feature. Regardless of i, you are therefore always scoring the same degree-one linear model.
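You can verify that the prior fit is ignored with a minimal sketch (the synthetic X_demo/y_demo data below is made up purely to keep the example self-contained):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.rand(30, 1)                    # hypothetical single feature
y_demo = 3 * X_demo[:, 0] + rng.rand(30)

prefit = LinearRegression().fit(X_demo ** 2, y_demo)  # fitted on *different* features
fresh = LinearRegression()

# cross_val_score clones the estimator and refits it on each fold,
# so the earlier fit of `prefit` is simply discarded:
s1 = cross_val_score(prefit, X_demo, y_demo, scoring='neg_mean_squared_error', cv=5)
s2 = cross_val_score(fresh, X_demo, y_demo, scoring='neg_mean_squared_error', cv=5)
print(np.allclose(s1, s2))                  # True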

You can inspect what PolynomialFeatures actually produces (see also the documentation):

X, y = data[['horsepower']], data['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

poly = PolynomialFeatures(degree=3)
poly.fit_transform(X_train)[:10]
 
array([[1.000000e+00, 1.700000e+02, 2.890000e+04, 4.913000e+06],
       [1.000000e+00, 5.400000e+01, 2.916000e+03, 1.574640e+05],
       [1.000000e+00, 6.500000e+01, 4.225000e+03, 2.746250e+05],
       [1.000000e+00, 1.100000e+02, 1.210000e+04, 1.331000e+06],
       [1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
       [1.000000e+00, 1.500000e+02, 2.250000e+04, 3.375000e+06],
       [1.000000e+00, 1.980000e+02, 3.920400e+04, 7.762392e+06],
       [1.000000e+00, 8.800000e+01, 7.744000e+03, 6.814720e+05],
       [1.000000e+00, 1.400000e+02, 1.960000e+04, 2.744000e+06],
       [1.000000e+00, 1.600000e+02, 2.560000e+04, 4.096000e+06]])

The first column is the intercept term (all ones), the second column is X_train itself (horsepower), followed by the square of horsepower and then the cube of horsepower. LinearRegression() fits this matrix as-is; it does not apply any transformation to the input for you.
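You can confirm this column layout with get_feature_names_out() (available as of scikit-learn 1.0; earlier versions use get_feature_names() instead):

poly = PolynomialFeatures(degree=3)
poly.fit(X_train)
print(poly.get_feature_names_out())
# ['1' 'horsepower' 'horsepower^2' 'horsepower^3']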

Also note that you most likely want to fit the linear model with LinearRegression(fit_intercept=False), since the intercept column is already included in the polynomial transformation.
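Equivalently, you can drop the bias column from the transformation and keep the model's default intercept; a quick sketch of both options, which yield the same fitted curve:

# Option A: keep the bias column, disable the model's own intercept
poly_a = PolynomialFeatures(degree=2)                      # includes the all-ones column
lm_a = LinearRegression(fit_intercept=False)

# Option B: drop the bias column, let LinearRegression fit the intercept
poly_b = PolynomialFeatures(degree=2, include_bias=False)
lm_b = LinearRegression()                                  # fit_intercept=True is the default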

What you should consider doing is using a Pipeline, as shown in the sklearn documentation. Because the whole pipeline is cloned and refitted on each fold, the PolynomialFeatures step is reapplied with the right degree every time:

from sklearn.pipeline import Pipeline

for i in range(1, 6):
    model = Pipeline([('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression(fit_intercept=False))])
    scores_poly = cross_val_score(model, X_test, y_test,
                                  scoring='neg_mean_squared_error',
                                  cv=len(X_test), n_jobs=-1)

    print(f'Degree-{i} polynomial, MSE: {abs(scores_poly).mean()}')

This gives:

Degree-1 polynomial, MSE: 22.2130277556231
Degree-2 polynomial, MSE: 17.01069056186582
Degree-3 polynomial, MSE: 17.283994438261868
Degree-4 polynomial, MSE: 17.401530563158406
Degree-5 polynomial, MSE: 15.934755858114803
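As an aside, cv=len(X_test) is leave-one-out cross validation, which is slow, and since the pipeline refits the transformation inside each fold there is no need to restrict scoring to the test half. A sketch using 10-fold CV on the full X and y defined above (output not shown here):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

cv = KFold(n_splits=10, shuffle=True, random_state=1)
for i in range(1, 6):
    model = Pipeline([('poly', PolynomialFeatures(degree=i)),
                      ('linear', LinearRegression(fit_intercept=False))])
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv)
    print(f'Degree-{i} polynomial, MSE: {abs(scores).mean()}')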