Python Sklearn - Adding PolynomialFeatures to my LinearRegression Model. Is this correct?

I have a LinearRegression model which does a moderate job at predicting total applications received for college courses prior to them starting.

I felt the accuracy could be improved by adding PolynomialFeatures, as the rate of applications per day often decreases as we approach the start date.

My data is processed into an array in which I am trying to predict the '0' column. Here is an example with some fake data. The column headers are negative numbers that represent the days until the course starts.

+------+------+------+-----+----+----+----+
| -300 | -299 | -298 | ... | -2 | -1 | 0  |
+------+------+------+-----+----+----+----+
|    5 |    5 |    6 | ... | 45 | 46 | 49 |
|    1 |    2 |    2 | ... | 51 | 51 | 52 |
+------+------+------+-----+----+----+----+

My original code looks like this:

        import numpy as np
        import sklearn.model_selection
        from sklearn import linear_model

        # Define X and y for model training
        X = np.array(df.drop(columns='0'))
        y = np.array(df['0'])
        
        # Train model with data test train split
        x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.2)

        # Invoke linear regression model
        linear = linear_model.LinearRegression()

        # Fit the model
        linear.fit(x_train, y_train)

        # Record the R^2 score on the test set
        accuracy = linear.score(x_test, y_test)

Following an online tutorial, I have updated my code to look like this:

        # Define X and y for model training
        X = np.array(df.drop(columns='0'))
        y = np.array(df['0'])

        poly_reg = PolynomialFeatures(degree=2)
        X_poly = poly_reg.fit_transform(X)
        poly_reg.fit(X_poly,y)

        # Train model with data test train split
        x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X_poly, y, test_size = 0.2)

        # Invoke linear regression model
        linear = linear_model.LinearRegression()

        # Fit the model
        linear.fit(x_train, y_train)

        # Record accuracy score
        accuracy = linear.score(x_test, y_test)

I save the model with pickle, then load it and call predict. Here X is the array of data I want to feed into the model.

        # Convert data to an array
        X = np.array(df)

        # Calculate prediction
        prediction = linear.predict(X)

The issue is that I am not sure I am doing this correctly, or whether the input data should be passed to .predict in this way.

Any comments would be welcomed.

Answer:

You cannot pass your raw features X straight to the model: it was trained on the polynomial features, so new data must first be transformed by the same fitted PolynomialFeatures object.
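To make that concrete, here is a minimal sketch of the manual route, using synthetic data in place of the real table (the array shapes and the names `X_new`, `X_train_poly` and `raw_predict_failed` are assumptions for illustration). The transformer is fitted once on the raw training features, and the same fitted object transforms any new data before `predict`. Note that the extra `poly_reg.fit(X_poly, y)` call in the question is not needed; it refits the transformer on the already-expanded matrix, so a later `poly_reg.transform` on raw features would fail with a shape mismatch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: 50 rows, 3 raw features (not the real application data)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train[:, 0] ** 2 + X_train[:, 1] + rng.normal(scale=0.1, size=50)
X_new = rng.normal(size=(5, 3))

# Fit the transformer once, on the raw training features only
poly_reg = PolynomialFeatures(degree=2)
X_train_poly = poly_reg.fit_transform(X_train)

# Train the regression on the expanded features
linear = LinearRegression()
linear.fit(X_train_poly, y_train)

# New data must go through the same fitted transformer before predict
X_new_poly = poly_reg.transform(X_new)
prediction = linear.predict(X_new_poly)

# Passing the raw features fails: the model was trained on 10 columns, not 3
try:
    linear.predict(X_new)
    raw_predict_failed = False
except ValueError:
    raw_predict_failed = True
```

With this approach, both `poly_reg` and `linear` would have to be pickled and loaded together, since prediction needs the fitted transformer as well as the regression.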

However, a scikit-learn pipeline can combine the PolynomialFeatures and LinearRegression steps into a single estimator. You can then pass your raw features X directly to the model, and the pipeline applies the polynomial transformation before the regression.

As follows:

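        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        # Synthetic data standing in for the real features and target
        X, y = make_regression()

        # The pipeline expands X into degree-2 polynomial features, then fits the regression
        model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
        model.fit(X, y)

        # R^2 on the training data; the raw X is passed in directly
        model.score(X, y)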
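
Building on the example above, the following sketch adapts the pipeline to the workflow in the question: a train/test split, a pickle round trip, and a prediction from raw features. The data is synthetic, and the sizes, noise level and `random_state` values are arbitrary assumptions; `pickle.dumps` and `pickle.loads` stand in for saving to and loading from a file.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data standing in for the real table (assumption: 5 raw features)
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Split the raw features; the pipeline handles the polynomial expansion itself
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x_train, y_train)

# R^2 on held-out rows, rather than on the rows used for fitting
r2 = model.score(x_test, y_test)

# Pickling the pipeline stores the fitted transformer and regression together
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline accepts raw features directly
prediction = restored.predict(x_test)
```

Because the transformer lives inside the pipeline, there is only one object to save, and the data passed to `predict` has the same raw columns as the data passed to `fit`.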