Python Sklearn - Adding PolynomialFeatures to my LinearRegression Model. Is this correct?


I have a LinearRegression model which does a moderate job of predicting the total applications received for college courses before they start.

I felt the accuracy could be improved by adding PolynomialFeatures, as the rate of applications per day often decreases as we approach the start date.

My data is processed into an array in which I am trying to predict the '0' column. Here is an example with some fake data. The feature columns are labelled with negative numbers that represent days until the course starts.

| -300 | -299 | -298 | ... | -2 | -1 | 0  |
|------|------|------|-----|----|----|----|
|    5 |    5 |    6 | ... | 45 | 46 | 49 |
|    1 |    2 |    2 | ... | 51 | 51 | 52 |
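For reference, a toy DataFrame matching the table above can be built like this. It is a sketch with synthetic numbers; the column naming (string labels '-300' through '0', with '0' as the target) is an assumption inferred from the table and the `df.drop(columns='0')` call below:

```python
import numpy as np
import pandas as pd

# Feature columns are days until the course starts ('-300' .. '-1'),
# and column '0' holds the final application total on the start date.
days = [str(d) for d in range(-300, 1)]  # '-300', ..., '-1', '0'

# Two fake courses: cumulative sums give monotonically growing totals
rng = np.random.default_rng(seed=42)
rows = np.cumsum(rng.integers(0, 2, size=(2, len(days))), axis=1)
df = pd.DataFrame(rows, columns=days)

X = np.array(df.drop(columns='0'))  # features: running totals before day 0
y = np.array(df['0'])               # target: total on day 0
```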

My original code looks like this:

        # Imports needed for this snippet
        import numpy as np
        import sklearn.model_selection
        from sklearn import linear_model

        # Define X and y for model training
        X = np.array(df.drop(columns='0'))
        y = np.array(df['0'])
        
        # Train model with data test train split
        x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.2)

        # Invoke linear regression model
        linear = linear_model.LinearRegression()

        # Fit the model
        linear.fit(x_train, y_train)

        # Record accuracy score
        accuracy = linear.score(x_test, y_test)

Following an online tutorial, I have updated my code to look like this:

        # Define X and y for model training
        X = np.array(df.drop(columns='0'))
        y = np.array(df['0'])

        poly_reg = PolynomialFeatures(degree=2)
        X_poly = poly_reg.fit_transform(X)
        poly_reg.fit(X_poly,y)

        # Train model with data test train split
        x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X_poly, y, test_size = 0.2)

        # Invoke linear regression model
        linear = linear_model.LinearRegression()

        # Fit the model
        linear.fit(x_train, y_train)

        # Record accuracy score
        accuracy = linear.score(x_test, y_test)

I save the model with pickle, then load it and call predict. X here is the array of data I am looking to feed into the model.

        # Convert data to an array
        X = np.array(df)

        # Calculate prediction
        prediction = linear.predict(X)

The issue is that I am not sure I am doing this correctly, or whether the input data should be passed to .predict in this way.

Any comments would be welcomed.

CodePudding user response:

You cannot simply pass your raw features X to the model at prediction time, as they need to be transformed into polynomial features first.

However, by using a scikit-learn Pipeline, you can combine the PolynomialFeatures and LinearRegression steps into a single estimator. With this solution you will be able to pass your raw features X directly to the model.

As follows:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

from sklearn.datasets import make_regression

X, y = make_regression()

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

model.score(X, y)
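This also fits your pickle workflow: because the fitted transformer travels inside the pipeline, you can pickle the whole pipeline, reload it, and call predict on raw, untransformed features. A sketch using synthetic data from make_regression (the sample counts and feature sizes here are arbitrary):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=50, n_features=3, random_state=0)

# Fit the combined transform + regression pipeline
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Persist and reload the whole pipeline, not just the regressor
blob = pickle.dumps(model)
reloaded = pickle.loads(blob)

# predict accepts raw features; the pipeline applies
# PolynomialFeatures internally before the linear step
preds = reloaded.predict(X)
```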