I have a LinearRegression model which does a moderate job at predicting total applications received for college courses prior to them starting.
I felt the accuracy could be improved by adding PolynomialFeatures, as the rate of applications per day often decreases as we approach the start date.
My data is processed into an array in which I am trying to predict the '0' value. Here is an example with some fake data. The feature columns are labelled with negative numbers that represent days until the course starts.
------ ------ ------ ----- ---- ---- ----
| -300 | -299 | -298 | ... | -2 | -1 | 0 |
------ ------ ------ ----- ---- ---- ----
| 5 | 5 | 6 | ... | 45 | 46 | 49 |
| 1 | 2 | 2 | ... | 51 | 51 | 52 |
------ ------ ------ ----- ---- ---- ----
My original code looks like this:
# Define X and y for model training
X = np.array(df.drop(columns='0'))
y = np.array(df['0'])
# Train model with data test train split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.2)
# Invoke linear regression model
linear = linear_model.LinearRegression()
# Fit the model
linear.fit(x_train, y_train)
# Record accuracy score
accuracy = linear.score(x_test, y_test)
Following an online tutorial, I have updated my code to look like this:
# Define X and y for model training
X = np.array(df.drop(columns='0'))
y = np.array(df['0'])
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly,y)
# Train model with data test train split
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X_poly, y, test_size = 0.2)
# Invoke linear regression model
linear = linear_model.LinearRegression()
# Fit the model
linear.fit(x_train, y_train)
# Record accuracy score
accuracy = linear.score(x_test, y_test)
I save the model with pickle, then load it and call predict. Here X is the array of data I want to feed into the model.
# Convert data to an array
X = np.array(df)
# Calculate prediction
prediction = linear.predict(X)
The issue is that I am not sure whether I am doing this correctly, or whether the input data should be passed to .predict in this way.
Any comments would be welcomed.
CodePudding user response:
You cannot simply pass your features X to the model, as they need to be transformed into polynomial features first.
However, by using a scikit-learn pipeline, you can combine the PolynomialFeatures and LinearRegression steps into a single estimator. With this solution you will be able to pass your raw features X directly to the model.
For example:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
X, y = make_regression()
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
model.score(X, y)
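To connect this back to the question's save, load and predict workflow, here is a minimal sketch on synthetic data. The sample sizes, `random_state` values and variable names are assumptions chosen for illustration, not taken from the question. It fits the pipeline on raw features, round-trips it through pickle, and checks that the predictions match the manual route of keeping the fitted PolynomialFeatures object and calling `transform` before `predict`.

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the application counts (assumed shape, not real data).
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on raw features; the polynomial expansion happens inside the pipeline.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x_train, y_train)
r2 = model.score(x_test, y_test)  # R^2 on held-out data, not an accuracy

# Pickling the pipeline stores the transformer and the regressor together.
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline accepts untransformed features directly.
pipeline_pred = restored.predict(x_test)

# Manual equivalent: fit the transformer once on raw training features,
# then reuse that same fitted object to transform anything passed to predict.
poly = PolynomialFeatures(degree=2)
linear = LinearRegression().fit(poly.fit_transform(x_train), y_train)
manual_pred = linear.predict(poly.transform(x_test))
```

Both routes should give the same predictions. The manual route only works if the transformer was fitted on the raw features: refitting it on the already expanded array, as `poly_reg.fit(X_poly, y)` does in the question, leaves it expecting the expanded column count, so it can no longer transform new raw input. With the pipeline there is only one object to pickle, which avoids that mistake.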