Good evening everyone. I am new to Python and am trying to learn by reproducing a model I have in Excel.
I need to replicate the "TREND" function to fit a small linear model between two extreme points, say
A = (1, 0.15), B = (5, 0.2)
and then predict for a given x value (say 4.2).
For this code I need to fit one model for each row of my dataset. The x values are always x_1 = 1 and x_2 = 5, while the y values differ from row to row.
I tried using LinearRegression() and model.predict from the sklearn.linear_model package this way:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = {'New_x':[5, 2.1, 4.5, 3.0],
'X1':[1, 1, 1, 1],
'X2':[5, 5, 5, 5],
'Y1':[0.15, 0.7, 1.35, 0.2],
'Y2':[0.2, 0.85, 1.55, 0.4]}
df=pd.DataFrame(data,index=["1","2","3","4"])
model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
prediction=model.predict(df["New_x"].values.reshape(-1,1))
But I'm getting this error
ValueError Traceback (most recent call last)
<ipython-input-88-da83cb57bf4a> in <module>()
18
19 model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
---> 20 prediction=model.predict(df["New_x"].values.reshape(-1,1))
21
22 #model = LinearRegression().fit(SEC_ERBA_sample[["Vertex1","Vertex2"]], SEC_ERBA_sample[["SENIOR_1Y","SENIOR_5Y"]])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
ValueError: shapes (4,1) and (2,2) not aligned: 1 (dim 1) != 2 (dim 0)
So I presume that LinearRegression().fit is fitting a single model across the whole columns, treating each row as one sample. Is there a way to fit and predict a separate linear regression for each row?
CodePudding user response:
I think this is a simple code typo, but it may be rooted in a deeper conceptual problem, so I'll try to give you a broader answer.
The fit method of an sklearn estimator (here LinearRegression.fit) trains an ML model by associating a set of features X with a set of ground-truth values y. In your example, you are training two multi-variable regression models that estimate the Y1 and Y2 variables from X1 and X2:
model = LinearRegression().fit(df[["X1","X2"]], df[["Y1","Y2"]])
So the model learns to estimate these two variables taking two other variables into consideration.
At prediction time, the model requires exactly the same variables (X1 and X2) to be able to predict the values of interest:
predictions = model.predict(df[["New_x1", "New_x2"]])
If the New_x2 information is not available at test (predict) time, then you either have to estimate it as well or remove it from training altogether.
A simple abstract example: if a model was trained to estimate your preferred t-shirt size from your height and weight, you need to know both height and weight during test (predict) time to obtain the correct size estimation.
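To make the shape requirement concrete, here is a minimal NumPy-only sketch of the matrix product that appears in the traceback (X times coef_.T plus the intercept). The weights are hypothetical placeholders (zeros), chosen only so that their shapes match a model with two features and two targets; the names decision_function, one_feature and two_features are made up for this illustration and are not part of sklearn.

```python
import numpy as np

# Hypothetical placeholder weights: 2 targets (Y1, Y2) x 2 features (X1, X2).
# Only the shapes matter here, not the values.
coef = np.zeros((2, 2))
intercept = np.zeros(2)

def decision_function(X):
    # Same product as in the traceback: X @ coef.T + intercept
    return np.dot(X, coef.T) + intercept

one_feature = np.array([[5.0], [2.1], [4.5], [3.0]])  # shape (4, 1), like New_x
two_features = np.ones((4, 2))                        # shape (4, 2), two columns

try:
    decision_function(one_feature)
    mismatch_raised = False
except ValueError:
    # (4, 1) cannot be multiplied by (2, 2): 1 feature given, 2 expected
    mismatch_raised = True

ok_shape = decision_function(two_features).shape  # one prediction per target
```

With a single input column the product fails in the same way as in the question, while an input with two columns yields one prediction per row and per target.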
CodePudding user response:
I found a solution using iterrows(). It is still incomplete because I can't save the output, but I think I will open a separate, more focused question for that.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
data = {'New_x':[5, 2.1, 4.5, 3.0],
'X1':[1., 1, 1, 1],
'X2':[5., 5, 5, 5],
'Y1':[0.15, 0.7, 1.35, 0.2],
'Y2':[0.2, 0.85, 1.55, 0.4]}
df=pd.DataFrame(data,index=["1","2","3","4"])
This final piece iterates the linear regression over the rows. Using iterrows() is generally discouraged, since many operations can be done in other ways (including vectorization), but in this case I could not find an alternative for this problem.
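for index, row in df.iterrows():
    # Fit a line through this row's two points; X must be 2D (n_samples, n_features)
    model = LinearRegression().fit(np.array([row["X1"], row["X2"]]).reshape(-1, 1),
                                   np.array([row["Y1"], row["Y2"]]))
    # predict also expects a 2D array, so wrap the scalar New_x
    print(model.predict(np.array([[row["New_x"]]])))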
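For what it's worth, one possible vectorized alternative that also saves the output: with only two points per row, the least-squares line is simply the line passing through both points, so the slope and intercept can be computed directly on the columns. This is a sketch under that assumption rather than a general replacement for LinearRegression, and the column name "Prediction" is just a name picked for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"New_x": [5, 2.1, 4.5, 3.0],
     "X1": [1.0, 1, 1, 1],
     "X2": [5.0, 5, 5, 5],
     "Y1": [0.15, 0.7, 1.35, 0.2],
     "Y2": [0.2, 0.85, 1.55, 0.4]},
    index=["1", "2", "3", "4"],
)

# Slope and intercept of the line through (X1, Y1) and (X2, Y2), one per row
slope = (df["Y2"] - df["Y1"]) / (df["X2"] - df["X1"])
intercept = df["Y1"] - slope * df["X1"]

# Evaluate each row's line at its own New_x and store the result as a column
# ("Prediction" is a hypothetical column name)
df["Prediction"] = intercept + slope * df["New_x"]
print(df["Prediction"])
```

This only works while every row has exactly two points with X1 different from X2; with more points per row, a per-row fit such as the loop above would still be needed.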