Fit and predict linear regression for each row in the database in Python

Good evening everyone, I am new to Python and I'm trying to learn by reproducing a model I have in Excel.

I need to replicate the "TREND" function: fit a small linear model through two extreme points, let's say

A = (1, 0.15), B = (5, 0.2)

and predict y at a given x value (let's say 4.2).

For the purpose of this code I need to fit a separate model for each row of my database. The x values are the same in every row (x_1=1 and x_2=5), while the y values differ from row to row.

I tried using LinearRegression() and model.predict from the sklearn.linear_model module this way:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1, 1, 1, 1],
        'X2':[5, 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}  

df=pd.DataFrame(data,index=["1","2","3","4"])

model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
prediction=model.predict(df["New_x"].values.reshape(-1,1))

But I'm getting this error:

    ValueError                                Traceback (most recent call last)
<ipython-input-88-da83cb57bf4a> in <module>()
     18 
     19 model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
---> 20 prediction=model.predict(df["New_x"].values.reshape(-1,1))
     21 
     22 #model = LinearRegression().fit(SEC_ERBA_sample[["Vertex1","Vertex2"]], SEC_ERBA_sample[["SENIOR_1Y","SENIOR_5Y"]])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
    254             Returns predicted values.
    255         """
--> 256         return self._decision_function(X)
    257 
    258     _preprocess_data = staticmethod(_preprocess_data)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
    239         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    240         return safe_sparse_dot(X, self.coef_.T,
--> 241                                dense_output=True) + self.intercept_
    242 
    243     def predict(self, X):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

ValueError: shapes (4,1) and (2,2) not aligned: 1 (dim 1) != 2 (dim 0)

So I presume that LinearRegression().fit is fitting a single model across all rows, treating the columns as features. Is there a way to fit and predict a linear regression for each row?

CodePudding user response：

I think this is a simple code typo, but it may be founded on a deeper conceptual problem, so I'll try to give you a broader answer. The fit method of a scikit-learn estimator (here LinearRegression.fit) trains an ML model by associating a set of features X with a set of ground-truth values y. In your example, you are training two multi-variable regression models (one for Y1, one for Y2), each estimating its target from X1 and X2:

model = LinearRegression().fit(df[["X1","X2"]], df[["Y1","Y2"]])

So the model learns to estimate these two variables from two other variables. During prediction, the model requires exactly those two variables (X1 and X2) to be able to predict the values of interest.

predictions = model.predict(df[["New_x1", "New_x2"]])

If the New_x2 information is not available during test (predict) time, then you either have to estimate it as well or remove it from training altogether.

A simple abstract example: if a model was trained to estimate your preferred t-shirt size from your height and weight, you need to know both height and weight during test (predict) time to obtain the correct size estimation.
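As a minimal sketch of the point above, here is a model fitted on two feature columns and then asked to predict from one column only. The numbers are made up for illustration (they are not the question's data), and the variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: 4 samples, 2 features, 2 targets.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
Y = np.array([[0.15, 0.2], [0.7, 0.85], [1.35, 1.55], [0.2, 0.4]])

model = LinearRegression().fit(X, Y)

# One coefficient per (target, feature) pair.
coef_shape = model.coef_.shape

# Two feature columns in, two target columns out.
ok_shape = model.predict(np.array([[2.5, 2.0]])).shape

# A single feature column does not match what the model was fitted on.
try:
    model.predict(np.array([[2.5]]))
    single_column_error = None
except ValueError as exc:
    single_column_error = exc

print(coef_shape, ok_shape, type(single_column_error).__name__)
```

The failing call is the same situation as in the question: the input passed to predict has one column, while the model was fitted on two.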

CodePudding user response：

I found a solution using iterrows(). It is still incomplete, as I can't save the output, but I think I will open a separate and more focused question for that.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1., 1, 1, 1],
        'X2':[5., 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}  

df=pd.DataFrame(data,index=["1","2","3","4"])

This final piece runs the linear regression row by row. Using iterrows() is generally discouraged, since many operations can be done in other ways (including vectorization), but in this case I could not find an alternative solution for this problem.

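for index, row in df.iterrows():
    # Fit a line through this row's two points: (X1, Y1) and (X2, Y2)
    model=LinearRegression().fit(np.array([row["X1"],row["X2"]]).reshape(-1,1),
                                 np.array([row["Y1"],row["Y2"]]))
    # predict expects a 2D array of shape (n_samples, n_features)
    print(model.predict(np.array([[row["New_x"]]])))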
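For what it's worth, a vectorized alternative is possible here, offered as a sketch rather than a tested drop-in. With exactly two points per row (and X1 different from X2), the least-squares line passes through both points, so the prediction can be computed in closed form for all rows at once and stored in a new column. The column name "Prediction" is my own choice.

```python
import pandas as pd

data = {'New_x': [5, 2.1, 4.5, 3.0],
        'X1': [1., 1, 1, 1],
        'X2': [5., 5, 5, 5],
        'Y1': [0.15, 0.7, 1.35, 0.2],
        'Y2': [0.2, 0.85, 1.55, 0.4]}

df = pd.DataFrame(data, index=["1", "2", "3", "4"])

# Slope of the line through (X1, Y1) and (X2, Y2), one value per row.
slope = (df["Y2"] - df["Y1"]) / (df["X2"] - df["X1"])

# Evaluate each row's line at New_x and keep the result in the DataFrame.
df["Prediction"] = df["Y1"] + slope * (df["New_x"] - df["X1"])

print(df)
```

This avoids both the loop and the per-row model objects, and it also keeps the output, since the results end up in the DataFrame itself. It only applies to the two-point case; with more than two points per row, a real least-squares fit would be needed.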