Sequentially calculate RMSE for each row, each time using the next row (Pandas)


I have a dataframe of y_true and y_pred values.

I need to sequentially calculate the RMSE for each row, each time adding the next row to the calculation, i.e. an expanding calculation that starts from row 1. After that, store the result in a new column RMSE.

Here is the dataframe. The code would take the first row, y_true = 105 and y_pred = 195, calculate the RMSE (I use from sklearn.metrics import mean_squared_error), which would be 90.0, and put it in the new column.

After that we take the first row and the next one, which would be y_true = [105, 297], y_pred = [195, 337], and calculate the error again; the result would be about 69.64. This continues for the whole dataframe: next time y_true = [105, 297, 262], y_pred = [195, 337, 804], and so on. I assume this could be done with a loop.

   ID  y_true  y_pred
0   1     105     195
1   2     297     337
2   3     262     804
3   4     206     708
4   5     226     705
5   6     309     463
6   7     210     676
7   8     124     573

The result should be:

   ID  y_true  y_pred    RMSE
0   1     105     195   90.00
1   2     297     337   69.64
2   3     262     804  318.05
3   4     206     708  372.65
4   5     226     705  396.21
5   6     309     463  367.11
6   7     210     676  382.81
7   8     124     573  391.69

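As a quick sanity check of the arithmetic above (purely illustrative; it just applies the RMSE definition with numpy to the first one and two rows):

import numpy as np

# row 1 alone: sqrt((105 - 195)**2) = 90.0
print(np.sqrt(np.mean((np.array([105]) - np.array([195])) ** 2)))

# rows 1-2: sqrt(((105 - 195)**2 + (297 - 337)**2) / 2) = sqrt(4850) ≈ 69.64
print(np.sqrt(np.mean((np.array([105, 297]) - np.array([195, 337])) ** 2)))
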
Currently the code calculates the RMSE for the whole dataframe at once:

import pandas as pd
from sklearn.metrics import mean_squared_error
import numpy as np

dataframe = pd.read_excel(r'S:\...\df.xlsx')

y_true = dataframe['y_true']
y_pred = dataframe['y_pred']

rmse = mean_squared_error(y_true, y_pred, squared=False)
print(rmse)

But the question is how to calculate it row by row, each time taking the next row into account as well?

CodePudding user response:

If your data frame is not too huge, you can use a list comprehension:

import pandas as pd
from sklearn.metrics import mean_squared_error
import numpy as np

df = pd.DataFrame({'y_true': [105, 297, 262, 206],
                   'y_pred': [195, 337, 804, 708]})

# each element of the list uses the rows up to and including row i
df['RMSE'] = [mean_squared_error(df.y_true[:(i + 1)],
                                 df.y_pred[:(i + 1)], squared=False)
              for i in range(df.shape[0])]

   y_true  y_pred        RMSE
0     105     195   90.000000
1     297     337   69.641941
2     262     804  318.048214
3     206     708  372.648628

Or, based on how RMSE is calculated, you want the square root of the expanding (incremental) mean of the squared errors. This gives the squared error:

(df.y_true - df.y_pred)**2

This gives the incremental mean of the above:

((df.y_true - df.y_pred)**2).expanding().mean()

This adds the square root, which you can see gives the same result as above:

df['RMSE'] = np.sqrt(((df.y_true - df.y_pred)**2).expanding().mean())

   y_true  y_pred        RMSE
0     105     195   90.000000
1     297     337   69.641941
2     262     804  318.048214
3     206     708  372.648628

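For completeness, here is a minimal self-contained sketch that puts the three expanding() steps together on the full eight-row example from the question; the intermediate column names (sq_err, expanding_mse) are just illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'y_true': [105, 297, 262, 206, 226, 309, 210, 124],
                   'y_pred': [195, 337, 804, 708, 705, 463, 676, 573]})

# step 1: per-row squared error
df['sq_err'] = (df.y_true - df.y_pred) ** 2

# step 2: expanding (cumulative) mean of the squared errors
df['expanding_mse'] = df['sq_err'].expanding().mean()

# step 3: square root of the expanding mean gives the expanding RMSE
df['RMSE'] = np.sqrt(df['expanding_mse'])

print(df[['y_true', 'y_pred', 'RMSE']])
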
CodePudding user response:

Another way:

import pandas as pd
from sklearn.metrics import mean_squared_error
df = pd.read_excel('test.xls')

df["RMSE"] = df.apply(lambda x: mean_squared_error(df.loc[:x.name, 'y_true'].tolist(),
                                                   df.loc[:x.name, 'y_pred'].tolist(),
                                                   squared=False),
                      axis=1)
print(df)
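
As a quick, illustrative usage check (using the example data from the question instead of reading test.xls), the apply-based column agrees with the vectorized expanding() version from the first answer:

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

df = pd.DataFrame({'y_true': [105, 297, 262, 206, 226, 309, 210, 124],
                   'y_pred': [195, 337, 804, 708, 705, 463, 676, 573]})

# apply recomputes the prefix 0..x.name for every row
df["RMSE"] = df.apply(lambda x: mean_squared_error(df.loc[:x.name, 'y_true'],
                                                   df.loc[:x.name, 'y_pred'],
                                                   squared=False),
                      axis=1)

# should match the vectorized expanding() approach
assert np.allclose(df["RMSE"],
                   np.sqrt(((df.y_true - df.y_pred) ** 2).expanding().mean()))

Note that the apply and list-comprehension versions redo the whole prefix for every row (roughly quadratic work), while the expanding() version is a single vectorized pass, so the latter scales better on large frames.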