Let's say that I have the following two dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
I would like to generate a DataFrame (df3) with the same column names, where each value is the RMSE between the corresponding columns of the two DataFrames.
I know that RMSE can be calculated as in the following example, but I am not sure of an efficient way to extend this to DataFrames (my actual DataFrames have many columns):
from sklearn.metrics import mean_squared_error
import math
y_actual = [1,2,3,4,5]
y_predicted = [1.6,2.5,2.9,3,4.1]
MSE = mean_squared_error(y_actual, y_predicted)
RMSE = math.sqrt(MSE)
CodePudding user response:
I think this may be what you are looking for:
First, find the columns that exist in both frames:
s = df1.columns.intersection(df2.columns)
Then compute the RMSE for each of the shared columns (this reuses the math and mean_squared_error imports from the question):
df1[s].apply(lambda x: math.sqrt(mean_squared_error(x, df2[x.name])))
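Result (a Series indexed by column name; the exact values vary between runs because the input data is random):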
A 1.348552
B 1.360788
C 1.325903
D 1.351737
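If the two frames share the same row index, the same numbers can probably be obtained without scikit-learn or apply by relying on pandas' element-wise arithmetic. The following is only a sketch under that assumption: the seeded generator, the name common, and the "RMSE" row label are illustrative choices, not part of the original question.

```python
import numpy as np
import pandas as pd

# Seeded sample data (assumption: stands in for the real frames)
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.standard_normal((100, 4)), columns=list("ABCD"))

# Columns present in both frames
common = df1.columns.intersection(df2.columns)

# Subtract element-wise, square, average down each column, take the square root
rmse = np.sqrt(((df1[common] - df2[common]) ** 2).mean())

# Turn the resulting Series into a one-row DataFrame with the same column names
df3 = rmse.to_frame(name="RMSE").T
print(df3)
```

Note that the subtraction aligns rows by index label, and mean() skips NaN values by default, so rows that are missing from one frame are silently ignored rather than raising an error as mean_squared_error would.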