Calculate Mean Absolute Error for each row of a Pandas dataframe-CodePudding

Below is a sample of pandas dataframe that I'm working with.
I want to calculate mean absolute error for each row but only considering relevant columns for values of ID column.

There may be 2 4,6 or 8 columns relevant to values of ID column. For example, relevant columns for 'id4' is 'id4_signal1_true' and 'id4_signal1_pred'. for 'id1' its 'id1_signal1_true', 'id1_signal2_true', 'id1_signal1_pred' and 'id1_signal2_pred'.

import pandas as pd
list = [['id4',0.37,0.97,0.21,0.54,0.11,0.38,0.95,0.2,0.5,0.23],
        ['id1',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23],
        ['id3',0.41,0.44,0.21,0.54,0.11,0.41,0.48,0.2,0.5,0.23]]

df = pd.DataFrame(list, columns =['ID','id1_signal1_true','id1_signal2_true','id4_signal1_true','id3_signal1_true',
                                  'id3_signal2_true','id1_signal1_pred','id1_signal2_pred','id4_signal1_pred',
                                  'id3_signal1_pred','id3_signal2_pred'])

I want to calculate mae for each row only considering the specific relevant columns. for example, for the first row, it should be like mean(abs(id4_signal1_true-id4_signal1_pred)) for the second row, it should be mean(abs('id1_signal1_true'-'id1_signal1_pred'),abs('id1_signal2_true'-'id1_signal2_pred'))

below is a screenshot of how the output looks like. MAE is the column which i want to get

I used the below code to solve this. this is working fine. But the only issue is i have around 2 million rows and this takes hours. i want to find a efficient way to do this. Highly appreciated your help on this

Attack = df
ID_MAE = []
for id in range(len(Attack['ID'])):

    signals = np.array(Attack[Attack.columns[Attack.columns.str.contains(Attack.ID[id])]]) # select columns relevant to current ID
    signal = signals[id]  # select only the specific row (id)

    no_of_signals = int(len(signal)/2) # identify number of signals
    reshaped_arr = np.reshape(signal, (2,no_of_signals))
    signal_true = reshaped_arr[0]  # array for true values
    signal_pred = reshaped_arr[1]  # array for predicted values

    # mae calculation
    MAE = np.mean(np.abs(signal_true - signal_pred), axis=0)
    ID_MAE.append(MAE)

df['MAE'] = ID_MAE

CodePudding user response：

Try:

Split the columns into a three-level header
groupby ID and Signal and get the MAE
Select the correct MAE for each row
Collapse the multi-level header back to a single level.

df = df.set_index("ID").rename_axis(None)
df.columns = df.columns.str.split("_",expand=True)
df = df.rename_axis(["ID","Signal","Type"],axis=1).sort_values(["ID","Signal"],axis=1)
MAE = df.groupby(["ID","Signal"], axis=1).diff().abs().groupby("ID", axis=1).mean()
df.columns = df.columns.map("_".join)

df["MAE"] = df.index.to_series().apply(lambda x: MAE.at[x,x])

>>> df
     id1_signal1_true  id1_signal1_pred  ...  id4_signal1_pred   MAE
id4              0.37              0.38  ...               0.2  0.01
id1              0.41              0.41  ...               0.2  0.02
id3              0.41              0.41  ...               0.2  0.08

[3 rows x 11 columns]

CodePudding user response：

import numpy as np
df2 = df.set_index('ID').apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.name)]].to_dict().items())))).reshape(-1,2)),  axis=1)
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())

output:

ID
id4    0.01
id1    0.02
id3    0.08
dtype: float64

Try it, if it is good, I will edit and explain ... :)

UPDATE:

or you can:

df2 = df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)),  axis=1)
df["MAE"] = df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())