Home > Software design >  Iteratively read excel files, calculate MAE and append to one dataframe using Python
Iteratively read excel files, calculate MAE and append to one dataframe using Python

Time:12-26

Given a folder which contains 3 predictions result, one example as follows:

          date  real      pred indicator
0    2021/1/31   1.0  0.586583     train
1    2021/2/28   0.6  0.442644     train
2    2021/3/31  -0.5 -0.011877     train
3    2021/4/30  -0.3 -0.011877     train
4    2021/5/31  -0.2 -0.011877     train
5    2021/6/30  -0.4 -0.011877     train
6    2021/7/31   0.3  0.152951     train
7    2021/8/31   0.1  0.393088     train
8    2021/9/30   0.0  0.163108     train
9   2021/10/31   0.7  0.537028     valid
10  2021/11/30   0.4  0.586583     valid

I hope to loop and read all excel files, calculate MAE respectively for subset train and test of each prediction file, and finally concatenate the calculated MAE values and excel file names.

Code I have done:

import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
import os
from pathlib import Path
import re
import glob

file_paths = glob.glob('./preds/*.xlsx')
dfs = pd.DataFrame()

for file_path in file_paths:
    data = pd.read_excel(file_path)
    # print(data)
    model_name = Path(file_path).stem
    print(model_name)
    
    train = data.loc[data['indicator'] == 'train']
    valid = data.loc[data['indicator'] == 'valid']
    
    print(mean_absolute_error(train['real'], train['pred']))
    print(mean_absolute_error(valid['real'], valid['pred']))

Out:

pred3
0.06462038177996876
0.14329840342203767
pred2
0.29968300930091313
0.20961438417434672
pred1
0.2807230830192565
0.1747779786586765

To create dataframe for save output:

col_names = ['model', 'train_mae', 'valid_mae']
dfs = pd.DataFrame(columns=col_names)

My question is how could I append the iteratively calculated MAE values to the final result? Thanks.

The expected result:

   model  mae_train  mae_valid
0  pred1   0.280723   0.174778
1  pred2   0.299683   0.209614
2  pred3   0.064620   0.143298

EDIT:

model_name = []
train_mae = []
test_mae = []

for file_path in file_paths:
    data = pd.read_excel(file_path)
    # print(data)
    model_name = Path(file_path).stem
    print(model_name)

    train = data.loc[data['indicator'] == 'train']
    valid = data.loc[data['indicator'] == 'valid']

    print(mean_absolute_error(train['real'], train['pred']))
    print(mean_absolute_error(valid['real'], valid['pred']))

    train_mae = mean_absolute_error(train['real'], train['pred'])
    valid_mae = mean_absolute_error(valid['real'], valid['pred'])

    model_name.append(model_name)
    train_mae.append(train_mae)
    valid_mae.append(valid_mae)

df = pd.DataFrame(list(zip(model_name , train_mae, valid_mae)),
           columns =['model', 'train_mae', 'valid_mae'])

Errors:

Traceback (most recent call last):
  File "<input>", line 23, in <module>
AttributeError: 'str' object has no attribute 'append'

CodePudding user response:

Use a list of lists to store your information, then pass it to pd.DataFrame:

mae_data = []

for file_path in file_paths:
    data = pd.read_excel(file_path)
    # print(data)
    model_name = Path(file_path).stem
    print(model_name)
    
    train = data.loc[data['indicator'] == 'train']
    valid = data.loc[data['indicator'] == 'valid']
    
    print(mean_absolute_error(train['real'], train['pred']))
    print(mean_absolute_error(valid['real'], valid['pred']))
    
    ## append [model_name, mae_train, mae_valid] to mae_list
    mae_data.append([model_name, mean_absolute_error(train['real'], train['pred']),mean_absolute_error(valid['real'], valid['pred'])])

col_names = ['model', 'train_mae', 'valid_mae']
dfs = pd.DataFrame(data=mae_data, columns=col_names)
  • Related