Resample by month and find a certain number of columns with smaller MAPE values against the target column


Given a dataframe as follows:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, columns=['A_values', 'B_values', 'C_values', 'target'])

Out:

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

I need to resample by month and then calculate the MAPE between each of A_values, B_values, C_values and the target column.

My final objective is to find the two columns among A_values, B_values and C_values that have the smaller MAPE values.

The expected result is a new dataframe built from the original one: the values of those two columns are kept, and the values of the remaining column are reset to NaN.

# function to calculate mape
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true))))*100
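Note that np.maximum(np.ones(len(y_true)), np.abs(y_true)) clamps the denominator at 1, so true values below 1 do not inflate the error. A quick check with hypothetical numbers:

# |0.5 - 1.0| / max(1, 0.5) = 0.5 and |2.0 - 1.0| / max(1, 2.0) = 0.5 -> mean 0.5 -> 50.0
mape([0.5, 2.0], [1.0, 1.0])   # 50.0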

# Calculate the monthly MAPE between each value column and the target column
A_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['A_values'], g['target']))
B_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['B_values'], g['target']))
C_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['C_values'], g['target']))

At this step, we compare the errors of A_values, B_values and C_values for each year-month. For example, for Feb 2013 we might find that A_values and B_values have the smaller MAPE values against the target column, so the remaining column C_values would be set to NaN for that month.
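One way to line the three monthly error series up for that comparison could be something like this (a sketch, assuming A_error, B_error and C_error are the per-month Series from above; errors and worst_per_month are names introduced here for illustration):

# combine the three per-month error Series into one frame for comparison
errors = pd.concat([A_error, B_error, C_error], axis=1,
                   keys=['A_values', 'B_values', 'C_values'])
# for each month, the column with the largest MAPE is the one to blank out
worst_per_month = errors.idxmax(axis=1)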

The final expected result could look like this (not the exact result, just to show the format):

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694       NaN  3.126731
2013-02-27  9.972433  1.281624       NaN  7.529254
2013-02-28  6.621605  7.843101       NaN  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569

How could I achieve that in Pandas? Thanks for your help in advance.

EDIT: if we compute the MAPE over each whole column, with the code below, we find that C_values and A_values have the smaller MAPE, so the values of B_values would be replaced with NaNs.

mape(df['A_values'], df['target'])
Out[65]: 1.1619069176493515
mape(df['B_values'], df['target'])
Out[66]: 1.3477232830950627
mape(df['C_values'], df['target'])
Out[67]: 1.045038328819569

But since I would like to group by year-month (or resample by month) and then compare them, I am asking this question for help.

Reference link:

Pandas Resample Apply Custom Function?

CodePudding user response:

You can rewrite your mape function so it works on a pandas DataFrame, then use groupby().apply, and finally merge_asof to align the monthly errors back to the daily rows:

# all your models
models = df.columns[:-1]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0)*100

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )
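# errors is indexed at month ends; merge_asof with direction='forward' attaches
# each daily row to the error row of the month it falls in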
aligned_mape = pd.merge_asof(df[['target']], errors, 
                             left_index=True, 
                             right_index=True,
                             direction='forward'
                            )

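# for each daily row, flag the column with the largest monthly MAPE; that
# worst-of-three column is the one set to NaN below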
mask = (pd.get_dummies(aligned_mape[models].idxmax(axis=1))
          .reindex(columns=models, fill_value=0).astype(bool)
       )

df[models] = df[models].mask(mask)

Output:

            A_values  B_values  C_values    target
2013-02-26  6.059783       NaN  1.389472  3.126731
2013-02-27  9.972433       NaN  1.789931  7.529254
2013-02-28  6.621605       NaN  0.968944  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
...              ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569
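To double-check which column gets blanked in each month, the intermediate errors frame already holds the monthly MAPE per model, so a quick inspection could look like this (reusing the errors variable from the snippet above):

# monthly MAPE per model column, indexed by month end
print(errors)
# the column with the largest monthly MAPE, i.e. the one set to NaN in that month
print(errors.idxmax(axis=1))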