Resample by month and find a certain number of columns with smaller MAPE values against the target column


Given a dataframe as follows:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, columns=['A_values', 'B_values', 'C_values', 'target'])

Out:

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

I need to resample by month and then calculate the MAPE between each of A_values, B_values, C_values and the target column.

My final objective is to find the two columns among A_values, B_values and C_values that have the smaller MAPE values.

The expected result is a new dataframe built from the original one: the values of those two columns are kept, and the values of the remaining column are reset to NaN.

# function to calculate mape
def mape(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true))))*100
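Note that np.maximum(np.ones(len(y_true)), np.abs(y_true)) clamps the denominator at 1, so true values below 1 do not inflate the error. A quick check with hypothetical numbers:

# |0.5 - 1.0| / max(1, 0.5) = 0.5 and |2.0 - 1.0| / max(1, 2.0) = 0.5 -> mean 0.5 -> 50.0
mape([0.5, 2.0], [1.0, 1.0])   # 50.0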

# Calculate the monthly MAPE between each value column and the target column
A_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['A_values'], g['target']))
B_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['B_values'], g['target']))
C_error = df.groupby(pd.Grouper(freq='M')).apply(lambda g: mape(g['C_values'], g['target']))

At this step, we compare the errors of A_values, B_values and C_values for each year-month. For example, for Feb 2013 we might find that A_values and B_values have the smaller MAPE values against the target column, so the remaining column C_values would be set to NaN for that month.
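One way to line the three monthly error series up for that comparison could be something like this (a sketch, assuming A_error, B_error and C_error are the per-month Series from above; errors and worst_per_month are names introduced here for illustration):

# combine the three per-month error Series into one frame for comparison
errors = pd.concat([A_error, B_error, C_error], axis=1,
                   keys=['A_values', 'B_values', 'C_values'])
# for each month, the column with the largest MAPE is the one to blank out
worst_per_month = errors.idxmax(axis=1)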

The final expected result could look like this (not the exact result, just to show the format):

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694       NaN  3.126731
2013-02-27  9.972433  1.281624       NaN  7.529254
2013-02-28  6.621605  7.843101       NaN  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569

How could I achieve that in Pandas? Thanks for your help in advance.

EDIT: if we compute the MAPE over each whole column, with the code below, we find that C_values and A_values have the smaller MAPE, so the values of B_values would be replaced with NaNs.

mape(df['A_values'], df['target'])
Out[65]: 1.1619069176493515
mape(df['B_values'], df['target'])
Out[66]: 1.3477232830950627
mape(df['C_values'], df['target'])
Out[67]: 1.045038328819569

But since I would like to group by year-month (or resample by month) and then compare them, I am asking this question for help.

Reference link:

Pandas Resample Apply Custom Function?

CodePudding user response:

You can rewrite your mape function so it works on a pandas DataFrame, then use groupby().apply, and finally merge_asof to align the monthly errors back to the daily rows:

# all your models
models = df.columns[:-1]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0)*100

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']]))
         )
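# errors is indexed at month ends; merge_asof with direction='forward' attaches
# each daily row to the error row of the month it falls in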
aligned_mape = pd.merge_asof(df[['target']], errors, 
                             left_index=True, 
                             right_index=True,
                             direction='forward'
                            )

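# for each daily row, flag the column with the largest monthly MAPE; that
# worst-of-three column is the one set to NaN below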
mask = (pd.get_dummies(aligned_mape[models].idxmax(axis=1))
          .reindex(columns=models, fill_value=0).astype(bool)
       )

df[models] = df[models].mask(mask)

Output:

            A_values  B_values  C_values    target
2013-02-26  6.059783       NaN  1.389472  3.126731
2013-02-27  9.972433       NaN  1.789931  7.529254
2013-02-28  6.621605       NaN  0.968944  0.585713
2013-03-01  9.623960       NaN  0.866300  5.612724
2013-03-02  6.165247       NaN  5.743043  3.711608
...              ...       ...       ...       ...
2013-05-22       NaN  6.479978  3.531450  6.872059
2013-05-23       NaN  3.837670  8.853146  8.209883
2013-05-24       NaN  5.241127  1.388056  5.355926
2013-05-25       NaN  4.665995  2.406251  1.971875
2013-05-26       NaN  1.787529  6.659690  7.545569
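To double-check which column gets blanked in each month, the intermediate errors frame already holds the monthly MAPE per model, so a quick inspection could look like this (reusing the errors variable from the snippet above):

# monthly MAPE per model column, indexed by month end
print(errors)
# the column with the largest monthly MAPE, i.e. the one set to NaN in that month
print(errors.idxmax(axis=1))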