Groupby year-month and drop columns with all NaNs in Python


Based on the output dataframe from my previous post (image omitted):

Pseudocode:

# Per year-month group, drop the *_mape columns that are entirely NaN,
# rename the survivors top_1 … top_k, then join them back onto df.
df2 = df.filter(regex='_mape$').groupby(pd.Grouper(freq='M')).dropna(axis=1, how='all')
df2.columns = ['top_1', 'top_2', ..., 'top_k']
df.join(df2)

As @Quang Hoang commented on my last post, we might be able to use justify_nd to achieve this, but I don't know how. Thanks in advance for your help.

CodePudding user response:

Actually, what you need is, for each group (by year / month):

  • compute the errors locally for the current group,
  • find the k "wanted" columns (calling argsort) and take the indicated names from models,
  • take those columns from the current group and rename them to top_…,
  • return the resulting frame.

To do it, define a "group processing" function:

import numpy as np

def grpProc(grp):
    # error of each model column against the target, within this group
    err = mape(grp[models], grp[['target']])
    # mask meant to mark the k lowest-error models (but see the Edit below)
    sort_args = np.argsort(err) < k
    cols = models[sort_args]
    out_cols = [f'top_{i+1}' for i in range(k)]
    # keep only the selected columns and rename them top_1 … top_k
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv
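
Note that grpProc relies on names defined elsewhere in the thread: mape (the error function), models (an Index of the model column names) and k. The thread doesn't show mape itself, so the following is only a minimal sketch consistent with how grpProc calls it:

def mape(preds, target):
    # assumed implementation; the thread's actual mape isn't shown
    # mean absolute percentage error of each column of preds against
    # the single target column, returned as a Series (in percent)
    t = target.iloc[:, 0]
    return preds.sub(t, axis=0).abs().div(t, axis=0).mean() * 100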

Then, to generate the top_… columns alone, apply this function to each group:

wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)
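
One caveat, depending on your pandas version: apply may prepend the month key as an extra index level when the applied function returns a DataFrame. If the join below misaligns, passing group_keys=False keeps each group's original date index:

# keep the per-row date index instead of adding a (month, date) MultiIndex
wrk = df.groupby(pd.Grouper(freq='M'), group_keys=False).apply(grpProc)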

And finally, generate the expected result by joining the target column with wrk:

result = df[['target']].join(wrk)

Edit

For the first group (2013-02-28), err contains:

A_values     48.759348
B_values     77.023855
C_values    325.376455
D_values     74.422508
E_values     60.602101

Note that the 2 lowest error values are 48.759348 and 60.602101, so from this group you should take A_values (which the code does) and E_values (instead of D_values).

So, in grpProc, instead of:

sort_args = np.argsort(err) < k
cols = models[sort_args]

it should contain:

cols = err.nsmallest(k).index
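
As a quick check, here is a small sketch (using the err values quoted above and assuming k = 2) showing that the argsort mask picks the wrong column while nsmallest picks the right ones:

import numpy as np
import pandas as pd

err = pd.Series({'A_values': 48.759348, 'B_values': 77.023855,
                 'C_values': 325.376455, 'D_values': 74.422508,
                 'E_values': 60.602101})
k = 2

# argsort returns the positions that would sort the values, not each
# value's rank, so masking with "< k" marks the wrong columns here:
print(err[np.argsort(err) < k].index.tolist())   # ['A_values', 'D_values']

# nsmallest selects the k lowest-error columns directly, best first:
print(err.nsmallest(k).index.tolist())           # ['A_values', 'E_values']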