The objective is to calculate subset column averages based on multiple conditions in a multi-index dataframe.
The first condition is to get the average grouped by the first level of the multi-index.
The second condition is to get the average based on the dict_ref value below:
dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])
For example, for the key occ, get the average of F2 and F4.
The code below should do the job:
import re
import numpy as np
import pandas as pd

np.random.seed(0)
dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])
_names = ['pow_fr', 'pow_fr', 'pow_fr', 'pow_fr', 'pow_fr', 'pow_fr', 'pow_fr', 'pow_fr',
          'hjor_com', 'hjor_com', 'hjor_com', 'hjor_com']
_idx = ['Fp1_band0', 'Fp1_band1', 'Fpx_band0', 'Fpx_band1', 'F2_band0', 'F2_band1', 'F4_band0', 'F4_band1',
        'Fp1', 'Fpx', 'F2', 'F4']
X = np.random.rand(4, len(_names))
columns = pd.MultiIndex.from_arrays([_names, _idx])
df = pd.DataFrame(data=X, columns=columns)

# split 'Fp1_band0' into ('Fp1', 'band0') and pad shorter tuples with ''
remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch", "level_2": "feature", "level_3": "region"})
all_df = []
for nref in dict_ref:
    # numeric_only=True keeps the string column 'ch' out of the mean (needed on pandas >= 2.0)
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean(numeric_only=True).reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)
df1 = pd.concat([df, *all_df]).pivot_table(index=['group_feature', 'ch', 'feature']).transpose()
df1.columns = [(gf[0], f'{gf[1]}' if not gf[-1] else f'{gf[1]}_{gf[-1]}') for gf in df1.columns.values.tolist()]
However, I wonder whether there is a way to avoid the for-loop:

all_df = []
for nref in dict_ref:
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean(numeric_only=True).reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)
Not as important as the above question, but it would be a bonus if there is a way to avoid the following lines altogether:

remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch", "level_2": "feature", "level_3": "region"})
Expected Output
(hjor_com, F2) (hjor_com, F4) ... (pow_fr, occ_band0) (pow_fr, occ_band1)
0 0.791725 0.528895 ... 0.430621 0.768834
1 0.461479 0.780529 ... 0.399188 0.851316
2 0.018790 0.617635 ... 0.393202 0.594448
3 0.210383 0.128926 ... 0.528570 0.248629
[4 rows x 18 columns]
CodePudding user response:
You can flip dict_ref so that each item in a value list becomes a key, perform a replacement, and group by the new ch:
mapping = {
    v: key for key, value in dict_ref.items() for v in value
}
all_df = df.replace({"ch": mapping}).groupby(["group_feature", "feature", "ch"]).mean().reset_index()
df1 = pd.concat([df, all_df])...
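A complete, runnable sketch of that approach, reusing the question's setup; the long-format reshape, pivot, and column-renaming steps are taken from the question, while the single replace-then-groupby stands in for the loop:

```python
import re
import numpy as np
import pandas as pd

np.random.seed(0)
dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])

_names = ['pow_fr'] * 8 + ['hjor_com'] * 4
_idx = ['Fp1_band0', 'Fp1_band1', 'Fpx_band0', 'Fpx_band1',
        'F2_band0', 'F2_band1', 'F4_band0', 'F4_band1',
        'Fp1', 'Fpx', 'F2', 'F4']
X = np.random.rand(4, len(_names))
df = pd.DataFrame(data=X, columns=pd.MultiIndex.from_arrays([_names, _idx]))

# reshape to long format: one row per (group_feature, ch, feature)
tuples = [(name, *re.split('_', ch)) for name, ch in df.columns]
width = len(max(tuples, key=len))
tuples = [t + ('',) * (width - len(t)) for t in tuples]
df.columns = pd.MultiIndex.from_tuples(tuples)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch", "level_2": "feature"})

# flip dict_ref so each channel points to its group key: {'F2': 'occ', ...}
mapping = {v: key for key, value in dict_ref.items() for v in value}

# replace channel names by their group, then a single groupby replaces the loop
all_df = (df.replace({"ch": mapping})
            .groupby(["group_feature", "feature", "ch"])
            .mean()
            .reset_index())

df1 = (pd.concat([df, all_df])
         .pivot_table(index=['group_feature', 'ch', 'feature'])
         .transpose())
df1.columns = [(gf, ch if not feat else f'{ch}_{feat}')
               for gf, ch, feat in df1.columns]
print(df1.shape)  # (4, 18)
```

With seed 0 this reproduces the expected output, e.g. (pow_fr, occ_band0) in row 0 is the mean of the row-0 values of F2_band0 and F4_band0.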