How to Apply a function to multiple multiindex columns in Pandas?-CodePudding

Given a multiindex columns

         a                      ...                              
         E1                      ...        E3                    
         g1        g2        g3  ...        g1        g2        g3
0  0.548814  0.715189  0.602763  ...  0.437587  0.891773  0.963663
1  0.383442  0.791725  0.528895  ...  0.087129  0.020218  0.832620
2  0.778157  0.870012  0.978618  ...  0.118274  0.639921  0.143353
3  0.944669  0.521848  0.414662  ...  0.568434  0.018790  0.617635
4  0.612096  0.616934  0.943748  ...  0.697631  0.060225  0.666767
5  0.670638  0.210383  0.128926  ...  0.438602  0.988374  0.102045
6  0.208877  0.161310  0.653108  ...  0.158970  0.110375  0.656330
7  0.138183  0.196582  0.368725  ...  0.096098  0.976459  0.468651
8  0.976761  0.604846  0.739264  ...  0.296140  0.118728  0.317983
9  0.414263  0.064147  0.692472  ...  0.093941  0.575946  0.929296
[10 rows x 9 columns]

I would like to apply multiple column filtered by the second level (i.e., E1, E2, E3) to a functions (e.g., ration_type1 ,ration_type2, or can be more in actual implementation).

For example. Assume we are to compute the second level of E1 under the function ration_type1 and ration_type2. Then we only process the following df

    a                    
         E1                    
         g1        g2        g3
0  0.548814  0.715189  0.602763
1  0.383442  0.791725  0.528895
  .................
8  0.976761  0.604846  0.739264
9  0.414263  0.064147  0.692472

To generalise all second level, I than rely on the list comprehension as below

for each of the ration_type1 and ration_type2.

all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in [`E1`, `E2`, `E3`]]


all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in [`E1`, `E2`, `E3`]]

before concat it back to the orignal df.

However, I wonder whether there is more elegant and compact way than the list comprehension approach. This is because, in real life implementation, there can be more ration function.

The full code is as below

 import numpy as np

import pandas as pd

np.random.seed(0)

arr = np.random.rand(10,9)

tuples = [('a', 'E1', 'g1'), ('a', 'E1', 'g2'), ('a', 'E1', 'g3'), ('a', 'E2', 'g1'), ('a', 'E2', 'g2'),
          ('a', 'E2', 'g3'), ('a', 'E3', 'g1'), ('a', 'E3', 'g2'), ('a', 'E3', 'g3')]
df = pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples(tuples))

print(df)
def ration_type1(df):
    """
    (g3 g2)/g1
    # Ugly way since have to convert to numpy 1st
    """

    print(df)
    dration = 'ration_type1'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g2')].to_numpy()   \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()

    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


def ration_type2(df):
    """
    (g3 g2 g1)/g1
    # Ugly way since have to convert to numpy 1st
    """
    dration = 'ration_type2'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()   \
            df.loc[:, (slice(None), slice(None), 'g2')].to_numpy()   \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()

    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


level1_name = list(set(df.columns.get_level_values(1)))

all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]

df1 = pd.concat(all_df1, axis=1)
df2 = pd.concat(all_df2, axis=1)

df=pd.concat([df,df1,df2],axis=1)

Expected output.

          a                      ...                                       
         E1                      ...                        E2           E3
         g1        g2        g3  ... ration_type2 ration_type2 ration_type2
0  0.548814  0.715189  0.602763  ...     3.401458     2.962896     5.240151
1  0.383442  0.791725  0.528895  ...     4.444124     2.754497    10.788191
2  0.778157  0.870012  0.978618  ...     3.375653     2.554145     7.622516
3  0.944669  0.521848  0.414662  ...     1.991363     5.650758     2.119612
4  0.612096  0.616934  0.943748  ...     3.549735     2.168255     2.042087
5  0.670638  0.210383  0.128926  ...     1.505949     3.960760     3.486126
6  0.208877  0.161310  0.653108  ...     4.899035     3.806001     5.822965
7  0.138183  0.196582  0.368725  ...     5.091008     2.138921    16.037821
8  0.976761  0.604846  0.739264  ...     2.376088    11.283905     2.474676
9  0.414263  0.064147  0.692472  ...     2.826423     2.391873    17.023361

[10 rows x 15 columns]

I am thinking to do something like using apply

# function for prepending 'Geek'
def multiply_by_2(number):
    return 2 * number
 
# executing the function
df[["Integers", "Float"]] = df[["Integers", "Float"]].apply(multiply_by_2)

But, I am having difficulties (due to my limited knowledge)to do it since my example involving multiindex columns

CodePudding user response：

Not so easy if working with MultiIndex - solution filter levels with rename g values to ration_type1, ration_type2 for possible divide MultiIndex DataFrames:

idx = pd.IndexSlice
c = {'g1':'ration_type1','g2':'ration_type1','g3':'ration_type1'}
df1 = df.loc[:, idx[:,:,['g3','g2']]].rename(columns=c).groupby(level=[0,1,2], axis=1).sum()
df11 = df1.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c))


c1 = {'g1':'ration_type2','g2':'ration_type2','g3':'ration_type2'}
df2 = df.rename(columns=c1).groupby(level=[0,1,2], axis=1).sum()
df22 = df2.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c1))

df=pd.concat([df,df11,df22],axis=1)

Simplier is reshape first:

df1 = df.stack([0,1])
df1['ration_type1'] = df1[['g2','g3']].sum(axis=1).div(df1['g1'])
df1['ration_type2'] = df1.sum(axis=1).div(df1['g1'])

print(df1)
              g1        g2        g3  ration_type1  ration_type2
0 a E1  0.548814  0.715189  0.602763      2.401458      7.777186
    E2  0.544883  0.423655  0.645894      1.962896      6.565312
    E3  0.437587  0.891773  0.963663      4.240151     14.929992
1 a E1  0.383442  0.791725  0.528895      3.444124     13.426259
    E2  0.568045  0.925597  0.071036      1.754497      5.843159
    E3  0.087129  0.020218  0.832620      9.788191    123.129174
2 a E1  0.778157  0.870012  0.978618      2.375653      6.428577
    E2  0.799159  0.461479  0.780529      1.554145      4.498872
    E3  0.118274  0.639921  0.143353      6.622516     63.615316
3 a E1  0.944669  0.521848  0.414662      0.991363      3.040793
    E2  0.264556  0.774234  0.456150      4.650758     23.230266
    E3  0.568434  0.018790  0.617635      1.119612      4.089254
4 a E1  0.612096  0.616934  0.943748      2.549735      7.715318
    E2  0.681820  0.359508  0.437032      1.168255      3.881690
    E3  0.697631  0.060225  0.666767      1.042087      3.535837
5 a E1  0.670638  0.210383  0.128926      0.505949      2.260380
    E2  0.315428  0.363711  0.570197      2.960760     13.347233
    E3  0.438602  0.988374  0.102045      2.486126      9.154429
6 a E1  0.208877  0.161310  0.653108      3.899035     23.565714
    E2  0.253292  0.466311  0.244426      2.806001     14.884143
    E3  0.158970  0.110375  0.656330      4.822965     36.161882
7 a E1  0.138183  0.196582  0.368725      4.091008     34.696743
    E2  0.820993  0.097101  0.837945      1.138921      3.526168
    E3  0.096098  0.976459  0.468651     15.037821    172.521382
8 a E1  0.976761  0.604846  0.739264      1.376088      3.784915
    E2  0.039188  0.282807  0.120197     10.283905    273.710140
    E3  0.296140  0.118728  0.317983      1.474676      7.454332
9 a E1  0.414263  0.064147  0.692472      1.826423      7.235273
    E2  0.566601  0.265389  0.523248      1.391873      4.848404
    E3  0.093941  0.575946  0.929296     16.023361    187.592593

Last reshape to original MultiIndex:

df = df1.unstack([1,2]).reorder_levels([1,2,0], axis=1)