How to group Pandas DataFrame with MultiIndex columns?-CodePudding

I have a MultiIndex column Pandas DataFrame A:

       foo    
       bar   baz
s1_a     1     2
s1_b     3     4
s2_a     5     6
s2_b     7     8

I want to group the data on a key from another DataFrame B:

       key
s1_a     1
s1_b     1
s2_a     2
s2_b     2

For a DataFrame without a MultiIndex, I would do:

pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()

but this does not work with the MultiIndex. The desired result is

    foo
    bar    baz
1     4      6
2    12     14

How can I achieve this?

CodePudding user response：

You can subset the level and restore is after the merge using pandas.concat:

C = pd.concat({'foo': (pd.merge(A['foo'], B, left_index=True, right_index=True)
                         .groupby('key')
                         .sum()
                       )
              }, axis=1)

output:

>>> C
    foo    
    bar baz
key        
1     4   6
2    12  14

NB. technically, the code you used should work with a FutureWarning, however you lose the MultiIndex an get tuples instead

>>> pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()
     (foo, bar)  (foo, baz)
key                        
1             4           6
2            12          14

FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left,1 on the right)

CodePudding user response：

Series can be passed to groupby directly and the grouper will do appropriate index alignment, so it is possible to do:

A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1     4   6
2    12  14

Setup:

import numpy as np
import pandas as pd

idx = ['s1_a', 's1_b', 's2_a', 's2_b']
A = pd.DataFrame(
    np.arange(1, 9).reshape((-1, 2)),
    index=idx,
    columns=pd.MultiIndex.from_product([['foo'], ['bar', 'baz']])
)

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=idx)

Note this will work in many cases but is less durable than merging:

B variant 1:

B = pd.DataFrame({'key': [1, 2, 2]}, index=['s1_a', 's1_b', 's2_b'])
      key
s1_a    1  # No s2_a
s1_b    2
s2_b    2


A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1.0   1   2
2.0  10  12

B variant 2:

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=['s1_a', 's2_a', 's1_b', 's2_b'])
      key
s1_a    1
s2_a    1  # s1_a/s2_a together
s1_b    2
s2_b    2  # s1_b/s2_b together

A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1     6   8
2    10  12