Home > Enterprise >  How to group Pandas DataFrame with MultiIndex columns?
How to group Pandas DataFrame with MultiIndex columns?

Time:09-23

I have a MultiIndex column Pandas DataFrame A:

       foo    
       bar   baz
s1_a     1     2
s1_b     3     4
s2_a     5     6
s2_b     7     8

I want to group the data on a key from another DataFrame B:

       key
s1_a     1
s1_b     1
s2_a     2
s2_b     2

For a DataFrame without a MultiIndex, I would do:

pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()

but this does not work with the MultiIndex. The desired result is

    foo
    bar    baz
1     4      6
2    12     14

How can I achieve this?

CodePudding user response:

You can subset the level and restore is after the merge using pandas.concat:

C = pd.concat({'foo': (pd.merge(A['foo'], B, left_index=True, right_index=True)
                         .groupby('key')
                         .sum()
                       )
              }, axis=1)

output:

>>> C
    foo    
    bar baz
key        
1     4   6
2    12  14

NB. technically, the code you used should work with a FutureWarning, however you lose the MultiIndex an get tuples instead

>>> pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()
     (foo, bar)  (foo, baz)
key                        
1             4           6
2            12          14

FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left,1 on the right)

CodePudding user response:

Series can be passed to groupby directly and the grouper will do appropriate index alignment, so it is possible to do:

A.groupby(B['key']).sum()
    foo    
    bar baz
key        
1     4   6
2    12  14

Setup:

import numpy as np
import pandas as pd

idx = ['s1_a', 's1_b', 's2_a', 's2_b']
A = pd.DataFrame(
    np.arange(1, 9).reshape((-1, 2)),
    index=idx,
    columns=pd.MultiIndex.from_product([['foo'], ['bar', 'baz']])
)

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=idx)

Note this will work in many cases but is less durable than merging:

B variant 1:

B = pd.DataFrame({'key': [1, 2, 2]}, index=['s1_a', 's1_b', 's2_b'])
      key
s1_a    1  # No s2_a
s1_b    2
s2_b    2


A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1.0   1   2
2.0  10  12

B variant 2:

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=['s1_a', 's2_a', 's1_b', 's2_b'])
      key
s1_a    1
s2_a    1  # s1_a/s2_a together
s1_b    2
s2_b    2  # s1_b/s2_b together

A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1     6   8
2    10  12
  • Related