Home > Mobile >  Merging dataframes on multi-index AND column
Merging dataframes on multi-index AND column

Time:11-02

TL;DR: Merge two dataframes based on their multi-indices and a column that they share.

The two multi-index dataframes (call them dfA and dfB) do not have unique indices and are of different shapes. However level 0 of the index specifies group, level 1 specifies material for both. In dataframe dfA and dfB, there is a column called SR.

The correct identification would therefore involve both indices and the value of the SR column.

import pandas as pd
import numpy as np
tupA = [('G1','M1')]*3   [('G1','M2')]   [('G2','M3')]*2
indA = pd.MultiIndex.from_tuples(tupA, names = ['Group', 'Material'])

dfA = pd.DataFrame({'SR': [3,5,10, 3,5,15],
                   'ValA': [1,2,1,4,5,6]},
                   index = ind
                  )

tupB = [('G1','M1')]*2   [('G1','M2')]   [('G2','M3')]
indB = pd.MultiIndex.from_tuples(tupB, names = ['Group', 'Material'])

dfB = pd.DataFrame({'SR': [3,5,3,5],
                   'ValB': [2,4,5,6]},
                   index = indB
                  )
print(dfA,'\n', dfB) 

Yields:

                SR  ValA
Group Material          
G1    M1         3     1
      M1         5     2
      M1        10     1
      M2         3     4
G2    M3         5     5
      M3        15     6 
                 SR  ValB
Group Material          
G1    M1         3     2
      M1         5     4
      M2         3     5
G2    M3         5     6

The Task:

To merge the two dataframes based on the multi-index and the column SR. The SR entries in dfB which are not in dfA should be replaced by np.nan.

Desired Output:

The merged dataframe should be something like this:

                SR  ValA  ValB
Group Material                
G1    M1         3     1   2.0
      M1         5     2   4.0
      M1        10     1   NaN
      M2         3     4   5.0
G2    M3         5     5   6.0
      M3        15     6   NaN

It has all the rows of dfA but for those SR values which are not in dfB it has NaN.

Attempt at Solution:

I tried a number of left and outer joins but I can't get the NaN. The documentation does have an example which gives NaN but it's not using multi-indices.

Would appreciate some help.

CodePudding user response:

You can use the merge function with your three keys passed to the on argument:

dfA.merge(dfB,on=["Group","Material","SR"], how="left")

Output :

                 SR     ValA    ValB
Group   Material            
   G1      M1     3     1   2.0
           M1     5     2   4.0
           M1     10    1   NaN
           M2     3     4   5.0
 G2        M3     5     5   6.0
           M3     15    6   NaN
  • Related