Home > Software design >  Operation on MultiIndex Dataframe with condition
Operation on MultiIndex Dataframe with condition

Time:11-11

I have a Pandas MultiIndex Dataframe in Python with two levels for the index and for the columns that look like this:

miind = pd.MultiIndex.from_product([['A1','A2'],['B1','B2','B3']])
micol = pd.MultiIndex.from_product([['X1','X2'],['Y1','Y2','Y3']])
df = pd.DataFrame((np.arange(len(miind)*len(micol)) % 5).reshape(len(miind),len(micol)),
    index=miind, columns=micol)
print(df)
      X1       X2      
      Y1 Y2 Y3 Y1 Y2 Y3
A1 B1  0  1  2  3  4  0
   B2  1  2  3  4  0  1
   B3  2  3  4  0  1  2
A2 B1  3  4  0  1  2  3
   B2  4  0  1  2  3  4
   B3  0  1  2  3  4  0

I would like, for each line, to divide Y1 and Y2 by Y3 when Y3 is not 0. I don't know how to combine the condition Y3>0 with the element selection.

What is the best way to do this? np.where(), a mask, or simply indexing? I access Y3 as follows:

idx = pd.IndexSlice
print(df.loc[idx[:,:],idx[:,'Y3']] > 0)
          X1     X2
          Y3     Y3
A1 B1   True  False
   B2   True   True
   B3   True   True
A2 B1  False   True
   B2   True   True
   B3   True  False

Edit:

This is what I want, using for loops:

A = ['A1','A2']
B = ['B1','B2','B3']
X = ['X1','X2']
Y = ['Y1','Y2','Y3']
miind = pd.MultiIndex.from_product([A,B])
micol = pd.MultiIndex.from_product([X,Y])
df = pd.DataFrame((np.arange(len(miind)*len(micol)) % 5).reshape(len(miind),len(micol)), 
                  index=miind, columns=micol)
for i, a in enumerate(A):
    df1 =  df.loc[a]
    for j,b in enumerate(B):
        df2 = df1.loc[b]
        for k,x in enumerate(X):
            s1 = df2.loc[x]
            if s1['Y3'] > 0:
                df.loc[idx[a,b],idx[x,'Y1']] /= s1['Y3']
                df.loc[idx[a,b],idx[x,'Y2']] /= s1['Y3']
print(df)
             X1                     X2             
             Y1        Y2 Y3        Y1        Y2 Y3
A1 B1  0.000000  0.500000  2  3.000000  4.000000  0
   B2  0.333333  0.666667  3  4.000000  0.000000  1
   B3  0.500000  0.750000  4  0.000000  0.500000  2
A2 B1  3.000000  4.000000  0  0.333333  0.666667  3
   B2  4.000000  0.000000  1  0.500000  0.750000  4
   B3  0.000000  0.500000  2  3.000000  4.000000  0

However, this solution is not elegant and does probably not scale well for bigger DataFrames...

CodePudding user response:

You could stack and unstack your dataframe:

# stack the dataframe
tmp = df.stack(level=0)

# divide the columns of the stacked dataframe
tmp.loc[tmp['Y3']!= 0, 'Y1'] /= tmp.loc[tmp['Y3']!= 0, 'Y3']
tmp.loc[tmp['Y3']!= 0, 'Y2'] /= tmp.loc[tmp['Y3']!= 0, 'Y3']

# unstack the divided dataframe
tmp = tmp.unstack(level=2)

At this point, we have:

             Y1                  Y2           Y3   
             X1        X2        X1        X2 X1 X2
A1 B1  0.000000  3.000000  0.500000  4.000000  2  0
   B2  0.333333  4.000000  0.666667  0.000000  3  1
   B3  0.500000  0.000000  0.750000  0.500000  4  2
A2 B1  3.000000  0.333333  4.000000  0.666667  0  3
   B2  4.000000  0.500000  0.000000  0.750000  1  4
   B3  0.000000  3.000000  0.500000  4.000000  2  0

Not that bad, the the levels of the columns are not what we want. Let us go on...

# reverse the column levels
tmp.columns = pd.MultiIndex.from_tuples((j,i) for i,j in tmp.columns)

# and sort the columns
result = tmp.sort_index(axis=1)

We now have as expected:

             X1                     X2             
             Y1        Y2 Y3        Y1        Y2 Y3
A1 B1  0.000000  0.500000  2  3.000000  4.000000  0
   B2  0.333333  0.666667  3  4.000000  0.000000  1
   B3  0.500000  0.750000  4  0.000000  0.500000  2
A2 B1  3.000000  4.000000  0  0.333333  0.666667  3
   B2  4.000000  0.000000  1  0.500000  0.750000  4
   B3  0.000000  0.500000  2  3.000000  4.000000  0
  • Related