Pandas: Concise way of applying different functions across a multiindex column

I have a multi-index dataframe. I want to create a new column whose value is a function of other columns. The problem is that the function is different for a small number of levels.

To do this, I currently have to define the calculation manually for every leaf level in the hierarchical dataset. This is undesirable because most of the levels use the same calculation.

Here is an example of what I am doing and how I have currently done it. NB: The data and functions are contrived for simplicity; the actual use case is far more unwieldy.

import pandas as pd
from io import StringIO

testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""

df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])
print('Starting Point:')
print(df)

df = df.unstack('level2')
print('Unstacked Version allowing me to define a different function for each level.')
print(df)

# This is the bit I'd like to make simpler. Imagine there were 20 of these child levels and only
# the last 2 were special cases.
df[('derived', 'child1')] = df[('value1', 'child1')] + df[('value2', 'child1')]
df[('derived', 'child2')] = df[('value1', 'child2')] + df[('value2', 'child2')]
df[('derived', 'child3')] = df[('value1', 'child3')] + df[('value2', 'child3')]
df[('derived', 'child4')] = 0.0
df[('derived', 'child5')] = df[('value1', 'child5')] * df[('value2', 'child5')]

print('Desired outcome:')
df = df.stack()
print(df)

Output:

Starting Point:
               value1  value2
level1 level2
root1  child1      10      20
       child2      30      40
       child3      50      60
       child4      70      80
       child5      90     100
Unstacked Version allowing me to define a different function for each level.
       value1                             value2
level2 child1 child2 child3 child4 child5 child1 child2 child3 child4 child5
level1
root1      10     30     50     70     90     20     40     60     80    100
Desired outcome:
               value1  value2  derived
level1 level2
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0

CodePudding user response：

Since "only the last 2 were special cases", you can reset the index, perform vectorized computations on positional slices, and then restore the index:

df = df.reset_index()
df.loc[df.index[:-2], 'derived'] = df['value1'] + df['value2']
df.loc[df.index[-2], 'derived'] = 0
df.loc[df.index[-1], 'derived'] = df.loc[df.index[-1], 'value1'] * df.loc[df.index[-1], 'value2']
df.set_index(['level1', 'level2'], inplace=True)

print(df)

               value1  value2  derived
level1 level2                         
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
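
The positional slices above rely on the special cases being the last two rows. As a hedged variation, the same assignments could be keyed on the level2 labels instead, which also avoids resetting the index. The root2 rows in the sample data below are an assumption, added only to show that the label masks apply across several roots:

```python
import pandas as pd
from io import StringIO

# Subset of the question's data; the root2 rows are hypothetical additions.
testdata = """level1,level2,value1,value2
root1,child1,10,20
root1,child4,70,80
root1,child5,90,100
root2,child1,1,2
root2,child4,3,4
root2,child5,5,6
"""
df = pd.read_csv(StringIO(testdata), index_col=[0, 1])

# Boolean masks are built from the level2 labels, not from row positions.
level2 = df.index.get_level_values("level2")

# Apply the common rule to every row, then overwrite the special cases.
df["derived"] = df["value1"] + df["value2"]
df.loc[level2 == "child4", "derived"] = 0
df.loc[level2 == "child5", "derived"] = df["value1"] * df["value2"]

print(df)
```

Because the right-hand side of the last assignment is aligned on the index, only the child5 rows receive the product.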

CodePudding user response：

We can use the original df without stacking:

import numpy as np
import pandas as pd
from io import StringIO

testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""

df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])

level2 = df.index.get_level_values('level2')

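# Conditions are checked in order; the first one that matches a row wins.
cond = [level2 == 'child5', level2 == 'child4']

# Choices paired with the conditions: row-wise product for child5, 0 for child4.
result = [df.prod(axis=1), 0]

# Rows matching neither condition fall back to the default (the row-wise sum).
derived = np.select(cond, result, default=df.sum(axis=1))

# assign returns a new DataFrame, so keep the result.
df = df.assign(derived=derived)
print(df)

               value1  value2  derived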
level1 level2                         
root1  child1      10      20       30
       child2      30      40       70
       child3      50      60      110
       child4      70      80        0
       child5      90     100     9000
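
For readers unfamiliar with np.select: it walks the condition list in order, takes the value from the first choice whose condition is true, and uses default wherever nothing matches. A minimal sketch on plain arrays follows; the array names (labels, v1, v2) are illustrative and not part of the answer above:

```python
import numpy as np

# Illustrative arrays standing in for the level2 labels and the two value columns.
labels = np.array(["child1", "child4", "child5", "child2"])
v1 = np.array([10, 70, 90, 30])
v2 = np.array([20, 80, 100, 40])

conditions = [labels == "child5", labels == "child4"]
choices = [v1 * v2, np.zeros_like(v1)]

# Elements where no condition holds take the default (the sum here).
derived = np.select(conditions, choices, default=v1 + v2)

print(derived.tolist())  # [30, 0, 9000, 70]
```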

CodePudding user response：

Using custom functions and a lambda:

def func1(cols):
    return cols["value1"] + cols["value2"]
    
def func2(cols):
    return 0.0
    
def func3(cols):
    return cols["value1"] * cols["value2"]

df["derived"] = df.apply(lambda cols: func1(cols) if cols.name[1] != "child4" 
                            and cols.name[1] != "child5" else (func2(cols) 
                            if cols.name[1] == "child4" 
                            else func3(cols)), axis=1)
print(df)

You can also choose to simplify the lambda function using a pre-defined dictionary:

funcs = {"child1": func1, "child2": func1, "child3": func1, "child4": func2, "child5": func3}
df["derived"] = df.apply(lambda cols: funcs[cols.name[1]](cols), axis=1)
print(df)

               value1  value2  derived
level1 level2                         
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
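
With 20 child levels, listing every label in the dictionary would bring back the repetition the question wants to avoid. One possible variation, sketched here on the assumption that the sum is the common case, is to store only the exceptions and fall back to a default through dict.get. The names default_func and special_funcs are hypothetical:

```python
import pandas as pd
from io import StringIO

testdata = """level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""
df = pd.read_csv(StringIO(testdata), index_col=[0, 1])

def default_func(row):
    # Common case shared by most child levels.
    return row["value1"] + row["value2"]

# Only the exceptional levels are listed (hypothetical helper mapping).
special_funcs = {
    "child4": lambda row: 0.0,
    "child5": lambda row: row["value1"] * row["value2"],
}

# row.name is the (level1, level2) index tuple, so row.name[1] is the level2 label.
df["derived"] = df.apply(
    lambda row: special_funcs.get(row.name[1], default_func)(row), axis=1
)
print(df)
```

Row-wise apply calls a Python function once per row, so on large frames a vectorized approach is likely to be faster.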