I have a multi-index dataframe. I want to create a new column whose value is a function of other columns. The problem is that the function is different for a small number of levels.
To do this, I currently have to define the calculation manually for every leaf level in the hierarchical dataset. This is undesirable because most of the levels use the same calculation.
Here is an example of what I want to do and how I currently do it. NB: The data and functions are contrived for simplicity; the actual use case is far more unwieldy.
import pandas as pd
from io import StringIO
testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""
df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])
print('Starting Point:')
print(df)
df = df.unstack('level2')
print('Unstacked Version allowing me to define a different function for each level.')
print(df)
# This is the bit I'd like to make simpler. Imagine there were 20 of these child levels and only
# the last 2 were special cases.
df[('derived', 'child1')] = df[('value1', 'child1')] + df[('value2', 'child1')]
df[('derived', 'child2')] = df[('value1', 'child2')] + df[('value2', 'child2')]
df[('derived', 'child3')] = df[('value1', 'child3')] + df[('value2', 'child3')]
df[('derived', 'child4')] = 0.0
df[('derived', 'child5')] = df[('value1', 'child5')] * df[('value2', 'child5')]
print('Desired outcome:')
df = df.stack()
print(df)
Output:
Starting Point:
               value1  value2
level1 level2
root1  child1      10      20
       child2      30      40
       child3      50      60
       child4      70      80
       child5      90     100
Unstacked Version allowing me to define a different function for each level.
       value1                             value2
level2 child1 child2 child3 child4 child5 child1 child2 child3 child4 child5
level1
root1      10     30     50     70     90     20     40     60     80    100
Desired outcome:
               value1  value2  derived
level1 level2
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
CodePudding user response:
Since "only the last 2 were special cases", you can reset the index, perform vectorized computations on positional slices, and then restore the index:
df = df.reset_index()
df.loc[df.index[:-2], 'derived'] = df['value1'] + df['value2']
df.loc[df.index[-2], 'derived'] = 0
df.loc[df.index[-1], 'derived'] = df.loc[df.index[-1], 'value1'] * df.loc[df.index[-1], 'value2']
df.set_index(['level1', 'level2'], inplace=True)
print(df)
               value1  value2  derived
level1 level2
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
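The positional slices above only work while the special cases are the last two rows. As a hedged alternative, here is a minimal sketch that selects the special rows by their `level2` label instead, so row order no longer matters. The variable names `level2` and `is_child5` are illustrative, and the data is the sample from the question.

```python
from io import StringIO

import pandas as pd

testdata = """level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""
df = pd.read_csv(StringIO(testdata), index_col=[0, 1])

# Labels of the second index level, one entry per row
level2 = df.index.get_level_values("level2")

# Apply the common rule to every row first
df["derived"] = (df["value1"] + df["value2"]).astype(float)

# Then overwrite only the special cases, selected by label
df.loc[level2 == "child4", "derived"] = 0.0
is_child5 = level2 == "child5"
df.loc[is_child5, "derived"] = (
    df.loc[is_child5, "value1"] * df.loc[is_child5, "value2"]
)

print(df)
```

Because the index is never reset, there is no need to call `set_index` afterwards.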
CodePudding user response:
We can use the original df without stacking:
import numpy as np
import pandas as pd
from io import StringIO
testdata = """
level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""
df = pd.read_csv(StringIO(testdata), index_col=[0,1], header=[0])
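# Labels of the second index level, one per row
level2 = df.index.get_level_values('level2')
# Special cases; rows matching neither condition fall through to the default
cond = [level2 == 'child5', level2 == 'child4']
result = [df.prod(axis=1), 0]
derived = np.select(cond, result, default=df.sum(axis=1))
df.assign(derived=derived)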
               value1  value2  derived
level1 level2
root1  child1      10      20       30
       child2      30      40       70
       child3      50      60      110
       child4      70      80        0
       child5      90     100     9000
CodePudding user response:
Using custom functions and a lambda:
def func1(cols):
    # Default rule: sum of the two value columns
    return cols["value1"] + cols["value2"]

def func2(cols):
    # Special case for child4
    return 0.0

def func3(cols):
    # Special case for child5
    return cols["value1"] * cols["value2"]

# cols.name is the row's index tuple (level1, level2)
df["derived"] = df.apply(
    lambda cols: func1(cols) if cols.name[1] not in ("child4", "child5")
    else (func2(cols) if cols.name[1] == "child4" else func3(cols)),
    axis=1,
)
print(df)
You can also choose to simplify the lambda function using a pre-defined dictionary:
funcs = {"child1": func1, "child2": func1, "child3": func1, "child4": func2, "child5": func3}
df["derived"] = df.apply(lambda cols: funcs[cols.name[1]](cols), axis=1)
print(df)
               value1  value2  derived
level1 level2
root1  child1      10      20     30.0
       child2      30      40     70.0
       child3      50      60    110.0
       child4      70      80      0.0
       child5      90     100   9000.0
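The dictionary above still lists every child level, which is what the question wanted to avoid. A possible variation, sketched here under the assumption that all unlisted levels share one default rule, is to register only the special cases and fall back to the default with `dict.get`. The names `add_values`, `zero`, `multiply_values`, and `special_cases` are illustrative, not from the original answer.

```python
from io import StringIO

import pandas as pd

testdata = """level1,level2,value1,value2
root1,child1,10,20
root1,child2,30,40
root1,child3,50,60
root1,child4,70,80
root1,child5,90,100
"""
df = pd.read_csv(StringIO(testdata), index_col=[0, 1])


def add_values(row):
    # Default rule used by every level that is not listed below
    return float(row["value1"] + row["value2"])


def zero(row):
    return 0.0


def multiply_values(row):
    return float(row["value1"] * row["value2"])


# Only the exceptions are listed; everything else uses add_values
special_cases = {"child4": zero, "child5": multiply_values}

# row.name is the index tuple (level1, level2) for each row
df["derived"] = df.apply(
    lambda row: special_cases.get(row.name[1], add_values)(row),
    axis=1,
)

print(df)
```

With 20 child levels and two exceptions, the dictionary would still hold only two entries. Note that `apply` with `axis=1` runs row by row, so for large frames a vectorized approach is likely to be faster.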