Apply function for multiple levels of tables/data-CodePudding

I get a problem in my work. I have tables:

import pandas as pd
import numpy as np

level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
columns=['name', 'value'])

   name value
0   a   3
1   b   x
2   c   x

I want to sum the value column, but it contains “x”s. So I will have to use the second table to calculate “x”s :

level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'], ['b1', 'b2', 'c1', 'c2', 'c3'], [5, 7, 2, 'x', 9])), 
columns=['name', 'sub', 'value'])

  name  sub value
0   b   b1  5
1   b   b2  7
2   c   c1  2
3   c   c2  x
4   c   c3  9

I should sum the b1, b2 to give “x” for b in level1 table (x=12). But for c, there is “x”, so a third level table:

level3 = pd.DataFrame(list(zip(['c', 'c', 'c'], ['c1', 'c2', 'c3'], [2, 4, 9])), 
columns=['name', 'sub', 'value'])

  name  sub value
0   c   c1  2
1   c   c2  4
2   c   c3  9

Now, we can get the sum value for value column in level1 table.

My question is: can we use a function to calculate it easily? If there are more levels, how can we loop them till no “x”?

It is OK to combine level2 and level3.

CodePudding user response：

One option is a combination of merge(multiple merges actually) and a groupby:

(level2
.merge(level3, on = ['name', 'sub'], how = 'left', suffixes = (None, '_y'))
.assign(value = lambda df: np.where(df.value.eq('x'), df.value_y, df.value))
.groupby('name', as_index = False)
.value
.sum()
.merge(level1, on = 'name', how = 'right', suffixes = ('_x',None))
.assign(value = lambda df: np.where(df.value.eq('x'), df.value_x, df.value))
.loc[:, ['name', 'value']]
)

  name value
0    a     3
1    b  12.0
2    c  15.0

CodePudding user response：

Here's a way using combine_first and replace:

from functools import reduce
l1 = level1.assign(sub=level1['name'] '1').replace('x', np.nan).set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan).set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan).set_index(['name', 'sub'])

reduce(lambda x, y: x.combine_first(y), [l3,l2,l1]).groupby(level=0).sum()

Output:

      value
name       
a       3.0
b      12.0
c      15.0

Complete example:

import pandas as pd
import numpy as np

level1 = pd.DataFrame(list(zip(['a', 'b', 'c'], [3, 'x', 'x'])),
                      columns=['name', 'value'])

level2 = pd.DataFrame(list(zip(['b', 'b', 'c', 'c', 'c'], 
                               ['b1', 'b2', 'c1', 'c2', 'c3'], 
                               [5, 7, 2, 'x', 9])), 
                      columns=['name', 'sub', 'value'])

level3 = pd.DataFrame(list(zip(['c', 'c', 'c'], 
                               ['c1', 'c2', 'c3'], 
                               [2, 4, 9])),
                      columns=['name', 'sub', 'value'])

from functools import reduce
l1 = level1.assign(sub=level1['name'] '1')\
           .replace('x', np.nan)\
           .set_index(['name', 'sub'])
l2 = level2.replace('x', np.nan)\
           .set_index(['name', 'sub'])
l3 = level3.replace('x', np.nan)\
           .set_index(['name', 'sub'])

out = reduce(lambda x, y: x.combine_first(y), 
             [l3,l2,l1]).groupby(level=0).sum()
print(out)