Home > front end >  Creating formula percentage between 2 column in dataframe
Creating formula percentage between 2 column in dataframe

Time:07-11

For example I have this kind of dataframe

df = pd.DataFrame({
    'A': [True, True, False, True, False],
    'B': [False, False, True, True, True],
    'C': [False, True, True, True, True],
    'D': [False, False, False, True, False],
    'E': [False, True, True, False, False]})

    A       B       C       D       E
0   True    False   False   False   False
1   True    False   True    False   True
2   False   True    True    False   True
3   True    True    True    True    False
4   False   True    True    False   False

I want to create a table that calculate the percentage of subset between two column that has True value based on this formula

(x,y) = countIsAllTrue( col_x & col_y ) / countIsTrue(col_x)

in this example, the output should be like this (I put the calculation process here just for explanation how to use the formula that I talk about)

    A       B       C       D       E
A   3/3=1   1/3     2/3     1/3     1/3
B   1/3     3/3=1   3/3     1/3     1/3
C   2/4     3/4     4/4=1   1/4     2/4
D   1/1=1   1/1=1   1/1=1   1/1=1   0/1=0
E   1/2     1/2     2/2=1   0/2=0   2/2=1

Anyone can help me to apply this formula?

Note: I used boolean because I thought it is better than using 1/0, and can be counted or sum as 1/0, or is it actually doesnt really matter of which one I used?

CodePudding user response:

Let us do dot product

s = df.astype(int)
(s.T @ s).div(s.sum(), axis=0)

          A         B         C         D         E
A  1.000000  0.333333  0.666667  0.333333  0.333333
B  0.333333  1.000000  1.000000  0.333333  0.333333
C  0.500000  0.750000  1.000000  0.250000  0.500000
D  1.000000  1.000000  1.000000  1.000000  0.000000
E  0.500000  0.500000  1.000000  0.000000  1.000000
  • Related