For example I have this kind of dataframe
df = pd.DataFrame({
'A': [True, True, False, True, False],
'B': [False, False, True, True, True],
'C': [False, True, True, True, True],
'D': [False, False, False, True, False],
'E': [False, True, True, False, False]})
A B C D E
0 True False False False False
1 True False True False True
2 False True True False True
3 True True True True False
4 False True True False False
I want to create a table that calculate the percentage of subset between two column that has True value based on this formula
(x,y) = countIsAllTrue( col_x & col_y ) / countIsTrue(col_x)
in this example, the output should be like this (I put the calculation process here just for explanation how to use the formula that I talk about)
A B C D E
A 3/3=1 1/3 2/3 1/3 1/3
B 1/3 3/3=1 3/3 1/3 1/3
C 2/4 3/4 4/4=1 1/4 2/4
D 1/1=1 1/1=1 1/1=1 1/1=1 0/1=0
E 1/2 1/2 2/2=1 0/2=0 2/2=1
Anyone can help me to apply this formula?
Note: I used boolean because I thought it is better than using 1/0, and can be counted or sum as 1/0, or is it actually doesnt really matter of which one I used?
CodePudding user response:
Let us do dot
product
s = df.astype(int)
(s.T @ s).div(s.sum(), axis=0)
A B C D E
A 1.000000 0.333333 0.666667 0.333333 0.333333
B 0.333333 1.000000 1.000000 0.333333 0.333333
C 0.500000 0.750000 1.000000 0.250000 0.500000
D 1.000000 1.000000 1.000000 1.000000 0.000000
E 0.500000 0.500000 1.000000 0.000000 1.000000