Computing probability for each row in a dataframe

Time:11-30

Suppose we have the following dataframe and would like to compute, for each value of B, the probability (relative frequency) of each value of C.

import pandas as pd

data = pd.DataFrame({
    'id_': [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'A': [1608, 1608, 2089, 213, 1005, 1887, 2089, 4544, 6866, 2020, 2020],
    'B': [1772, 1772, 1608, 1608, 1790, 1790, 1791, 1791, 1772, 1799, 1799],
    'C': [1772, 1608, 1005, 1791, 4544, 2020, 1791, 1772, 1799, 2020, 213],
})

I ran pd.crosstab to count how often each value of C occurs together with each value of B:

df = pd.crosstab(data['B'], data['C'])
print(df)

C     213   1005  1608  1772  1791  1799  2020  4544
B                                                   
1608     0     1     0     0     1     0     0     0
1772     0     0     1     1     0     1     0     0
1790     0     0     0     0     0     0     1     1
1791     0     0     0     1     1     0     0     0
1799     1     0     0     0     0     0     1     0

Now I would like to divide each element by its row total, so that every row becomes a set of probabilities and the output looks as follows:

        213   1005  1608  1772  1791  1799  2020  4544                                                  
1608     0    0.5    0     0     0.5   0     0     0
1772     0     0     0.33  0.33  0     0.33  0     0
1790     0     0     0     0     0     0     0.5   0.5
1791     0     0     0     0.5   0.5   0     0     0
1799     0.5   0     0     0     0     0     0.5   0

I have tried the following:

prob = [i/sum(i) for i in range(df)]

and I got this error:

TypeError: 'DataFrame' object cannot be interpreted as an integer

I read about the error here: why-does-dataframe-object-cannot-be-interpreted-as-an-integer. I tried following the advice, but it didn't work. I also read another solution here: Compute percentage for each row in pandas, which applies

df.iloc[:, 1:].apply(lambda x: x / x.sum())

but the probabilities I got are not accurate.

If there is another way to get the probabilities without crosstab, that would also be helpful.

Answer:

Pass normalize='index' to pd.crosstab so that each row is divided by its own total. To display the result as percentages:

pd.crosstab(data.B,data.C, normalize='index').round(4)*100

which gives:

C     213   1005   1608   1772  1791   1799  2020  4544
B                                                      
1608   0.0  50.0   0.00   0.00  50.0   0.00   0.0   0.0
1772   0.0   0.0  33.33  33.33   0.0  33.33   0.0   0.0
1790   0.0   0.0   0.00   0.00   0.0   0.00  50.0  50.0
1791   0.0   0.0   0.00  50.00  50.0   0.00   0.0   0.0
1799  50.0   0.0   0.00   0.00   0.0   0.00  50.0   0.0

or, to keep the values as probabilities rounded to two decimals:

print(pd.crosstab(data.B,data.C, normalize='index').round(2))

which is:

C     213   1005  1608  1772  1791  1799  2020  4544
B                                                   
1608   0.0   0.5  0.00  0.00   0.5  0.00   0.0   0.0
1772   0.0   0.0  0.33  0.33   0.0  0.33   0.0   0.0
1790   0.0   0.0  0.00  0.00   0.0  0.00   0.5   0.5
1791   0.0   0.0  0.00  0.50   0.5  0.00   0.0   0.0
1799   0.5   0.0  0.00  0.00   0.0  0.00   0.5   0.0
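
The question also asks whether this can be done without crosstab, and why the earlier attempt gave inaccurate numbers. On the second point: DataFrame.apply works column by column by default (axis=0), so x / x.sum() divides by column totals rather than row totals, and iloc[:, 1:] additionally drops the first column (213). Below is a minimal sketch of two alternatives, reusing only columns B and C from the question. The variable names counts, prob_from_counts and prob_groupby are illustrative and not part of the original answer.

```python
import pandas as pd

# Same B and C columns as in the question.
data = pd.DataFrame({
    'B': [1772, 1772, 1608, 1608, 1790, 1790, 1791, 1791, 1772, 1799, 1799],
    'C': [1772, 1608, 1005, 1791, 4544, 2020, 1791, 1772, 1799, 2020, 213],
})

# Option 1: normalize an existing count table by hand.
counts = pd.crosstab(data['B'], data['C'])
# Divide every row by its own total; axis=0 aligns the row sums on the index.
prob_from_counts = counts.div(counts.sum(axis=1), axis=0)

# Option 2: no crosstab at all.
prob_groupby = (
    data.groupby('B')['C']
        .value_counts(normalize=True)  # relative frequency of C within each B
        .unstack(fill_value=0)         # spread the C values out into columns
)

print(prob_from_counts.round(2))
print(prob_groupby.round(2))
```

Both tables should hold the same row-wise probabilities as crosstab with normalize='index'. Option 1 is convenient when the count table already exists, while Option 2 skips the intermediate counts entirely.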