pd.dataframe: how to count values of a variable and find probabilities-CodePudding

Here is my data:

df1 = pd.DataFrame()
df1['a1'] = ['ABC','ACC','BCC','ABC','ABC','ACC','BCC']
df1['b1'] = ['ACC','AAC','BAC','ACC','ACC','AAC','BAC']
df1['group'] = ['A1','A2','A1','A3','A2','A1','A1']
df1['names'] = ['n1','n2','n3','n4','n1','n3','n3']

df2 = pd.DataFrame()
df2['a2'] = ['ACC','BCC','ABC']
df2['b2'] = ['AAC','BAC','ACC']
df2['types'] = ['t1','t2','t3']

DF = pd.merge(df1, df2, left_on=['a1','b1'], right_on=['a2','b2'])

>>> DF.sort_values('group')
    a1   b1 group names   a2   b2 types
0  ABC  ACC    A1    n1  ABC  ACC    t3
4  ACC  AAC    A1    n3  ACC  AAC    t1
5  BCC  BAC    A1    n3  BCC  BAC    t2
6  BCC  BAC    A1    n3  BCC  BAC    t2
2  ABC  ACC    A2    n1  ABC  ACC    t3
3  ACC  AAC    A2    n2  ACC  AAC    t1
1  ABC  ACC    A3    n4  ABC  ACC    t3

I want to calculate the probability of each type occurring among the total occurring times (nrow of df) on each name and then, sum them for each group.

For example, for group A1:

for n1: 
P_1 = P(t1_n1) P(t2_n1) P(t3_n1) = 0 0 1/7 = 1/7
for n2: 
P_2 = P(t1_n2) P(t2_n2) P(t3_n2) = 0
for n3: 
P_3 = P(t1_n3) P(t2_n3) P(t3_n3) = 1/7 0 2/7 = 3/7
for n4:
P_4 = P(t1_n4) P(t2_n4) P(t3_n4) = 0 

P_total = P_1 P_2 P_3 P_4

Expected output:

   groups   P_n1   P_n2  P_n3   P_n4  P_total
0  A1        1/7     0     3/7    0   4/7
1  A2        ....
2  A3        
3  A4

how can I accomplish my goal without many loop functions, in an elegant way? Thanks

CodePudding user response：

You can use pd.crosstab with normalize=True :

pd.crosstab(DF['group'],DF['names'],normalize=True)

names        n1        n2        n3        n4
group                                        
A1     0.142857  0.000000  0.428571  0.000000
A2     0.142857  0.142857  0.000000  0.000000
A3     0.000000  0.000000  0.000000  0.142857

To get you the total etc:

pd.crosstab(DF['group'],DF['names'],normalize=True)\
.assign(total = lambda x : x.sum(axis=1)).reset_index()

names group        n1        n2        n3        n4     total
0        A1  0.142857  0.000000  0.428571  0.000000  0.571429
1        A2  0.142857  0.142857  0.000000  0.000000  0.285714
2        A3  0.000000  0.000000  0.000000  0.142857  0.142857

CodePudding user response：

The computation of the Xjs in your example is confusing, but based on the following description

I want to calculate the probability of each type happening on each name and sum them for each group.

it seems that you want

prob_types_by_name = (
    DF.groupby('names')['types']       # compute the probabilities of each 
      .value_counts(normalize=True)    # type happening on each name group  
      .rename('prob_types')    # name the resulting Series as 'prob_types'
)

res = (
    DF.merge(prob_types_by_name,                # merge the result to 
             on=['names','types'], how='left')  # the original DF
      .groupby('group')['prob_types']    # sum the 'prob_types'  
      .sum()                             # per 'group' 
)

Input

>>> DF

    a1   b1 group names   a2   b2 types
0  ABC  ACC    A1    n1  ABC  ACC    t3
4  ACC  AAC    A1    n3  ACC  AAC    t1
5  BCC  BAC    A1    n3  BCC  BAC    t2
6  BCC  BAC    A1    n3  BCC  BAC    t2
2  ABC  ACC    A2    n1  ABC  ACC    t3
3  ACC  AAC    A2    n2  ACC  AAC    t1
1  ABC  ACC    A3    n4  ABC  ACC    t3

Output

>>> res

group
A1    2.666667
A2    2.000000
A3    1.000000
Name: prob_types, dtype: float64