How to get the percentage of categorical variables by group?-CodePudding

I have this dataframe

my_data = {"A" : ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z'],
           "B" : ['i', 'i', 'j', 'j', 'j', 'k', 'k', 'k', 'k'],
           "C" : [0, 1, 1, 2, 0, 3, 3, 3, 0]}
pd.DataFrame(my_data)

A and B are independent groups and C is a category variable, I would like to find the percentage of C by group A and B.

There are so many categories of C, so I try to solve without using dummy variable.

like this

result = {"A" : ['x', 'x', 'y', 'y', 'z'],
          "B" : ['i', 'j', 'j', 'k', 'k'],
          "0" : [1/2, 0, 1/2, 0, 1/2],
          "1" : [1/2, 1, 0, 0, 0],
          "2" : [0, 0, 1/2, 0, 0],
          "3" : [0, 0, 0, 1, 1/2]}
pd.DataFrame(result)

If there are no overlapping groups (e.g., A = x, B = k), they should not be included in the result table.

I solve this problem for one group, but I stucked in two independent group..

CodePudding user response：

You can use GroupBy.value_counts with normalize=True, and reshaping:

(df
 .groupby(['A', 'B'])['C']
 .value_counts(normalize=True)
 .unstack('C', fill_value=0)
 .reset_index()
)

output:

C  A  B    0    1    2    3
0  x  i  0.5  0.5  0.0  0.0
1  x  j  0.0  1.0  0.0  0.0
2  y  j  0.5  0.0  0.5  0.0
3  y  k  0.0  0.0  0.0  1.0
4  z  k  0.5  0.0  0.0  0.5

CodePudding user response：

You can use pd.crosstab, which is just a wrapper for groupby:

pd.crosstab([df['A'], df['B']], df['C'], normalize='index')

Output:

C      0    1    2    3
A B                    
x i  0.5  0.5  0.0  0.0
  j  0.0  1.0  0.0  0.0
y j  0.5  0.0  0.5  0.0
  k  0.0  0.0  0.0  1.0
z k  0.5  0.0  0.0  0.5