I have this dataframe
my_data = {"A" : ['x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z'],
"B" : ['i', 'i', 'j', 'j', 'j', 'k', 'k', 'k', 'k'],
"C" : [0, 1, 1, 2, 0, 3, 3, 3, 0]}
pd.DataFrame(my_data)
A and B are independent groups and C is a category variable, I would like to find the percentage of C by group A and B.
There are so many categories of C, so I try to solve without using dummy variable.
like this
result = {"A" : ['x', 'x', 'y', 'y', 'z'],
"B" : ['i', 'j', 'j', 'k', 'k'],
"0" : [1/2, 0, 1/2, 0, 1/2],
"1" : [1/2, 1, 0, 0, 0],
"2" : [0, 0, 1/2, 0, 0],
"3" : [0, 0, 0, 1, 1/2]}
pd.DataFrame(result)
If there are no overlapping groups (e.g., A = x, B = k), they should not be included in the result table.
I solve this problem for one group, but I stucked in two independent group..
CodePudding user response:
You can use GroupBy.value_counts
with normalize=True
, and reshaping:
(df
.groupby(['A', 'B'])['C']
.value_counts(normalize=True)
.unstack('C', fill_value=0)
.reset_index()
)
output:
C A B 0 1 2 3
0 x i 0.5 0.5 0.0 0.0
1 x j 0.0 1.0 0.0 0.0
2 y j 0.5 0.0 0.5 0.0
3 y k 0.0 0.0 0.0 1.0
4 z k 0.5 0.0 0.0 0.5
CodePudding user response:
You can use pd.crosstab
, which is just a wrapper for groupby:
pd.crosstab([df['A'], df['B']], df['C'], normalize='index')
Output:
C 0 1 2 3
A B
x i 0.5 0.5 0.0 0.0
j 0.0 1.0 0.0 0.0
y j 0.5 0.0 0.5 0.0
k 0.0 0.0 0.0 1.0
z k 0.5 0.0 0.0 0.5