I have this dataframe:
import pandas as pd
df = pd.DataFrame({"ID": ["123", "456", "123", "456", "123", "456", "456", "456", "456", "123"],
"number": [2,1,5,2,4,5,2,1,6,7],
"name": ["apple", "orange", "orange", "pear", "pear", "apple", "lemon", "apple", "lemon", "apple"]
})
ID number name
0 123 2 apple
1 456 1 orange
2 123 5 orange
3 456 2 pear
4 123 4 pear
5 456 5 apple
6 456 2 lemon
7 456 1 apple
8 456 6 lemon
9 123 7 apple
I want to calculate percentage share for each element i name column, grouped by ID.
Final look of the DataFrame should have these columns:
ID, number, apple, orange, pear, lemon
So, values from name column are becoming column names. Values that those columns should have are percentage share, for example:
for ID 123 I have these number values: 2 5 4 7 = 18
So
apple has: (2 7) / 18 = 0.50
orange has: 5 / 18 = 0.27
pear has: 4 / 18 = 0.23
lemon has: 0
ID apple orange pear lemon
123 0.5 0.27 0.23 0
CodePudding user response:
You can use a crosstab
with normalize='index'
:
out = pd.crosstab(df['ID'], df['name'], values=df['number'],
aggfunc='sum', normalize='index')
output:
name apple lemon orange pear
ID
123 0.500000 0.000000 0.277778 0.222222
456 0.352941 0.470588 0.058824 0.117647
CodePudding user response:
Use GroupBy.transform
with Series.div
by different groups:
df['Perc'] = (df.groupby(['ID','name'])['number'].transform('sum')
.div(df.groupby('ID')['number'].transform('sum')))
print (df)
ID number name Perc
0 123 2 apple 0.500000
1 456 1 orange 0.058824
2 123 5 orange 0.277778
3 456 2 pear 0.117647
4 123 4 pear 0.222222
5 456 5 apple 0.352941
6 456 2 lemon 0.470588
7 456 1 apple 0.352941
8 456 6 lemon 0.470588
9 123 7 apple 0.500000
If need another ouput use DataFrame.pivot_table
with division by sum
:
df = df.pivot_table(index='ID',
columns='name',
values='number',
aggfunc='sum',
fill_value=0)
df = df.div(df.sum(axis=1), axis=0)
print (df)
name apple lemon orange pear
ID
123 0.500000 0.000000 0.277778 0.222222
456 0.352941 0.470588 0.058824 0.117647