I have the following dataframe:
d = pd.DataFrame({'UNIQUE_KEY': [1, 2, 3, 4], 'TRANSFORMATION': ['P', 'D', 'N', 'P'],
'DIM_1': ['Y', 'N', 'N', 'Y'], 'DIM_2': ['N', 'N', 'N', 'Y'], 'DIM_3': ['Y', 'Y', 'N', 'Y']})
UNIQUE_KEY TRANSFORMATION DIM_1 DIM_2 DIM_3
0 1 P Y N Y
1 2 D N N Y
2 3 N N N N
3 4 P Y Y Y
I want to perform several groupby
and aggregate operations in order to get the following output dataframe:
DIM DIM_VALUE TTL_CASES % CASES % D % N % P
0 DIM_1 'Y' 2 50 0 0 100
1 DIM_1 'N' 2 50 50 50 0
2 DIM_2 'Y' 1 25 0 0 100
3 DIM_2 'N' 3 75 33.3 33.3 33.3
4 DIM_3 'Y' 3 75 33.3 0 66.6
5 DIM_3 'N' 1 25 0 100 0
Where
DIM
is a column with each ofDIM_1,2,3
DIM_VALUE
is a grouped column based on the values of eachDIM_1,2,3
TTL_CASES
is a column with the count ofUNIQUE_KEY
grouped byDIM
andDIM_1,2,3
PCT_CASES
is the percentage of each row ofTTL_CASES
%D
,%P
,%N
are the percentages ofTRANSFORMATION
ofUNIQUE_KEY
based on the the grouped byDIM
andDIM_1,2,3
What I have is the following:
P = d.groupby('TRANSFORMATION')['UNIQUE_KEY'].count().reset_index()
P['Percentage'] = 100 * P['UNIQUE_KEY'] / P['UNIQUE_KEY'].sum()
which gives me the percentage of each value in TRANFORMATION
but how do I do this for each dimension and get an output dataframe in the format I want?
Thanks in advance!
CodePudding user response:
Seems like there is no easy way to achieve the desired output, you need to break the problem down into multiple steps: First melt the dataframe to convert DIM like columns to rows, then create two grouper objects, calculate the aggregations using the grouper objects as required and assign the results to output dataframe
s = d.melt(['UNIQUE_KEY', 'TRANSFORMATION'], var_name='DIM', value_name='DIM_VALUE')
g1 = s.groupby(['DIM'])
g2 = s.groupby(['DIM', 'DIM_VALUE'])
out = pd.DataFrame()
out['TTL_CASES'] = g2['UNIQUE_KEY'].count()
out['%_TTL_CASES'] = out['TTL_CASES'].div(g1['UNIQUE_KEY'].count()).mul(100)
out = out.assign(**g2['TRANSFORMATION'].value_counts(normalize=True).mul(100).unstack(fill_value=0).add_prefix('%'))
Result
DIM DIM_VALUE TTL_CASES %_TTL_CASES %D %N %P
0 DIM_1 N 2 50.0 50.000000 50.000000 0.000000
1 DIM_1 Y 2 50.0 0.000000 0.000000 100.000000
2 DIM_2 N 3 75.0 33.333333 33.333333 33.333333
3 DIM_2 Y 1 25.0 0.000000 0.000000 100.000000
4 DIM_3 N 1 25.0 0.000000 100.000000 0.000000
5 DIM_3 Y 3 75.0 33.333333 0.000000 66.666667
CodePudding user response:
Try the following code:
Code:
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame({'UNIQUE_KEY': [1, 2, 3, 4], 'TRANSFORMATION': ['P', 'D', 'N', 'P'], 'DIM_1': ['Y', 'N', 'N', 'Y'], 'DIM_2': ['N', 'N', 'N', 'Y'], 'DIM_3': ['Y', 'Y', 'N', 'Y']})
# Transform df shape
df = df.set_index(['UNIQUE_KEY', 'TRANSFORMATION']).stack().to_frame().reset_index()
df.columns = ['UNIQUE_KEY', 'TRANSFORMATION', 'DIM', 'DIM_VALUE']
# Get aggregated values except PCT_CASES
df = df.groupby(['DIM', 'DIM_VALUE'], as_index=False).agg(
TTL_CASES=pd.NamedAgg('UNIQUE_KEY', 'count'),
PCT_D=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='D').sum() / len(s)),
PCT_N=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='N').sum() / len(s)),
PCT_P=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='P').sum() / len(s)),
).sort_values(['DIM', 'DIM_VALUE'], ascending=[True, False])
# Add PCT_CASES
df.insert(3, 'PCT_CASES', df.groupby('DIM')['TTL_CASES'].transform(lambda s: 100 * s / s.sum()))
Output:
DIM | DIM_VALUE | TTL_CASES | PCT_CASES | PCT_D | PCT_N | PCT_P |
---|---|---|---|---|---|---|
DIM_1 | Y | 2 | 50 | 0 | 0 | 100 |
DIM_1 | N | 2 | 50 | 50 | 50 | 0 |
DIM_2 | Y | 1 | 25 | 0 | 0 | 100 |
DIM_2 | N | 3 | 75 | 33.3333 | 33.3333 | 33.3333 |
DIM_3 | Y | 3 | 75 | 33.3333 | 0 | 66.6667 |
DIM_3 | N | 1 | 25 | 0 | 100 | 0 |