Home > Software engineering >  Convert a Pandas Dataframe to specific format
Convert a Pandas Dataframe to specific format

Time:02-15

I have the following dataframe:

d = pd.DataFrame({'UNIQUE_KEY': [1, 2, 3, 4], 'TRANSFORMATION': ['P', 'D', 'N', 'P'], 
     'DIM_1': ['Y', 'N', 'N', 'Y'], 'DIM_2': ['N', 'N', 'N', 'Y'], 'DIM_3': ['Y', 'Y', 'N', 'Y']})


    UNIQUE_KEY  TRANSFORMATION  DIM_1   DIM_2   DIM_3
0   1           P               Y       N       Y
1   2           D               N       N       Y
2   3           N               N       N       N
3   4           P               Y       Y       Y

I want to perform several groupby and aggregate operations in order to get the following output dataframe:

    DIM      DIM_VALUE   TTL_CASES   % CASES   % D    % N    % P 
0   DIM_1    'Y'         2           50        0      0      100
1   DIM_1    'N'         2           50        50     50     0
2   DIM_2    'Y'         1           25        0      0      100
3   DIM_2    'N'         3           75        33.3   33.3   33.3
4   DIM_3    'Y'         3           75        33.3   0      66.6
5   DIM_3    'N'         1           25        0      100    0

Where

  • DIM is a column with each of DIM_1,2,3
  • DIM_VALUE is a grouped column based on the values of each DIM_1,2,3
  • TTL_CASES is a column with the count of UNIQUE_KEY grouped by DIM and DIM_1,2,3
  • PCT_CASES is the percentage of each row of TTL_CASES
  • %D, %P, %N are the percentages of TRANSFORMATION of UNIQUE_KEY based on the the grouped by DIM and DIM_1,2,3

What I have is the following:

P = d.groupby('TRANSFORMATION')['UNIQUE_KEY'].count().reset_index()
P['Percentage'] = 100 * P['UNIQUE_KEY']  / P['UNIQUE_KEY'].sum()

which gives me the percentage of each value in TRANFORMATION but how do I do this for each dimension and get an output dataframe in the format I want?

Thanks in advance!

CodePudding user response:

Seems like there is no easy way to achieve the desired output, you need to break the problem down into multiple steps: First melt the dataframe to convert DIM like columns to rows, then create two grouper objects, calculate the aggregations using the grouper objects as required and assign the results to output dataframe

s = d.melt(['UNIQUE_KEY', 'TRANSFORMATION'], var_name='DIM', value_name='DIM_VALUE')

g1 = s.groupby(['DIM'])
g2 = s.groupby(['DIM', 'DIM_VALUE'])

out = pd.DataFrame()
out['TTL_CASES'] = g2['UNIQUE_KEY'].count()
out['%_TTL_CASES'] = out['TTL_CASES'].div(g1['UNIQUE_KEY'].count()).mul(100)
out = out.assign(**g2['TRANSFORMATION'].value_counts(normalize=True).mul(100).unstack(fill_value=0).add_prefix('%'))

Result

     DIM DIM_VALUE  TTL_CASES  %_TTL_CASES         %D          %N          %P
0  DIM_1         N          2         50.0  50.000000   50.000000    0.000000
1  DIM_1         Y          2         50.0   0.000000    0.000000  100.000000
2  DIM_2         N          3         75.0  33.333333   33.333333   33.333333
3  DIM_2         Y          1         25.0   0.000000    0.000000  100.000000
4  DIM_3         N          1         25.0   0.000000  100.000000    0.000000
5  DIM_3         Y          3         75.0  33.333333    0.000000   66.666667

CodePudding user response:

Try the following code:

Code:

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'UNIQUE_KEY': [1, 2, 3, 4], 'TRANSFORMATION': ['P', 'D', 'N', 'P'], 'DIM_1': ['Y', 'N', 'N', 'Y'], 'DIM_2': ['N', 'N', 'N', 'Y'], 'DIM_3': ['Y', 'Y', 'N', 'Y']})

# Transform df shape
df = df.set_index(['UNIQUE_KEY', 'TRANSFORMATION']).stack().to_frame().reset_index()
df.columns = ['UNIQUE_KEY', 'TRANSFORMATION', 'DIM', 'DIM_VALUE']

# Get aggregated values except PCT_CASES
df = df.groupby(['DIM', 'DIM_VALUE'], as_index=False).agg(
    TTL_CASES=pd.NamedAgg('UNIQUE_KEY', 'count'),
    PCT_D=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='D').sum() / len(s)),
    PCT_N=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='N').sum() / len(s)),
    PCT_P=pd.NamedAgg('TRANSFORMATION', lambda s: 100 * (s=='P').sum() / len(s)),
).sort_values(['DIM', 'DIM_VALUE'], ascending=[True, False])

# Add PCT_CASES
df.insert(3, 'PCT_CASES', df.groupby('DIM')['TTL_CASES'].transform(lambda s: 100 * s / s.sum()))

Output:

DIM DIM_VALUE TTL_CASES PCT_CASES PCT_D PCT_N PCT_P
DIM_1 Y 2 50 0 0 100
DIM_1 N 2 50 50 50 0
DIM_2 Y 1 25 0 0 100
DIM_2 N 3 75 33.3333 33.3333 33.3333
DIM_3 Y 3 75 33.3333 0 66.6667
DIM_3 N 1 25 0 100 0
  • Related