Home > other >  Pandas Aggregate columns dynamically
Pandas Aggregate columns dynamically

Time:02-01

My goal is to aggregate data similar to SAS's "proc summary using types" My starting pandas dataframe could look like this where the database has already done the original group by all dimensions/classification variables and done some aggregate function on the measures. So in sql this would look like

select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender age height weight
F 19 70 123
M 24 72 172

I then would like to summarize the data using pandas to calculate summary rows based on different group bys to come out with this.

gender age height weight
. . 142 295
. 19 70 123
. 24 72 172
F . 70 123
M . 72 172
F 19 70 123
M 24 72 172

Where the first row is agg with no group by 2 and 3 row are agg grouped by age 4 and 5 agg by just gender and then just the normal rows

My current code looks like this

# normally dynamic just hard coded for this example
measures = {'height':{'stat':'sum'}, 'age':{'stat':'sum'}}
msr_config_dict = {}
for measure in measures:
    if measure in message_measures:
       stat = measures[measure]['stat']
       msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as starting point
df=self.df.agg(**msr_config_dict)
dimensions = ['gender','age'] # also dimensions is dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
       # compute agg of compound dimensions
       df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
       df = df.append(df_temp, ignore_index=True)
    # always compute agg of solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)

With this code I get AttributeError: 'height' is not a valid function for 'Series' object

For the input to agg function I have also tried {'height':[('height', 'sum')], 'weight':[('weight', 'sum')]} where I am trying to compute the sum of all heights and name the output height. Which also had an attribute error. I know I will only ever be computing one aggregate function per measure so I would like to dynamically build the input to the pandas agg functon and always rename the stat to itself so I can just append it to the dataframe that I am building with the summary rows.

I am new to pandas coming from SAS background. Any help would be much appreciated.

CodePudding user response:

IIUC:

cols = ['height', 'weight']

out = pd.concat([df[cols].sum(0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')

Output:

>>> out
  gender   age  height  weight
0      .     .     142     295
1      .  19.0      70     123
2      .  24.0      72     172
3      F     .      70     123
4      M     .      72     172
5      F  19.0      70     123
6      M  24.0      72     172

CodePudding user response:

Here is a more flexible solution, extending the solution of @Corralien. You can use itertools.combinations to create all the combinations of dimensions and for all length of combination possible.

from itertools import combinations

# your input
measures = {'height':{'stat':'sum'}, 'weight':{'stat':'min'}}
dimensions = ['gender','age'] 

# change the nested dictionary
msr_config_dict = {key:val['stat'] for key, val in measures.items()}

# concat all possible aggregation
res = pd.concat(
    # case with all aggregated
    [df.agg(msr_config_dict).to_frame().T] 
    # cases at least one column to aggregate over
      [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2.. depending on the number of dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of the specific lenght
       for  _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe
      [df],
    ignore_index=True)[df.columns].fillna('.')

print(res)
#   gender   age  height  weight
# 0      .     .     142     123
# 1      F     .      70     123
# 2      M     .      72     172
# 3      .  19.0      70     123
# 4      .  24.0      72     172
# 5      F  19.0      70     123
# 6      M  24.0      72     172
  •  Tags:  
  • Related