Dealing with many columns and groupby/agg in pandas (without manually assigning all aggregation functions)


I have a data frame, output from another step, that looks like the example below and always has a variable number of numerical columns, denoted here as t1-t4.

hash         name group trial  t1  t2  t3  t4
1AomKHNL56l  EP1  G01   clump   1   4   5   9
2oKA7J1B3GL  EP2  G02   green   2  10   0  24
zsfAu5Q6I60  EP1  G01   clump   4   3   1   3
v68fGHY8zx1  EP2  G02   green   1   5   9  22

What I want to do is group the dataframe on the "group" column and sum all the numerical columns, which I know I can do using:

df.groupby(["group"]).sum()

This drops all my non-numerical columns, though, and I want something somewhat dynamic. For the "hash" column I want something that looks like the output of this:

df.groupby(["group"]).agg({"hash": list})

And for the other non-numerical columns I just want to keep them as-is, because they will all be the same within each unique group id. So the end result would be something like this:

hash                      name group trial  t1  t2  t3  t4
[1AomKHNL56l,zsfAu5Q6I60] EP1  G01   clump   5   7   6  12
[2oKA7J1B3GL,v68fGHY8zx1] EP2  G02   green   3  15   9  46

Right now the only ways I can think of involve either doing the agg and sum separately for each type of column and merging the dataframes at the end, or typing out every column id in the agg function. Does anyone know a way to do this natively in the groupby/agg call, without the extra dataframe merging steps or having to list out every column id and its aggregation? Perhaps some kind of lambda statement?

For reference, the column ids for hash, name, group, and trial are always the same, but the numerical columns have different names depending on which sample groups are being looked at.

CodePudding user response:

You can use a dictionary of aggregation functions:

# default aggregation is sum (covers whatever numerical columns are present)
d = {c: 'sum' for c in df.columns}

# override the fixed, non-numerical columns
d.update({'hash': list, 'name': 'first', 'group': 'first', 'trial': 'first'})

# aggregate
df.groupby('group', as_index=False).agg(d)
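
For reference, here is a minimal, self-contained sketch of that approach (my own reconstruction, assuming pandas is imported as pd and using the sample data from the question):

import pandas as pd

df = pd.DataFrame({
    'hash':  ['1AomKHNL56l', '2oKA7J1B3GL', 'zsfAu5Q6I60', 'v68fGHY8zx1'],
    'name':  ['EP1', 'EP2', 'EP1', 'EP2'],
    'group': ['G01', 'G02', 'G01', 'G02'],
    'trial': ['clump', 'green', 'clump', 'green'],
    't1': [1, 2, 4, 1],
    't2': [4, 10, 3, 5],
    't3': [5, 0, 1, 9],
    't4': [9, 24, 3, 22],
})

# every column defaults to 'sum'; the fixed text columns are then overridden,
# so only the (variable) numerical columns are actually summed
d = {c: 'sum' for c in df.columns}
d.update({'hash': list, 'name': 'first', 'group': 'first', 'trial': 'first'})

out = df.groupby('group', as_index=False).agg(d)
print(out)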

Alternative option to set up the dictionary:

d = {'first': ['name', 'group', 'trial'],
     'sum': ['t1', 't2', 't3', 't4'],
     list: ['hash']}

# invert the mapping: column -> aggregation function
d = {col: func for func, cols in d.items() for col in cols}

df.groupby('group', as_index=False).agg(d)[df.columns]  # [df.columns] restores the original column order
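
Since the numerical column names change from run to run, the 'sum' list could also be built programmatically with select_dtypes (a variation of mine, not part of the original answer):

# pick up whatever numerical columns happen to be present (t1-t4 here)
numeric_cols = df.select_dtypes('number').columns

d = {'first': ['name', 'group', 'trial'],
     'sum': list(numeric_cols),
     list: ['hash']}

d = {col: func for func, cols in d.items() for col in cols}

df.groupby('group', as_index=False).agg(d)[df.columns]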

NB: you can also combine the two dictionary setups (see the sketch after the output below).

output:

                         hash name group  trial  t1  t2  t3  t4
0  [1AomKHNL56l, zsfAu5Q6I60]  EP1   G01  clump   5   7   6  12
1  [2oKA7J1B3GL, v68fGHY8zx1]  EP2   G02  green   3  15   9  46
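
For instance, a rough sketch of combining the two (default everything to 'sum' with the dict comprehension, then override the fixed columns via the inverted mapping):

# default: sum every column, including whichever numerical columns exist
d = {c: 'sum' for c in df.columns}

# override the fixed columns, written in the function -> columns style
overrides = {'first': ['name', 'group', 'trial'], list: ['hash']}
d.update({col: func for func, cols in overrides.items() for col in cols})

df.groupby('group', as_index=False).agg(d)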