Home > database >  How to groupby multiple columns in dataframe, except one in python
How to groupby multiple columns in dataframe, except one in python

Time:01-28

I have the following dataframe:

      ID    Code     Color   Value
-----------------------------------
0    111     AAA      Blue      23
1    111     AAA       Red      43
2    111     AAA     Green       4
3    121     ABA     Green      45
4    121     ABA     Green      23
5    121     ABA       Red      75
6    122     AAA       Red      52
7    122     ACA      Blue      24
8    122     ACA      Blue      53
9    122     ACA     Green      14
...

I want to group this dataframe by the columns "ID", and "Code", and sum the values from the "Value" column, while excluding the "Color" column from this grouping. Or in other words, I want to groupy by all non-Value columns, except for the "Color" column, and then sum the values from the "Value" column. I am using python for this.

What I am thinking of doing is creating a list of all column names that are not "Color" and "Value", and creating this "column_list", and then simply running:

df.groupby['column_list'].sum()

Though this will not work. How might I augment this code so that I can properly groupby as intended?

EDIT:

This code works:

bins = df.groupby([df.columns[0], 
                   df.columns[1], 
                   df.columns[2]).count() 

bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))

Full code that is not working:

column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()

bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()  
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))

CodePudding user response:

You can pass list to groupby and specify column for aggregate sum:

column_list = [c for c in df.columns if c not in ['Color','Value']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()

Or:

column_list = list(df.columns.difference(['Color','Value'], sort=False))
df1 = df.groupby(column_list, as_index=False)['Value'].sum()

It working with sample data like:

df1 = df.groupby(['ID','Code'], as_index=False)['Value'].sum()

EDIT: Yes, also working:

column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()

Reason is because sum remove by default not numeric columns and if not specified Value it summed all columns.

So if Color is numeric, it sum it too:

print (df)
    ID Code  Color  Value
0  111  AAA      1     23
1  111  AAA      2     43
2  111  AAA      3      4
3  121  ABA      1     45
4  121  ABA      1     23
5  121  ABA      2     75
6  122  AAA      1     52
7  122  ACA      2     24
8  122  ACA      1     53
9  122  ACA      2     14

column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
print (df1)
    ID Code  Value  Color
0  111  AAA      4      3
1  111  AAA     23      1
2  111  AAA     43      2
3  121  ABA     23      1
4  121  ABA     45      1
5  121  ABA     75      2
6  122  AAA     52      1
7  122  ACA     14      2
8  122  ACA     24      2
9  122  ACA     53      1


column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
print (df1)
    ID Code  Value
0  111  AAA      4
1  111  AAA     23
2  111  AAA     43
3  121  ABA     23
4  121  ABA     45
5  121  ABA     75
6  122  AAA     52
7  122  ACA     14
8  122  ACA     24
9  122  ACA     53

EDIT: If need MultiIndex in bins remove as_index=False and column after groupby:

bins = df.groupby([df.columns[0], 
                   df.columns[1], 
                   df.columns[2]).count() 

should be changed to:

column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()
  • Related