I have the following dataframe:
ID Code Color Value
-----------------------------------
0 111 AAA Blue 23
1 111 AAA Red 43
2 111 AAA Green 4
3 121 ABA Green 45
4 121 ABA Green 23
5 121 ABA Red 75
6 122 AAA Red 52
7 122 ACA Blue 24
8 122 ACA Blue 53
9 122 ACA Green 14
...
I want to group this dataframe by the columns "ID", and "Code", and sum the values from the "Value" column, while excluding the "Color" column from this grouping. Or in other words, I want to groupy by all non-Value columns, except for the "Color" column, and then sum the values from the "Value" column. I am using python for this.
What I am thinking of doing is creating a list of all column names that are not "Color" and "Value", and creating this "column_list", and then simply running:
df.groupby['column_list'].sum()
Though this will not work. How might I augment this code so that I can properly groupby as intended?
EDIT:
This code works:
bins = df.groupby([df.columns[0],
df.columns[1],
df.columns[2]).count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
Full code that is not working:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list, as_index=False)['Value'].count()
bins["Weight"] = bins / bins.groupby(df.columns[0]).sum()
bins.reset_index(inplace=True)
bins['Weight'] = bins['Weight'].round(4)
display(HTML(bins.to_html()))
CodePudding user response:
You can pass list to groupby
and specify column for aggregate sum
:
column_list = [c for c in df.columns if c not in ['Color','Value']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
Or:
column_list = list(df.columns.difference(['Color','Value'], sort=False))
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
It working with sample data like:
df1 = df.groupby(['ID','Code'], as_index=False)['Value'].sum()
EDIT: Yes, also working:
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
Reason is because sum
remove by default not numeric columns and if not specified Value
it summed all columns.
So if Color
is numeric, it sum it too:
print (df)
ID Code Color Value
0 111 AAA 1 23
1 111 AAA 2 43
2 111 AAA 3 4
3 121 ABA 1 45
4 121 ABA 1 23
5 121 ABA 2 75
6 122 AAA 1 52
7 122 ACA 2 24
8 122 ACA 1 53
9 122 ACA 2 14
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False).sum()
print (df1)
ID Code Value Color
0 111 AAA 4 3
1 111 AAA 23 1
2 111 AAA 43 2
3 121 ABA 23 1
4 121 ABA 45 1
5 121 ABA 75 2
6 122 AAA 52 1
7 122 ACA 14 2
8 122 ACA 24 2
9 122 ACA 53 1
column_list = [c for c in df.columns if c not in ['Color']]
df1 = df.groupby(column_list, as_index=False)['Value'].sum()
print (df1)
ID Code Value
0 111 AAA 4
1 111 AAA 23
2 111 AAA 43
3 121 ABA 23
4 121 ABA 45
5 121 ABA 75
6 122 AAA 52
7 122 ACA 14
8 122 ACA 24
9 122 ACA 53
EDIT: If need MultiIndex
in bins
remove as_index=False
and column after groupby
:
bins = df.groupby([df.columns[0],
df.columns[1],
df.columns[2]).count()
should be changed to:
column_list = [c for c in df.columns if c not in ['Value']]
bins = df.groupby(column_list).count()