Home > Back-end >  Group by eleven multiple columns in pandas
Group by eleven multiple columns in pandas

Time:10-27

I have a dataframe like this:

COLUMN A     COLUMN B    COLUMN C    COLUMN D     VOLUME
2018-01-01    INFO A      INFO B       INFO C       1.2
2018-01-01    INFO A      INFO B       INFO C       2.3
2018-01-01    INFO D      INFO G       INFO H       1.5
2019-01-01    INFO E      INFO W       INFO R       1.8
2019-01-01    INFO E      INFO W       INFO R       1.5

I need to delete the duplicated values and sum VOLUME, my output need to be like this:

COLUMN A     COLUMN B    COLUMN C    COLUMN D     VOLUME
2018-01-01    INFO A      INFO B       INFO C       3.5
2018-01-01    INFO D      INFO G       INFO H       1.5
2019-01-01    INFO E      INFO W       INFO R       3.3

I already tried:

df['Volume(t)'] = df.groupby(['Month', 'Client', 'ClientName', 'DestinationCountry', 'Region', 'Client Group', 'Grade', 'Family', 'VIPE', 'Segment', 'Sub-segment'])['Volume(t)'].sum()

The error was: "TypeError: incompatible index of inserted column with frame index".

Please, save me.

CodePudding user response:

For the case you mentioned, here is the code helping to get the wanted output:

import pandas as pd

d={'COLUMN A':['2018-01-01','2018-01-01','2018-01-01','2019-01-01','2019-01-01'],
   'COLUMN B':['INFO A','INFO A','INFO D','INFO E','INFO E'] ,
   'COLUMN C': ['INFO B','INFO B','INFO G','INFO w','INFO w'] ,
   'COLUMN D':['INFO C','INFO C','INFO H','INFO R', 'INFO R'],
   'VOLUME':[1.2,2.3,1.5,1.8,1.5]}

df= pd.DataFrame(d)
df.drop_duplicates(inplace=True) # to delet the duplicate lines
df.groupby(['COLUMN A','COLUMN B','COLUMN C','COLUMN D'])['VOLUME'].sum()

The output got :

COLUMN A    COLUMN B  COLUMN C  COLUMN D
2018-01-01  INFO A    INFO B    INFO C      3.5
            INFO D    INFO G    INFO H      1.5
2019-01-01  INFO E    INFO w    INFO R      3.3
Name: VOLUME, dtype: float64

CodePudding user response:

Here is a generic working solution to groupby all columns but the one to sum. This creates a new dataframe, you do not assign it back to a column of the original one.

df.groupby(list(df.columns.difference(['VOLUME'])), as_index=False).sum()

Output:

     COLUMN A COLUMN B COLUMN C COLUMN D  VOLUME
0  2018-01-01   INFO A   INFO B   INFO C     3.5
1  2018-01-01   INFO D   INFO G   INFO H     1.5
2  2019-01-01   INFO E   INFO W   INFO R     3.3
  • Related