I have a dataframe like this:
COLUMN A COLUMN B COLUMN C COLUMN D VOLUME
2018-01-01 INFO A INFO B INFO C 1.2
2018-01-01 INFO A INFO B INFO C 2.3
2018-01-01 INFO D INFO G INFO H 1.5
2019-01-01 INFO E INFO W INFO R 1.8
2019-01-01 INFO E INFO W INFO R 1.5
I need to delete the duplicated values and sum VOLUME, my output need to be like this:
COLUMN A COLUMN B COLUMN C COLUMN D VOLUME
2018-01-01 INFO A INFO B INFO C 3.5
2018-01-01 INFO D INFO G INFO H 1.5
2019-01-01 INFO E INFO W INFO R 3.3
I already tried:
df['Volume(t)'] = df.groupby(['Month', 'Client', 'ClientName', 'DestinationCountry', 'Region', 'Client Group', 'Grade', 'Family', 'VIPE', 'Segment', 'Sub-segment'])['Volume(t)'].sum()
The error was: "TypeError: incompatible index of inserted column with frame index".
Please, save me.
CodePudding user response:
For the case you mentioned, here is the code helping to get the wanted output:
import pandas as pd
d={'COLUMN A':['2018-01-01','2018-01-01','2018-01-01','2019-01-01','2019-01-01'],
'COLUMN B':['INFO A','INFO A','INFO D','INFO E','INFO E'] ,
'COLUMN C': ['INFO B','INFO B','INFO G','INFO w','INFO w'] ,
'COLUMN D':['INFO C','INFO C','INFO H','INFO R', 'INFO R'],
'VOLUME':[1.2,2.3,1.5,1.8,1.5]}
df= pd.DataFrame(d)
df.drop_duplicates(inplace=True) # to delet the duplicate lines
df.groupby(['COLUMN A','COLUMN B','COLUMN C','COLUMN D'])['VOLUME'].sum()
The output got :
COLUMN A COLUMN B COLUMN C COLUMN D
2018-01-01 INFO A INFO B INFO C 3.5
INFO D INFO G INFO H 1.5
2019-01-01 INFO E INFO w INFO R 3.3
Name: VOLUME, dtype: float64
CodePudding user response:
Here is a generic working solution to groupby all columns but the one to sum. This creates a new dataframe, you do not assign it back to a column of the original one.
df.groupby(list(df.columns.difference(['VOLUME'])), as_index=False).sum()
Output:
COLUMN A COLUMN B COLUMN C COLUMN D VOLUME
0 2018-01-01 INFO A INFO B INFO C 3.5
1 2018-01-01 INFO D INFO G INFO H 1.5
2 2019-01-01 INFO E INFO W INFO R 3.3