I have a pandas dataframe where there are a number of columns. In some cases the rows are duplicated with slight variations in some of the columns eg:
NAME | UNIQUE CODE | COUNTRIES | ATTRBITUE | ANOTHER ATTRIBUTE | NUMBERS |
---|---|---|---|---|---|
Name | XYZABC | UA EU SA | X | DELTA | 2 |
Name | XYZABC | CH | Y | ALPHA | 3 |
What I want to do is concatenate the above to get:
NAME | UNIQUE CODE | COUNTRIES | ATTRIBUTE | ANOTHER ATTRIBUTE | NUMBERS |
---|---|---|---|---|---|
Name | XYZABC | UA EU SA CH | X Y | DELTA ALPHA | 5 |
How do I identify duplicate rows based on say the 'UNIQUE CODE' and concatenate the other row values under the different columns with the exception of the UNIQUE CODE?
CodePudding user response:
You can use groupby
agg
and agg
function can take dictionary of column name as key and an aggregation function as value.
- For NAME column, you are taking only 1 value =>
first
(orlast
). - For NUMBERS column, you are getting the sum of the column. =>
sum
- For other columns, you are getting concat of strings =>
' '.join
So, you need this dictionary to pass for agg
function.
{
'NAME': 'first',
'NUMBERS': 'sum',
'COUNTRIES': ' '.join,
'ATTRIBUTE': ' '.join,
'ANOTHER ATTRIBUTE': ' '.join
}
You can pass this hardcoded dictionary directly to the agg
function. Or you can do this a little bit more compact like this.
join_col = ['COUNTRIES', 'ATTRIBUTE', 'ANOTHER ATTRIBUTE']
(df.groupby('UNIQUE CODE', as_index=False)
.agg({
**{'NAME': 'first', 'NUMBERS': 'sum'},
**{x: ' '.join for x in join_col]}
})
)