How to Merge Rows in Pandas DF based on Filter Criteria-CodePudding

I have a pandas dataframe where there are a number of columns. In some cases the rows are duplicated with slight variations in some of the columns eg:

NAME	UNIQUE CODE	COUNTRIES	ATTRBITUE	ANOTHER ATTRIBUTE	NUMBERS
Name	XYZABC	UA EU SA	X	DELTA	2
Name	XYZABC	CH	Y	ALPHA	3

What I want to do is concatenate the above to get:

NAME	UNIQUE CODE	COUNTRIES	ATTRIBUTE	ANOTHER ATTRIBUTE	NUMBERS
Name	XYZABC	UA EU SA CH	X Y	DELTA ALPHA	5

How do I identify duplicate rows based on say the 'UNIQUE CODE' and concatenate the other row values under the different columns with the exception of the UNIQUE CODE?

CodePudding user response：

You can use groupby agg and agg function can take dictionary of column name as key and an aggregation function as value.

For NAME column, you are taking only 1 value => first (or last).
For NUMBERS column, you are getting the sum of the column. => sum
For other columns, you are getting concat of strings => ' '.join

So, you need this dictionary to pass for agg function.

{
  'NAME': 'first',
  'NUMBERS': 'sum',
  'COUNTRIES': ' '.join,
  'ATTRIBUTE': ' '.join,
  'ANOTHER ATTRIBUTE': ' '.join
}

You can pass this hardcoded dictionary directly to the agg function. Or you can do this a little bit more compact like this.

join_col = ['COUNTRIES', 'ATTRIBUTE', 'ANOTHER ATTRIBUTE']
(df.groupby('UNIQUE CODE', as_index=False)
 .agg({
       **{'NAME': 'first', 'NUMBERS': 'sum'},
       **{x: ' '.join for x in join_col]}
 })
)