Home > Blockchain >  How to Merge Rows in Pandas DF based on Filter Criteria
How to Merge Rows in Pandas DF based on Filter Criteria

Time:05-18

I have a pandas dataframe where there are a number of columns. In some cases the rows are duplicated with slight variations in some of the columns eg:

NAME UNIQUE CODE COUNTRIES ATTRBITUE ANOTHER ATTRIBUTE NUMBERS
Name XYZABC UA EU SA X DELTA 2
Name XYZABC CH Y ALPHA 3

What I want to do is concatenate the above to get:

NAME UNIQUE CODE COUNTRIES ATTRIBUTE ANOTHER ATTRIBUTE NUMBERS
Name XYZABC UA EU SA CH X Y DELTA ALPHA 5

How do I identify duplicate rows based on say the 'UNIQUE CODE' and concatenate the other row values under the different columns with the exception of the UNIQUE CODE?

CodePudding user response:

You can use groupby agg and agg function can take dictionary of column name as key and an aggregation function as value.

  • For NAME column, you are taking only 1 value => first (or last).
  • For NUMBERS column, you are getting the sum of the column. => sum
  • For other columns, you are getting concat of strings => ' '.join

So, you need this dictionary to pass for agg function.

{
  'NAME': 'first',
  'NUMBERS': 'sum',
  'COUNTRIES': ' '.join,
  'ATTRIBUTE': ' '.join,
  'ANOTHER ATTRIBUTE': ' '.join
}

You can pass this hardcoded dictionary directly to the agg function. Or you can do this a little bit more compact like this.

join_col = ['COUNTRIES', 'ATTRIBUTE', 'ANOTHER ATTRIBUTE']
(df.groupby('UNIQUE CODE', as_index=False)
 .agg({
       **{'NAME': 'first', 'NUMBERS': 'sum'},
       **{x: ' '.join for x in join_col]}
 })
)
  • Related