python pandas compare columns and add list of columns that differ-CodePudding

I would like to compare multiple columns in a data frame and add a new column that tells me which columns are different for each row.

for example, for this dataframe i want to compare a1 to a2 and b1 to b2:

   a1  b1  a2  b2
0   1   2   1   2
1   1   2   1   3
2   1   2   3   4

the output should be something like:

   a1  b1  a2  b2  diff
0   1   2   1   2  
1   1   2   1   3  'b1-b2'
2   1   2   3   4  'a1-a2,b1-b2'

this is what i have so far:

import numpy as np
import pandas as pd
data = [{'a1': 1, 'b1': 2, 'a2':1, 'b2':2},
        {'a1':1, 'b1': 2, 'a2': 1, 'b2':3},
        {'a1':1, 'b1': 2, 'a2':3 , 'b2':4}]
df = pd.DataFrame(data)

compare = [('a1','a2'),('b1','b2')]
comp_result = np.array([(df[x[0]] != df[x[1]]) for x in compare])

comp_result is a list of lists of True/False values for each of the comparisons but i am not sure how to use that to create the "diff" column.

CodePudding user response：

Fast one-liner without loops:

col_groups = [c.columns for _, c in df.groupby(df.columns.str[0], axis=1)]

df['diff'] = pd.Series(np.sum([(df[l] != df[r]).map({True: f'{l}-{r}',False:''})   ',' for l, r in col_groups], axis=0)).str.strip(',')

Output:

>>> df
   a1  b1  a2  b2         diff
0   1   2   1   2             
1   1   2   1   3        b1-b2
2   1   2   3   4  a1-a2,b1-b2

CodePudding user response：

You can create a function compare_rows that take each row of the dataframe and uses compare to loop through each desired comparison. Using list comprehension, keep only the different columns for each row and format the output in the desired format using str.join. Then you can apply the function to each row using the .apply DataFrame method.

import numpy as np
import pandas as pd
data = [{'a1': 1, 'b1': 2, 'a2':1, 'b2':2},
        {'a1':1, 'b1': 2, 'a2': 1, 'b2':3},
        {'a1':1, 'b1': 2, 'a2':3 , 'b2':4}]
df = pd.DataFrame(data)
compare = [('a1','a2'),('b1','b2')]

def compare_rows(row):
    differences = ['-'.join(comp)
                   for comp in compare
                   if row[comp[0]] != row[comp[1]]]
    return ','.join(differences)

df['diff'] = df.apply(compare_rows, axis=1)