Home > Enterprise >  Compare same column of merged dataframe and append difference in a new column
Compare same column of merged dataframe and append difference in a new column

Time:03-11

I have two dataframes Df1 and Df2. I have merged Df1,Df2 on column a. Now I want to compare the same name columns of the merged dataframe and append in "miss_match" column if the values dont match.

eg: compare b_PE with b_PS, c_PE with c_PS. if the value of b_xx does not match and value of c_xx does not match then we will append mismatch column with b,c

similarly if value of b_xx match and value of c_xx does not match then we will append mismatch column with c

Df1
a   b   c
a   '1' 1
2   '2' 5
c   '3' 3

Df2
a   b   c
a   '4' 0
2   '2' 2
c   '6' 3

merged_df

a   b_PE    b_PS    c_PE    c_PS
a   '4'       '1'      0       1
2   '2'       '2'      2       5
c   '6'       '3'      3       3

expected result:

a   b_PE    b_PS    c_PE    c_PS    miss_match
a   '4'       '1'      0       1      b,c
2   '2'       '2'      2       5      c
c   '6'       '3'      3       3      b

Any suggestions on this will be very helpful. Thanks

CodePudding user response:

Here is an approach using a groupby on the columns:

merged_df = df2.merge(df1, on='a', suffixes=('_PE', '_PS'))

group = merged_df.columns.str.replace('_.*', '', regex=True)
mask = merged_df.groupby(group, axis=1).nunique().gt(1).stack()
s = mask[mask].reset_index(level=1).groupby(level=0)['level_1'].agg(','.join).rename('miss_match')

merged_df.join(s)

output:

   a  b_PE  c_PE  b_PS  c_PS miss_match
0  a     4     0     1     1        b,c
1  2     2     2     2     5          c
2  c     6     3     3     3          b

CodePudding user response:

First use DataFrame.merge by a with suffixes parameters, then compare values for not match by DataFrame.ne with converting to numpy array because different columnsnames and get new column by matrix multiplication by DataFrame.dot with columns names b, c (values before _):

df = df1.merge(df2, on='a', suffixes=('_PS','_PE'))

df1 = df[['b_PS','c_PS']].astype(int).ne(df[['b_PE','c_PE']].astype(int).to_numpy())

df = df[['a','b_PE','b_PS','c_PE','c_PS']]

df['miss_match'] = df1.dot(df1.columns.str.split('_').str[0]   ',').str.rstrip(',')
    
print (df)
   a  b_PE  b_PS  c_PE  c_PS miss_match
0  a     4     1     0     1        b,c
1  2     2     2     2     5          c
2  c     6     3     3     3          b

More general solution with MultiIndex - not necessary specify columns names:

df = df1.merge(df2, on='a', suffixes=('_PS','_PE'))

df1 = df.set_index('a').astype(int)
df1.columns = df1.columns.str.split('_', expand=True)

df1 = df1.xs('PS', level=1, axis=1).ne(df1.xs('PE', level=1, axis=1))
print (df1)
       b      c
a              
a   True   True
2  False   True
c   True  False

df['miss_match'] = df1.dot(df1.columns   ',').str.rstrip(',').reset_index(drop=True)
print (df)
   a  b_PS  c_PS  b_PE  c_PE miss_match
0  a     1     1     4     0        b,c
1  2     2     5     2     2          c
2  c     3     3     6     3          b
  • Related