I have two dataframes Df1 and Df2. I have merged Df1,Df2 on column a. Now I want to compare the same name columns of the merged dataframe and append in "miss_match" column if the values dont match.
eg: compare b_PE with b_PS, c_PE with c_PS. if the value of b_xx does not match and value of c_xx does not match then we will append mismatch column with b,c
similarly if value of b_xx match and value of c_xx does not match then we will append mismatch column with c
Df1
a b c
a '1' 1
2 '2' 5
c '3' 3
Df2
a b c
a '4' 0
2 '2' 2
c '6' 3
merged_df
a b_PE b_PS c_PE c_PS
a '4' '1' 0 1
2 '2' '2' 2 5
c '6' '3' 3 3
expected result:
a b_PE b_PS c_PE c_PS miss_match
a '4' '1' 0 1 b,c
2 '2' '2' 2 5 c
c '6' '3' 3 3 b
Any suggestions on this will be very helpful. Thanks
CodePudding user response:
Here is an approach using a groupby
on the columns:
merged_df = df2.merge(df1, on='a', suffixes=('_PE', '_PS'))
group = merged_df.columns.str.replace('_.*', '', regex=True)
mask = merged_df.groupby(group, axis=1).nunique().gt(1).stack()
s = mask[mask].reset_index(level=1).groupby(level=0)['level_1'].agg(','.join).rename('miss_match')
merged_df.join(s)
output:
a b_PE c_PE b_PS c_PS miss_match
0 a 4 0 1 1 b,c
1 2 2 2 2 5 c
2 c 6 3 3 3 b
CodePudding user response:
First use DataFrame.merge
by a
with suffixes
parameters, then compare values for not match by DataFrame.ne
with converting to numpy array because different columnsnames and get new column by matrix multiplication by DataFrame.dot
with columns names b, c
(values before _
):
df = df1.merge(df2, on='a', suffixes=('_PS','_PE'))
df1 = df[['b_PS','c_PS']].astype(int).ne(df[['b_PE','c_PE']].astype(int).to_numpy())
df = df[['a','b_PE','b_PS','c_PE','c_PS']]
df['miss_match'] = df1.dot(df1.columns.str.split('_').str[0] ',').str.rstrip(',')
print (df)
a b_PE b_PS c_PE c_PS miss_match
0 a 4 1 0 1 b,c
1 2 2 2 2 5 c
2 c 6 3 3 3 b
More general solution with MultiIndex
- not necessary specify columns names:
df = df1.merge(df2, on='a', suffixes=('_PS','_PE'))
df1 = df.set_index('a').astype(int)
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.xs('PS', level=1, axis=1).ne(df1.xs('PE', level=1, axis=1))
print (df1)
b c
a
a True True
2 False True
c True False
df['miss_match'] = df1.dot(df1.columns ',').str.rstrip(',').reset_index(drop=True)
print (df)
a b_PS c_PS b_PE c_PE miss_match
0 a 1 1 4 0 b,c
1 2 2 5 2 2 c
2 c 3 3 6 3 b