I am trying to compare two tables with the same columns and then return columns that conflict. For example: Table A:
emp_id | emp_name |
---|---|
1 | John |
2 | Mary |
Table B:
emp_id | emp_name |
---|---|
1 | John |
2 | Karen |
3 | Steve |
In this instance, I want to know that two different names conflict for 2. I do not care that there is an entry in one table that is not in the other, and I don't care if the entry is in both tables if they do not conflict.
So far my approach was to rename the columns as emp_name1 & 2, join the tables and then filter out null values meaning the name only appears in one list this way:
df = df.join(df2, how = 'outer', on = ['emp_id'])
#filter out null vals (meaning no conflict)
df = df.filter((df.emp_name1.isNotNull()) &(df.emp_name2.isNotNull()))
The next step would be to compare the values to see if they are the same, but when I try to do this, it does not work:
df = df.filter((df.emp_name1 = df.emp_name2))
Is there a way to compare columns to each other in this way?
CodePudding user response:
combined_df = pd.concat([df, df2])
print(combined_df[combined_df.duplicated()])
Output:
emp_id emp_name
0 1 John
Or, since I'm not exactly sure what you're asking;
print(combined_df[~combined_df.duplicated()])
...
emp_id emp_name
0 1 John
1 2 Mary
1 2 Karen
2 3 Steve
print(combined_df[~combined_df.duplicated(keep=False)])
...
emp_id emp_name
1 2 Mary
1 2 Karen
2 3 Steve
CodePudding user response:
You can join the tables on emp id as you have already done. Then use where otherwise clause to check if there is conflict. Assuming col name after joining as : emp_id, emp_name_1 and emp_name_2
final_df= df.withColumn("conflict_names", when(col("emp_name_1")!=col("emp_name_2"), col("emp_name_2")).otherwise(lit(None).cast(StringType())))