How do you filter a dataframe to make sure values in two columns differ in Pyspark?-CodePudding

I am trying to compare two tables with the same columns and then return columns that conflict. For example: Table A:

emp_id	emp_name
1	John
2	Mary

Table B:

emp_id	emp_name
1	John
2	Karen
3	Steve

In this instance, I want to know that two different names conflict for 2. I do not care that there is an entry in one table that is not in the other, and I don't care if the entry is in both tables if they do not conflict.

So far my approach was to rename the columns as emp_name1 & 2, join the tables and then filter out null values meaning the name only appears in one list this way:

df = df.join(df2, how = 'outer', on = ['emp_id'])
#filter out null vals (meaning no conflict)
df = df.filter((df.emp_name1.isNotNull()) &(df.emp_name2.isNotNull()))

The next step would be to compare the values to see if they are the same, but when I try to do this, it does not work: df = df.filter((df.emp_name1 = df.emp_name2))

Is there a way to compare columns to each other in this way?

CodePudding user response：

combined_df = pd.concat([df, df2])
print(combined_df[combined_df.duplicated()])

Output:

   emp_id emp_name
0       1     John

Or, since I'm not exactly sure what you're asking;

print(combined_df[~combined_df.duplicated()])
...
   emp_id emp_name
0       1     John
1       2     Mary
1       2    Karen
2       3    Steve

print(combined_df[~combined_df.duplicated(keep=False)])
...
   emp_id emp_name
1       2     Mary
1       2    Karen
2       3    Steve

CodePudding user response：

You can join the tables on emp id as you have already done. Then use where otherwise clause to check if there is conflict. Assuming col name after joining as : emp_id, emp_name_1 and emp_name_2

final_df= df.withColumn("conflict_names", when(col("emp_name_1")!=col("emp_name_2"), col("emp_name_2")).otherwise(lit(None).cast(StringType())))