Home > Net >  How do you filter a dataframe to make sure values in two columns differ in Pyspark?
How do you filter a dataframe to make sure values in two columns differ in Pyspark?

Time:04-19

I am trying to compare two tables with the same columns and then return columns that conflict. For example: Table A:

emp_id emp_name
1 John
2 Mary

Table B:

emp_id emp_name
1 John
2 Karen
3 Steve

In this instance, I want to know that two different names conflict for 2. I do not care that there is an entry in one table that is not in the other, and I don't care if the entry is in both tables if they do not conflict.

So far my approach was to rename the columns as emp_name1 & 2, join the tables and then filter out null values meaning the name only appears in one list this way:

df = df.join(df2, how = 'outer', on = ['emp_id'])
#filter out null vals (meaning no conflict)
df = df.filter((df.emp_name1.isNotNull()) &(df.emp_name2.isNotNull()))

The next step would be to compare the values to see if they are the same, but when I try to do this, it does not work: df = df.filter((df.emp_name1 = df.emp_name2))

Is there a way to compare columns to each other in this way?

CodePudding user response:

combined_df = pd.concat([df, df2])
print(combined_df[combined_df.duplicated()])

Output:

   emp_id emp_name
0       1     John

Or, since I'm not exactly sure what you're asking;

print(combined_df[~combined_df.duplicated()])
...
   emp_id emp_name
0       1     John
1       2     Mary
1       2    Karen
2       3    Steve
print(combined_df[~combined_df.duplicated(keep=False)])
...
   emp_id emp_name
1       2     Mary
1       2    Karen
2       3    Steve

CodePudding user response:

You can join the tables on emp id as you have already done. Then use where otherwise clause to check if there is conflict. Assuming col name after joining as : emp_id, emp_name_1 and emp_name_2

final_df= df.withColumn("conflict_names", when(col("emp_name_1")!=col("emp_name_2"), col("emp_name_2")).otherwise(lit(None).cast(StringType())))
  • Related