Imagine you have a PySpark data frame df with three columns: A, B, C. I want to take the rows of the data frame where the value of B does not exist anywhere in column C.
Example:

    A  B  C
    a  1  2
    b  2  4
    c  3  6
    d  4  8
would return

    A  B  C
    a  1  2
    c  3  6
What I tried:

    df.filter(~df.B.isin(df.C))
I also tried making the values of B into a list, but that takes a significant amount of time.
CodePudding user response:
The problem is how you're using isin. For better or worse, isin can't handle another PySpark Column object as input; it needs an actual collection. So one thing you could do is convert the column to a list:
    col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
    df.filter(~df.B.isin(col_values))
Performance-wise, though, this is obviously not ideal: your driver node is now responsible for holding the entire contents of that column in memory. You could instead use a left anti join, which gets the same result without converting anything to a list and without giving up Spark's distributed computing:
    df0 = df[["C"]].withColumnRenamed("C", "B")
    df.join(df0, "B", "leftanti").show()
Thanks to Emma in the comments for her contribution.