I have a pyspark dataframe called df1
that looks like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| bbbb | a4  |
| cccc | a2  |
And I have another dataframe called df2
that looks like this:
| ID2_1 | ID2_2 |
|-------|-------|
| a2    | a1    |
| a3    | a2    |
| a2    | a3    |
| a2    | a1    |
where the values of ID2 in the first dataframe match the values in the columns ID2_1 and ID2_2 of the second dataframe.
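For reference, the two dataframes can be reproduced like this (a sketch that assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data copied from the tables above
df1 = spark.createDataFrame(
    [("aaaa", "a1"), ("bbbb", "a2"), ("aaaa", "a3"), ("bbbb", "a4"), ("cccc", "a2")],
    ["ID1", "ID2"])
df2 = spark.createDataFrame(
    [("a2", "a1"), ("a3", "a2"), ("a2", "a3"), ("a2", "a1")],
    ["ID2_1", "ID2_2"])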
So the resultant dataframe will look like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| cccc | a2  |

(the fourth row of df1, bbbb | a4, was filtered out because a4 does not appear in df2)
I want to filter the column ID2 so that it contains only values that appear in one of the columns ID2_1 or ID2_2. I tried:

filtered = df1.filter(
    (f.col("ID2").isin(df2.ID2_1)) | (f.col("ID2").isin(df2.ID2_2))
)
But this doesn't seem to work.
I have seen other suggestions to use a join between the two dataframes, but that operation is too heavy and I'm trying to avoid it. Any suggestions on how to do this?
CodePudding user response:
Not sure why you would want to avoid a join, because this approach may be just as computationally expensive. Anyway:
- collect the values from df2's columns into a flat Python list
- attach the distinct elements of that list to df1 as an array column
- keep only the rows of df1 whose ID2 is contained in that array
Code below
from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

# flatten every value in df2's ID2_1 and ID2_2 columns into one Python list
g = [item for row in df2.select('ID2_1', 'ID2_2').collect() for item in row]

new = (df1
       .withColumn('x', array_distinct(array(*[lit(x) for x in g])))  # array of allowed ids
       .where(array_contains(col('x'), col('ID2')))                   # keep rows whose ID2 is in it
       .drop('x'))
new.show(truncate=False)
+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
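For comparison, the join the question wants to avoid does not have to be heavy: if df2 is small, a broadcast left-semi join gives the same result cheaply. This is only a sketch of that alternative, not part of the solution above:

from pyspark.sql import functions as F

# stack ID2_1 and ID2_2 into one column of allowed ids, then semi-join on it;
# broadcast() is a hint that assumes df2 is small enough to ship to every executor
ids = (df2.select(F.col('ID2_1').alias('ID2'))
          .union(df2.select(F.col('ID2_2').alias('ID2')))
          .distinct())
df1.join(F.broadcast(ids), on='ID2', how='left_semi').show(truncate=False)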