Home > other >  Pyspark DataFrame Filter column based on a column in another DataFrame without join
Pyspark DataFrame Filter column based on a column in another DataFrame without join

Time:12-29

I have a pyspark dataframe called df1 that looks like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
bbbb a4
cccc a2

And I have another dataframe called df2 that looks like this:

ID2_1 ID2_2
a2 a1
a3 a2
a2 a3
a2 a1

where the values of the ID2 in the first dataframe matches to the values in columns ID2_1, ID2_2 in the second dataframe.

So the resultant dataframe will look like this:

ID1 ID2
aaaa a1
bbbb a2
aaaa a3
cccc a2

(fourth line was filtered out)

I want to filter the column ID2 to contain only values that appear in one of the columns ID2_1 or ID2_2. I tried doing

filter= df1.filter((f.col("ID2").isin(df2.ID2_1)))|
                   (f.col("ID2").isin(df2.ID2_2)))

But this doesn't seem to work. I have seen other suggestions to use a join between the two columns but this operation is way too heavy and I'm trying to avoid such actions. Any suggestions as to how to do this task?

CodePudding user response:

Not sure why you would want to avoid join because it may as well be computationa;;y expensive.

Anyway

  1. create a list of the df2 columns
  2. append the distinct elements of 1 above in df2
  3. Filter out where ID2 contains elements in 2 above.

Code below

new = (df1.withColumn('x', array_distinct(array(*[lit(x) for x in [item for sublist in g for item in sublist]]))).where(array_contains(col('x'), col('ID2'))).drop('x'))

new.show(truncate=False)

 ---- --- 
|ID1 |ID2|
 ---- --- 
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
 ---- --- 
  • Related