I have a pyspark dataframe called df1
that looks like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| bbbb | a4  |
| cccc | a2  |
And I have another dataframe called df2
that looks like this:
| ID2_1 | ID2_2 |
|-------|-------|
| a2    | a1    |
| a3    | a2    |
| a2    | a3    |
| a2    | a1    |
where the values of ID2 in the first dataframe match the values in the columns ID2_1 and ID2_2 of the second dataframe.
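For reference, the two dataframes can be reproduced like this (a sketch that assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data copied from the tables above
df1 = spark.createDataFrame(
    [("aaaa", "a1"), ("bbbb", "a2"), ("aaaa", "a3"), ("bbbb", "a4"), ("cccc", "a2")],
    ["ID1", "ID2"])
df2 = spark.createDataFrame(
    [("a2", "a1"), ("a3", "a2"), ("a2", "a3"), ("a2", "a1")],
    ["ID2_1", "ID2_2"])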
So the resultant dataframe will look like this:
| ID1  | ID2 |
|------|-----|
| aaaa | a1  |
| bbbb | a2  |
| aaaa | a3  |
| cccc | a2  |

(the fourth row of df1, bbbb | a4, was filtered out because a4 does not appear in df2)
I want to filter the column ID2 so that it contains only values that appear in one of the columns ID2_1 or ID2_2. I tried:

filtered = df1.filter(
    (f.col("ID2").isin(df2.ID2_1)) | (f.col("ID2").isin(df2.ID2_2))
)
But this doesn't seem to work.
I have seen other suggestions to use a join between the two dataframes, but that operation is too heavy and I'm trying to avoid it. Any suggestions on how to do this?
CodePudding user response:
Not sure why you would want to avoid a join, because this approach may be just as computationally expensive. Anyway:
- collect the values from df2's columns into a flat Python list
- attach the distinct elements of that list to df1 as an array column
- keep only the rows of df1 whose ID2 is contained in that array
Code below
from pyspark.sql.functions import array, array_contains, array_distinct, col, lit

# flatten every value in df2's ID2_1 and ID2_2 columns into one Python list
g = [item for row in df2.select('ID2_1', 'ID2_2').collect() for item in row]

new = (df1
       .withColumn('x', array_distinct(array(*[lit(x) for x in g])))  # array of allowed ids
       .where(array_contains(col('x'), col('ID2')))                   # keep rows whose ID2 is in it
       .drop('x'))
new.show(truncate=False)
+----+---+
|ID1 |ID2|
+----+---+
|aaaa|a1 |
|bbbb|a2 |
|aaaa|a3 |
|cccc|a2 |
+----+---+
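For comparison, the join the question wants to avoid does not have to be heavy: if df2 is small, a broadcast left-semi join gives the same result cheaply. This is only a sketch of that alternative, not part of the solution above:

from pyspark.sql import functions as F

# stack ID2_1 and ID2_2 into one column of allowed ids, then semi-join on it;
# broadcast() is a hint that assumes df2 is small enough to ship to every executor
ids = (df2.select(F.col('ID2_1').alias('ID2'))
          .union(df2.select(F.col('ID2_2').alias('ID2')))
          .distinct())
df1.join(F.broadcast(ids), on='ID2', how='left_semi').show(truncate=False)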