My PySpark DataFrame has two columns of type ArrayType(StringType).
For each item in col2, I want to check whether it is present in col1. Based on that, I want a result column that holds, index-wise, 1 if the item exists and 0 if it does not.
I tried using np.in1d() and np.isin(), but they raise an error since this is a PySpark DataFrame. I have been trying to figure this out for quite some time now, so any help will be appreciated!
col1 | col2 | result col |
---|---|---|
[item1, item2, item3] | [item5, item2, item3, item17] | [0, 1, 1, 0] |
[item3, item5, item6, item9] | [item3, item2, item9, item5, item12] | [1, 0, 1, 1, 0] |
CodePudding user response:
You can use the transform and array_contains functions to determine whether each element in col2 appears in col1.
import pyspark.sql.functions as F
...
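# F.transform with a Python lambda requires Spark 3.1 or later.
# For each element x of col2, emit 1 if col1 contains x, otherwise 0.
df = df.withColumn(
    'result col',
    F.transform('col2', lambda x: F.when(F.array_contains('col1', x), 1).otherwise(0))
)
df.show(truncate=False)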
# +----------------------------+------------------------------------+---------------+
# |col1                        |col2                                |result col     |
# +----------------------------+------------------------------------+---------------+
# |[item1, item2, item3]       |[item5, item2, item3, item17]       |[0, 1, 1, 0]   |
# |[item3, item5, item6, item9]|[item3, item2, item9, item5, item12]|[1, 0, 1, 1, 0]|
# +----------------------------+------------------------------------+---------------+
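If you want to sanity-check the logic without a Spark session, the per-row behaviour of the transform/array_contains expression can be mirrored in plain Python. This is only a rough sketch of the semantics, not how Spark evaluates it; the helper name membership_flags and the rows list are made up for illustration, and null arrays or null elements are not handled here.

```python
def membership_flags(col1, col2):
    """Return one flag per element of col2: 1 if it occurs in col1, else 0."""
    lookup = set(col1)  # hypothetical helper; a set keeps each membership check cheap
    return [1 if item in lookup else 0 for item in col2]


# The two sample rows from the question, as (col1, col2) pairs.
rows = [
    (["item1", "item2", "item3"], ["item5", "item2", "item3", "item17"]),
    (["item3", "item5", "item6", "item9"], ["item3", "item2", "item9", "item5", "item12"]),
]

results = [membership_flags(c1, c2) for c1, c2 in rows]
for flags in results:
    print(flags)
```

The result always has the same length as col2, which matches the expected output in the question.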