How to get an element-wise boolean array showing whether the elements of one array column exist in another array column? [PySpark]

Time:07-24

My PySpark DataFrame has two columns of type Array(StringType).

For each item in column2, I want to check whether it is present in column1. The result should be a new column holding an array that gives, index by index, 1 if the item exists and 0 if it does not.

I tried np.in1d() and np.isin(), but both raise an error because this is a PySpark DataFrame rather than a NumPy array. I have been trying to figure this out for quite some time, so any help will be appreciated!

col1                         | col2                                 | result col
[item1, item2, item3]        | [item5, item2, item3, item17]        | [0, 1, 1, 0]
[item3, item5, item6, item9] | [item3, item2, item9, item5, item12] | [1, 0, 1, 1, 0]
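
To pin down the expected result, here is a minimal plain-Python sketch of the per-row logic that the example rows imply: one flag per element of col2, set to 1 when that element also occurs in col1. The helper name membership_flags is hypothetical and only illustrates the semantics; it is not a PySpark function.

```python
def membership_flags(col1, col2):
    """Return one flag per element of col2: 1 if it also occurs in col1, else 0."""
    lookup = set(col1)  # a set makes each membership check O(1)
    return [1 if item in lookup else 0 for item in col2]


# The two example rows from the question.
rows = [
    (["item1", "item2", "item3"], ["item5", "item2", "item3", "item17"]),
    (["item3", "item5", "item6", "item9"], ["item3", "item2", "item9", "item5", "item12"]),
]

for col1, col2 in rows:
    print(membership_flags(col1, col2))
```

The result always has the length of col2, which matches the lengths in the "result col" column above.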

CodePudding user response:

You can combine the transform and array_contains functions: transform applies a function to every element of col2, and array_contains checks whether that element appears in col1.

import pyspark.sql.functions as F

...
# For each element x of col2, emit 1 if col1 contains x, otherwise 0.
df = df.withColumn(
    'result col',
    F.transform('col2', lambda x: F.when(F.array_contains('col1', x), 1).otherwise(0))
)
df.show(truncate=False)

# +----------------------------+------------------------------------+---------------+
# |col1                        |col2                                |result col     |
# +----------------------------+------------------------------------+---------------+
# |[item1, item2, item3]       |[item5, item2, item3, item17]       |[0, 1, 1, 0]   |
# |[item3, item5, item6, item9]|[item3, item2, item9, item5, item12]|[1, 0, 1, 1, 0]|
# +----------------------------+------------------------------------+---------------+