I have a PySpark DataFrame with two array columns, as shown below:
| col1 | col2 |
|---|---|
| [1, 2, 3] | [1, 4, 3] |
| [5, 4, 3] | [5] |
I want the result to be like this:
| col3 |
|---|
| [1, 0, 1] |
| [1, 0, 0] |
Also, the number of items in col1 is always fixed and is greater than or equal to the number of items in col2.
I know we can do it with a UDF, but I am looking for an optimised way of doing it with PySpark SQL functions.
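For context, the UDF version I have in mind looks roughly like this (just a sketch; the helper name is illustrative and it assumes integer elements and non-null arrays):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Row-at-a-time UDF baseline (the approach I would like to avoid)
contains_flags = F.udf(
    lambda a, b: [1 if x in b else 0 for x in a],
    ArrayType(IntegerType())
)

df = df.withColumn('col3', contains_flags('col1', 'col2'))
```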
CodePudding user response:
The higher-order function `transform` can do it: for every element in the array in col1, you check whether it is contained in the array in col2.
If you need an array of int:
```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [([1, 2, 3], [1, 4, 3]),
     ([5, 4, 3], [5])],
    ['col1', 'col2'])

df = df.select(
    F.transform('col1', lambda x: F.array_contains('col2', x).cast('int')).alias('col3')
)
df.show()
# +---------+
# |     col3|
# +---------+
# |[1, 0, 1]|
# |[1, 0, 0]|
# +---------+
```
If you need an array of boolean:
```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [([1, 2, 3], [1, 4, 3]),
     ([5, 4, 3], [5])],
    ['col1', 'col2'])

df = df.select(
    F.transform('col1', lambda x: F.array_contains('col2', x)).alias('col3')
)
df.show()
# +--------------------+
# |                col3|
# +--------------------+
# | [true, false, true]|
# |[true, false, false]|
# +--------------------+
```
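Note that `F.transform` with a Python lambda requires Spark 3.1 or later. On older versions (2.4+) the same higher-order function can be reached through `F.expr`; a sketch of that fallback:

```python
from pyspark.sql import functions as F

# Same logic via the SQL higher-order function (Spark 2.4+),
# for versions where F.transform with a Python lambda is not available
df = df.select(
    F.expr("transform(col1, x -> cast(array_contains(col2, x) as int))").alias('col3')
)
```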
CodePudding user response:
Another way, if you are a fan of expressions:
```python
df.selectExpr(
    '*',
    "transform(transform(col1, x -> array_contains(col2, x)), y -> cast(y as integer)) as col3"
).show()
```

```
+---------+---------+---------+
|     col1|     col2|     col3|
+---------+---------+---------+
|[1, 2, 3]|[1, 4, 3]|[1, 0, 1]|
|[5, 4, 3]|      [5]|[1, 0, 0]|
+---------+---------+---------+
```
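As a small variation, the nested double `transform` can be collapsed into a single one by casting inside the lambda (a sketch; same logic and same output):

```python
df.selectExpr(
    '*',
    "transform(col1, x -> cast(array_contains(col2, x) as int)) as col3"
).show()
```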