Home > Software design >  Get the boolean array column in PySpark dataframe indicating whether the existing two array column h
Get the boolean array column in PySpark dataframe indicating whether the existing two array column h

Time:07-27

I have a PySpark dataframe having two array columns as seen below:

col1 col2
[1, 2, 3] [1, 4, 3]
[5, 4, 3] [5]

I want the result to be like this:

col3
[1, 0, 1]
[1, 0, 0]

Also, number of items in col1 is always fixed and greater than or equal to number of items in col2.

I know we can do it using udf. But I am looking for an optimised way of doing it using PySpark SQL functions.

CodePudding user response:

Higher order function transform can do it. For every element in array of col1, you check if it is contained in the array of col2.

  • If you need array of int:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [([1, 2, 3], [1, 4, 3]),
         ([5, 4, 3], [5])],
        ['col1', 'col2'])
    
    df = df.select(
        F.transform('col1', lambda x: F.array_contains('col2', x).cast('int')).alias('col3')
    )
    df.show()
    #  --------- 
    # |     col3|
    #  --------- 
    # |[1, 0, 1]|
    # |[1, 0, 0]|
    #  --------- 
    
  • If you need array of boolean:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [([1, 2, 3], [1, 4, 3]),
         ([5, 4, 3], [5])],
        ['col1', 'col2'])
    
    df = df.select(
        F.transform('col1', lambda x: F.array_contains('col2', x)).alias('col3')
    )
    df.show()
    #  -------------------- 
    # |                col3|
    #  -------------------- 
    # | [true, false, true]|
    # |[true, false, false]|
    #  -------------------- 
    

CodePudding user response:

Another way if you are a fun of expressions;

df.selectExpr('*',"transform(transform(col1, x-> array_contains(col2,x)),y->cast(y as integer)) as col3").show()

 --------- --------- --------- 
|     col1|     col2|     col3|
 --------- --------- --------- 
|[1, 2, 3]|[1, 4, 3]|[1, 0, 1]|
|[5, 4, 3]|      [5]|[1, 0, 0]|
 --------- --------- --------- 
  • Related