Filtering records in pyspark dataframe if the struct Array contains a record


My Pyspark dataframe looks like this:

root
 |-- name: string (nullable = true)
 |-- other_attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I am looking for the rows that don't have [Closed, Yes] in their array of structs under other_attr. The other_attr column is an array of structs and may be an empty array. How can I run this filter?
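For reference, a small DataFrame matching this schema can be built as follows; the sample rows are purely illustrative and mirror the output shown in the answers below.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tuples map positionally onto the (key, value) struct fields.
df = spark.createDataFrame(
    [
        ("test1", [("Closed", "Yes")]),
        ("test2", [("Closed", "No")]),
        ("test3", []),
    ],
    "name string, other_attr array<struct<key:string,value:string>>",
)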

CodePudding user response:

You can use the to_json function together with contains to filter rows: serialize both the array column and the target struct to JSON strings, then check for a substring match.

import pyspark.sql.functions as F

# Serialize the whole array to a JSON string and keep only the rows
# where it does not contain the JSON form of the {Closed, Yes} struct.
df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value')
            )
        )
    )
)
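This works because to_json writes struct fields in schema order, so the literal struct must list key before value, exactly as in the column's element type. An empty array serializes to [], which never matches, so those rows are kept. On the sample data, this retains test2 and test3, the same result as the array_contains approach below.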

CodePudding user response:

You can simply use array_contains to check against the struct [Closed, Yes], like so:

import pyspark.sql.functions as F

df.show()
# +-----+---------------+
# | name|     other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3|             []|
# +-----+---------------+

# Keep rows whose array does not contain the {Closed, Yes} struct.
(df
    .where(
        ~F.array_contains('other_attr', F.struct(
            F.lit('Closed').alias('key'),
            F.lit('Yes').alias('value'),
        ))
    )
    .show()
)

# Output
# +-----+--------------+
# | name|    other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3|            []|
# +-----+--------------+
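One caveat, assuming other_attr can also be NULL rather than just empty (a case not shown in the original post): array_contains returns NULL for a NULL array, and where() drops rows whose predicate is NULL. A minimal sketch that keeps such rows by coalescing the predicate to False:

# Hypothetical variant: treat a NULL other_attr like an empty array,
# so the row survives the filter instead of being dropped.
(df
    .where(
        ~F.coalesce(
            F.array_contains('other_attr', F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value'),
            )),
            F.lit(False),
        )
    )
    .show()
)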