My PySpark DataFrame looks like this:
|-- name: string (nullable = true)
|-- other_attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I am looking for the rows that don't have [Closed, Yes] in their array of structs under other_attr. Note that other_attr is an array of structs that could be empty. How can I write this filter?
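For reference, here is a minimal sketch that reproduces the schema above (the session setup and sample rows are illustrative only):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tuples map to struct fields; an empty list gives an empty array.
df = spark.createDataFrame(
    [
        ("test1", [("Closed", "Yes")]),
        ("test2", [("Closed", "No")]),
        ("test3", []),
    ],
    "name string, other_attr array<struct<key:string,value:string>>",
)
df.printSchema()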
CodePudding user response:
You can use the to_json function together with Column.contains to filter rows on a serialized match:
import pyspark.sql.functions as F

# Serialize both the array and the target struct to JSON strings,
# then keep rows whose serialized array does not contain the target.
df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value')
            )
        )
    )
)
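Note that this relies on an exact substring match of the serialized JSON, so the struct passed to F.struct must list key before value, matching the field order in the schema; a different field order would serialize to a different string and never match. Also, if other_attr is null (rather than an empty array), to_json returns null and the filter drops that row.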
CodePudding user response:
You can simply use array_contains to check against the struct [Closed, Yes], like so:
import pyspark.sql.functions as F
df.show()
# +-----+---------------+
# | name|     other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3|             []|
# +-----+---------------+
(df
    # Keep rows whose array does not contain the struct {Closed, Yes}.
    .where(~F.array_contains('other_attr', F.struct(
        F.lit('Closed').alias('key'),
        F.lit('Yes').alias('value'),
    )))
    .show()
)
# Output
# +-----+--------------+
# | name|    other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3|            []|
# +-----+--------------+
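One caveat: if other_attr can also be null (not just empty), array_contains returns null for those rows and the where clause silently drops them. A minimal null-safe sketch, assuming you want to keep such rows, wraps the check in coalesce to default the missing result to False:

import pyspark.sql.functions as F

# Treat a null array as "does not contain {Closed, Yes}" so the row is kept.
df.where(
    ~F.coalesce(
        F.array_contains('other_attr', F.struct(
            F.lit('Closed').alias('key'),
            F.lit('Yes').alias('value'),
        )),
        F.lit(False),
    )
).show()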