I have a column in a DataFrame which is an array of structs. Some of the structs have all-null values, and I would like to filter those out. For example, with the following DataFrame:
+-----------+------------------------------------------+
|advertiser |brands                                    |
+-----------+------------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]   |
|Advertiser2|[{"id": null, "type": null, "name": null}]|
+-----------+------------------------------------------+
I would like to filter out the all-null struct to get:
+-----------+---------------------------------------+
|advertiser |brands                                 |
+-----------+---------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]|
|Advertiser2|[]                                     |
+-----------+---------------------------------------+
I'm thinking it's something along these lines, if I can come up with a struct of null values to compare against:

.withColumn(
  "brands",
  when(
    col("brands").equalTo(*emptyStruct?*),
    null
  )
)
CodePudding user response:
You can try the to_json function: a struct whose fields are all null serializes to {} (Spark omits null fields by default), so a brands array containing only null structs will return [{}].
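A minimal sketch of that idea, assuming Spark 3.0+ (where filter accepts a Scala lambda and to_json drops null fields by default); the cleaned name is just illustrative:

import org.apache.spark.sql.functions.{col, filter, to_json}

// Keep only array elements whose JSON rendering is not the empty
// object, i.e. structs with at least one non-null field.
val cleaned = df.withColumn(
  "brands",
  filter(col("brands"), b => to_json(b) =!= "{}")
)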
CodePudding user response:
You want to filter the array elements. Since Spark 3.0, org.apache.spark.sql.functions provides a method with the signature:

filter(column: Column, f: Column => Column): Column

which does exactly this for you:
import org.apache.spark.sql.functions.{col, filter}

df.select(
  col("advertiser"),
  filter(
    col("brands"),
    // Keep elements that have at least one non-null field,
    // so only all-null structs are dropped.
    b =>
      b.getField("id").isNotNull
        || b.getField("type").isNotNull
        || b.getField("name").isNotNull
  ) as "brands"
)