Home > other >  Spark sql get collect_set ignore null on other columns
Spark sql get collect_set ignore null on other columns

Time:03-11

I need union result and collect set but want to ignore null

val df1 = Seq(
  ("1","Adam","Angra", "Anastasia")
).toDF("id","fname", "mname", "lname")
df1.createOrReplaceTempView("df1")

val df2 = Seq(
  ("1",null,null, "Bosma")
).toDF("id","fname", "mname", "lname")
df2.createOrReplaceTempView("df2")

df2 dataframe always has fname and mname null - I need lname concatenated as list when grouped by id

Current query:

select id,fname,mname,collect_set(lname) as lname from (select * from df1 union select * from df2) group by id,fname, mname

Actual Output

id  fname   mname   lname
1   Adam    Angra   ["Anastasia"]
1   null    null    ["Bosma"]

Expected Output

id  fname   mname   lname
1   Adam    Angra   ["Anastasia","Bosma"]

Need help to get above expected result with SQL query

CodePudding user response:

You can group by id and use the first function (ignoring null values) to get fname, mname.

val sql = """
    select id,first(fname, true) as fname,first(mname, true) as mname,collect_set(lname) as lname from
        (select * from df1 union select * from df2)
    group by id
"""
val df = spark.sql(sql)
df.show()
  • Related