I need union result and collect set but want to ignore null
val df1 = Seq(
("1","Adam","Angra", "Anastasia")
).toDF("id","fname", "mname", "lname")
df1.createOrReplaceTempView("df1")
val df2 = Seq(
("1",null,null, "Bosma")
).toDF("id","fname", "mname", "lname")
df2.createOrReplaceTempView("df2")
df2 dataframe always has fname and mname null - I need lname concatenated as list when grouped by id
Current query:
select id,fname,mname,collect_set(lname) as lname from (select * from df1 union select * from df2) group by id,fname, mname
Actual Output
id fname mname lname
1 Adam Angra ["Anastasia"]
1 null null ["Bosma"]
Expected Output
id fname mname lname
1 Adam Angra ["Anastasia","Bosma"]
Need help to get above expected result with SQL query
CodePudding user response:
You can group by id
and use the first
function (ignoring null
values) to get fname
, mname
.
val sql = """
select id,first(fname, true) as fname,first(mname, true) as mname,collect_set(lname) as lname from
(select * from df1 union select * from df2)
group by id
"""
val df = spark.sql(sql)
df.show()