Combining multiple structs into single struct in Spark SQL


This is the input I have:

val df = Seq(
  ("Adam","Angra", "Anastasia"),
  ("Boris","Borun", "Bisma"),
  ("Shawn","Samar", "Statham")
).toDF("fname", "mname", "lname")
df.createOrReplaceTempView("df")

I want the Spark SQL output to look like below:

struct
{"data_description":"fname","data_details":"Adam"},{"data_description":"mname","data_details":"Angra"},{"data_description":"lname","data_details":"Anastasia"}
{"data_description":"fname","data_details":"Boris"},{"data_description":"mname","data_details":"Borun"},{"data_description":"lname","data_details":"Bisma"}
{"data_description":"fname","data_details":"Shawn"},{"data_description":"mname","data_details":"Samar"},{"data_description":"lname","data_details":"Statham"}

So far I have tried the following:

val df1 = spark.sql("""select concat(fname,':',mname,":",lname) as name from df""")
df1.createOrReplaceTempView("df1")

val df2 = spark.sql("""
  select
    named_struct('data_description','fname','data_details',split(name, ':')[0]) as struct1,
    named_struct('data_description','mname','data_details',split(name, ':')[1]) as struct2,
    named_struct('data_description','lname','data_details',split(name, ':')[2]) as struct3
  from df1
""")
df2.createOrReplaceTempView("df2")

The output from above:

struct1 struct2 struct3
{"data_description":"fname","data_details":"Adam"}  {"data_description":"mname","data_details":"Angra"} {"data_description":"lname","data_details":"Anastasia"}
{"data_description":"fname","data_details":"Boris"} {"data_description":"mname","data_details":"Borun"} {"data_description":"lname","data_details":"Bisma"}
{"data_description":"fname","data_details":"Shawn"} {"data_description":"mname","data_details":"Samar"} {"data_description":"lname","data_details":"Statham"}

But I get 3 different structs. I need them all in one single struct, separated by commas.

CodePudding user response:

The SQL statement is as follows; the rest you already know.

val sql = """
    select
        concat_ws(
            ','
            ,concat('{"data_description":"fname","data_details":"',fname,'"}')
            ,concat('{"data_description":"mname","data_details":"',mname,'"}')
            ,concat('{"data_description":"lname","data_details":"',lname,'"}')
        ) as struct
    from df
"""
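For intuition, the concat/concat_ws expression simply assembles a JSON-looking string per row. A minimal plain-Scala sketch of the same string assembly for the first sample row:

```scala
// Plain-Scala illustration of the string the SQL expression builds
// for the row ("Adam", "Angra", "Anastasia").
val (fname, mname, lname) = ("Adam", "Angra", "Anastasia")
val row = Seq(
  s"""{"data_description":"fname","data_details":"$fname"}""",
  s"""{"data_description":"mname","data_details":"$mname"}""",
  s"""{"data_description":"lname","data_details":"$lname"}"""
).mkString(",") // equivalent of concat_ws(',', ...)
```

Note that manual concatenation like this does not escape quotes or special characters inside the data values, whereas to_json does.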

CodePudding user response:

You can create an array of structs, then use to_json if you want the output as a string:

spark.sql("""
select  to_json(array(
          named_struct('data_description','fname','data_details', fname),
          named_struct('data_description','mname','data_details', mname), 
          named_struct('data_description','lname','data_details', lname) 
        )) as struct
from  df
""").show()

//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|struct                                                                                                                                                          |
//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|[{"data_description":"fname","data_details":"Adam"},{"data_description":"mname","data_details":"Angra"},{"data_description":"lname","data_details":"Anastasia"}]|
//|[{"data_description":"fname","data_details":"Boris"},{"data_description":"mname","data_details":"Borun"},{"data_description":"lname","data_details":"Bisma"}]   |
//|[{"data_description":"fname","data_details":"Shawn"},{"data_description":"mname","data_details":"Samar"},{"data_description":"lname","data_details":"Statham"}] |
//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

If you have many columns, you can dynamically generate the struct sql expressions like this:

val structs = df.columns.map(c => s"named_struct('data_description','$c','data_details', $c)").mkString(",")

val df2 = spark.sql(s"""
  select  to_json(array($structs)) as struct
  from  df
""")
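To see what the generated expression looks like, the mkString logic above can be reproduced in plain Scala (using a literal column list as a stand-in for df.columns, so the expansion can be inspected without Spark):

```scala
// Stand-in for df.columns on the sample DataFrame.
val columns = Seq("fname", "mname", "lname")
val structs = columns
  .map(c => s"named_struct('data_description','$c','data_details', $c)")
  .mkString(",")
// structs now holds three comma-separated named_struct(...) expressions,
// ready to splice into the to_json(array($structs)) query.
```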

If you don't want to use an array, you can simply concatenate the results of to_json on the 3 structs:

val structs = df.columns.map(c => s"to_json(named_struct('data_description','$c','data_details', $c))").mkString(",")

val df2 = spark.sql(s"""
  select  concat_ws(',', $structs) as struct
  from  df
""")
