Home > OS >  Combine all dataframe columns into one column as a JSON with JSON types preserved
Combine all dataframe columns into one column as a JSON with JSON types preserved

Time:11-19

val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}""")
).toDF("number", "word")

someDF.show(false) :

 ------ --------------------------------------------------------------- 
|number|word                                                           |
 ------ --------------------------------------------------------------- 
|8     |{"details":{"decision":"ACCEPT","source":"Rules"}              |
|64    |{"details":{"decision":"ACCEPT","source":"Rules"}              |
 ------ --------------------------------------------------------------- 

Problem statement: I want to combine all columns into 1 column with JSON types preserved inside the single output column. That is no escaping of quotes etc. like I got below.

What I tried:

someDF.toJSON.toDF.show(false)

// this escaped the quotes, which I don't want
 ------------------------------------------------------------------------------------------------ 
|value                                                                                           |
 ------------------------------------------------------------------------------------------------ 
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}"}               |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}"}              |
 ------------------------------------------------------------------------------------------------ 

Same issue with someDF.select( to_json(struct(col("*"))).alias("value"))

What I want:

 ------------------------------------------------------------------------------------------------ 
|value                                                                                           |
 ------------------------------------------------------------------------------------------------ 
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                          |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}                         |
 ------------------------------------------------------------------------------------------------ 

Is there a way to do this?

Update: Though I used a simple dataframe here, in reality I have hundreds of columns so manually defined schema doesn't work for me.

CodePudding user response:

The "word" column in "someDF" is string type, so to_json treats it as a regular string. The key here is to convert the "word" column to a struct type before using to_json.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val someDF = Seq(
  (8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
  (64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")

val schema = StructType(Seq(StructField("details", StructType(Seq(StructField("decision", StringType), StructField("source", StringType))))))
someDF.select(to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")).show(false)

Result:

 ----------------------------------------------------------------------- 
|value                                                                  |
 ----------------------------------------------------------------------- 
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
 ----------------------------------------------------------------------- 

CodePudding user response:

You can retrieve the list of columns using columns method on your dataframe and then build manually your JSON string using combination of concat and concat_ws built-in functions:

import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val result = someDF.select(
  concat(
    lit("{"),
    concat_ws(
      ",", 
      someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
    ),
    lit("}")).as("value")
)
  • Related