val someDF = Seq(
(8, """{"details":{"decision":"ACCEPT","source":"Rules"}"""),
(64, """{"details":{"decision":"ACCEPT","source":"Rules"}""")
).toDF("number", "word")
someDF.show(false)
+------+--------------------------------------------------+
|number|word                                              |
+------+--------------------------------------------------+
|8     |{"details":{"decision":"ACCEPT","source":"Rules"}}|
|64    |{"details":{"decision":"ACCEPT","source":"Rules"}}|
+------+--------------------------------------------------+
Problem statement: I want to combine all columns into a single JSON column, with the JSON in "word" preserved as a nested object inside the output. That is, no escaping of quotes like I got below.
What I tried:
someDF.toJSON.toDF.show(false)
// this escaped the quotes, which I don't want
+----------------------------------------------------------------------------------+
|value                                                                             |
+----------------------------------------------------------------------------------+
|{"number":8,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"} |
|{"number":64,"word":"{\"details\":{\"decision\":\"ACCEPT\",\"source\":\"Rules\"}}"}|
+----------------------------------------------------------------------------------+
Same issue with someDF.select(to_json(struct(col("*"))).alias("value")).
What I want:
+-----------------------------------------------------------------------+
|value                                                                  |
+-----------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
+-----------------------------------------------------------------------+
Is there a way to do this?
Update: although I used a simple dataframe here, in reality I have hundreds of columns, so a manually defined schema doesn't work for me.
CodePudding user response:
The "word" column in "someDF" is string type, so to_json
treats it as a regular string. The key here is to convert the "word" column to a struct type before using to_json
.
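Why the escaping happens can be sketched in plain Scala (a minimal illustration of JSON serialization rules, not Spark code; the variable names are ours):

```scala
// The JSON text held in the "word" column of the example above.
val word = """{"details":{"decision":"ACCEPT","source":"Rules"}}"""

// Serialized as a *string value*, every inner quote must be escaped:
val asString = "{\"word\":\"" + word.replace("\"", "\\\"") + "\"}"

// Serialized as a *nested object*, the JSON is embedded verbatim:
val asObject = "{\"word\":" + word + "}"
```

This is exactly the difference between to_json seeing a StringType column and seeing a struct.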
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val someDF = Seq(
(8, """{"details":{"decision":"ACCEPT","source":"Rules"}}"""),
(64, """{"details":{"decision":"ACCEPT","source":"Rules"}}""")
).toDF("number", "word")
// Schema for the JSON stored in the "word" column
val schema = StructType(Seq(
  StructField("details", StructType(Seq(
    StructField("decision", StringType),
    StructField("source", StringType)
  )))
))

someDF.select(
  to_json(struct($"number", from_json($"word", schema).alias("word"))).alias("value")
).show(false)
Result:
+-----------------------------------------------------------------------+
|value                                                                  |
+-----------------------------------------------------------------------+
|{"number":8,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}} |
|{"number":64,"word":{"details":{"decision":"ACCEPT","source":"Rules"}}}|
+-----------------------------------------------------------------------+
CodePudding user response:
You can retrieve the list of columns using the columns method on your dataframe, then build your JSON string manually with a combination of the concat and concat_ws built-in functions:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

val result = someDF.select(
  concat(
    lit("{"),
    concat_ws(
      ",",
      // render each column as "name":value, leaving the value itself unquoted
      someDF.columns.map(x => concat(lit("\""), lit(x), lit("\":"), col(x))): _*
    ),
    lit("}")
  ).as("value")
)
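The per-row string assembly that this expression performs can be sketched in plain Scala (the column names and values below mirror the example; this is an illustration, not Spark code):

```scala
// One row of the example dataframe, with each value already rendered as text.
val columns = Seq("number", "word")
val row: Map[String, String] = Map(
  "number" -> "8", // numeric value, rendered without quotes
  "word"   -> """{"details":{"decision":"ACCEPT","source":"Rules"}}""" // already JSON
)

// "{" + "name":value pairs joined by "," + "}" — values are embedded verbatim,
// which is why the nested JSON in "word" comes through unescaped.
val json = "{" + columns.map(c => "\"" + c + "\":" + row(c)).mkString(",") + "}"
```

One caveat: because values are embedded verbatim, a plain string column that is not itself JSON would end up unquoted and produce invalid JSON; this approach assumes every column is either numeric or already valid JSON text.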