Assume df
has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
here text
contains strings of roughly-XML type records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append the root node, since there is none originally. (Question #1: is this step necessary, or can be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema)xml(df2.select("text").as[String])
This generates df3
:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want that the final table would also contain the id: decimal(38,0) (nullable = true)
entries along with the exploded key, value
columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to call spark.read.option
without selecting df2.select("text").as[String]
separately into the method (see df3
). Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming blind from an r background, so I'm missing all the scala basics, but trying to learn as I go.
CodePudding user response:
Use from_xml
function of spak-xml library.
val df = // Read source data
val schema = // Define schema of XML text
df.withColumn("xmlData", from_xml("xmlColName", schema))