Home > other >  spark scala: extracting xml from one column
spark scala: extracting xml from one column

Time:11-20

Assume df has the following structure:

root
 |-- id: decimal(38,0) (nullable = true)
 |-- text: string (nullable = true)

here text contains strings of roughly-XML type records. I'm then able to apply the following steps to extract the necessary entries into a flat table:

First, append the root node, since there is none originally. (Question #1: is this step necessary, or can be omitted?)

val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))

Next, parsing the XML:

val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema)xml(df2.select("text").as[String])

This generates df3:

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

which I finally explode:

val df4 = df3.withColumn("exploded_cols", explode($"row"))

into

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- exploded_cols: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- value: string (nullable = true)

My goal is the following table:

val df5 = df4.select("exploded_cols.*")

with

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

Main question: I want that the final table would also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,

root
 |-- id: decimal(38,0) (nullable = true)
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

however, I'm not sure how to call spark.read.option without selecting df2.select("text").as[String] separately into the method (see df3). Is it possible to simplify this script?

This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming blind from an background, so I'm missing all the scala basics, but trying to learn as I go.

CodePudding user response:

Use from_xml function of spak-xml library.

val df = // Read source data
val schema = // Define schema of XML text

df.withColumn("xmlData", from_xml("xmlColName", schema))
  • Related