I am trying to iterate over an array of array as a column in Spark dataframe. Looking for the best way to do this.
Schema:
root
|-- Animal: struct (nullable = true)
| |-- Species: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mammal: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
Currently I am using this logic. This only gets the first array.
df.select(
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)
Pseudo Logic:
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")
Desired Example Output (flattened elements as string)
llama, sheep, rabbit, hare
CodePudding user response:
You can apply explode
two times: first time on Animal.Species
and second time on the result of the first time:
import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
.withColumn("tmp", explode(col("tmp.mammal")))
.select("tmp.description")
.show()
CodePudding user response:
Not obvious, but you can use .
(or the getField
method of Column
) to select "through" arrays of structs. Selecting Animal.Species.mammal
returns an array of array of the innermost structs. Unfortunately, this array of array prevents you from being able to drill further down with something like Animal.Species.mammal.description
, so you need to flatten it first, then use getField()
.
If I understand your schema correctly, the following JSON should be a valid input:
{
"Animal": {
"Species": [
{
"mammal": [
{ "description": "llama" },
{ "description": "sheep" }
]
},
{
"mammal": [
{ "description": "rabbit" },
{ "description": "hare" }
]
}
]
}
}
val df = spark.read.json("data.json")
df.printSchema
// root
// |-- Animal: struct (nullable = true)
// | |-- Species: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- mammal: array (nullable = true)
// | | | | |-- element: struct (containsNull = true)
// | | | | | |-- description: string (nullable = true)
df.select("Animal.Species.mammal").show(false)
// ----------------------------------------
// |mammal |
// ----------------------------------------
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// ----------------------------------------
df.select(flatten(col("Animal.Species.mammal"))).show(false)
// ------------------------------------
// |flatten(Animal.Species.mammal) |
// ------------------------------------
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// ------------------------------------
This is now an array of structs and you can use getField("description")
to obtain the array of interest:
df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// --------------------------------------------------------
// |flatten(Animal.Species.mammal AS mammal#173).description|
// --------------------------------------------------------
// |[llama, sheep, rabbit, hare] |
// --------------------------------------------------------
Finally, array_join
with separator ", "
can be used to obtain the desired string:
df.select(
array_join(
flatten(col("Animal.Species.mammal")).getField("description"),
", "
) as "animals"
).show(false)
// --------------------------
// |animals |
// --------------------------
// |llama, sheep, rabbit, hare|
// --------------------------