Home > Software engineering >  Iterate dataframe column Array of Array in Spark Scala
Iterate dataframe column Array of Array in Spark Scala

Time:10-06

I am trying to iterate over an array of array as a column in Spark dataframe. Looking for the best way to do this.

Schema:

root
 |-- Animal: struct (nullable = true)
 |    |-- Species: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- mammal: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- description: string (nullable = true)

Currently I am using this logic. This only gets the first array.

df.select(
   col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)

Pseudo Logic:

col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
 
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
 
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
 
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")

Desired Example Output (flattened elements as string)

llama, sheep, rabbit, hare

CodePudding user response:

You can apply explode two times: first time on Animal.Species and second time on the result of the first time:

import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
  .withColumn("tmp", explode(col("tmp.mammal")))
  .select("tmp.description")
  .show()

CodePudding user response:

Not obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of array of the innermost structs. Unfortunately, this array of array prevents you from being able to drill further down with something like Animal.Species.mammal.description, so you need to flatten it first, then use getField().

If I understand your schema correctly, the following JSON should be a valid input:

{
  "Animal": {
    "Species": [
      {
        "mammal": [
          { "description": "llama" },
          { "description": "sheep" }
        ]
      },
      {
        "mammal": [
          { "description": "rabbit" },
          { "description": "hare" }
        ]
      }
    ]
  }
}
val df = spark.read.json("data.json")
df.printSchema
// root
//  |-- Animal: struct (nullable = true)
//  |    |-- Species: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- mammal: array (nullable = true)
//  |    |    |    |    |-- element: struct (containsNull = true)
//  |    |    |    |    |    |-- description: string (nullable = true)

df.select("Animal.Species.mammal").show(false)
//  ---------------------------------------- 
// |mammal                                  |
//  ---------------------------------------- 
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
//  ---------------------------------------- 

df.select(flatten(col("Animal.Species.mammal"))).show(false)
//  ------------------------------------ 
// |flatten(Animal.Species.mammal)      |
//  ------------------------------------ 
// |[{llama}, {sheep}, {rabbit}, {hare}]|
//  ------------------------------------ 

This is now an array of structs and you can use getField("description") to obtain the array of interest:

df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
//  -------------------------------------------------------- 
// |flatten(Animal.Species.mammal AS mammal#173).description|
//  -------------------------------------------------------- 
// |[llama, sheep, rabbit, hare]                            |
//  -------------------------------------------------------- 

Finally, array_join with separator ", " can be used to obtain the desired string:

df.select(
  array_join(
    flatten(col("Animal.Species.mammal")).getField("description"),
    ", "
  ) as "animals"
).show(false)
//  -------------------------- 
// |animals                   |
//  -------------------------- 
// |llama, sheep, rabbit, hare|
//  -------------------------- 
  • Related