Home > Software design >  Struct data type when creating dataframe with createDataFrame in Scala
Struct data type when creating dataframe with createDataFrame in Scala

Time:07-27

In PySpark, we can create struct data type when using createDataFrame like in the following example ("b", "c") and ("e", "f")

df = spark.createDataFrame([
    ["a", ("b", "c")],
    ["d", ("e", "f")]
])

df.printSchema()
# root
#  |-- _1: string (nullable = true)
#  |-- _2: struct (nullable = true)
#  |    |-- _1: string (nullable = true)
#  |    |-- _2: string (nullable = true)
df.show()
#  --- ------ 
# | _1|    _2|
#  --- ------ 
# |  a|{b, c}|
# |  d|{e, f}|
#  --- ------ 

Is there a similar way in Scala - to create struct schema inside createDataFrame, without using org.apache.spark.sql.functions?

CodePudding user response:

For your specific example, you can use tuples and call this flavor of createDataFrame.

val df = spark.createDataFrame(Seq(
  ("a", ("b", "c")),
  ("d", ("e", "f"))
))

df.printSchema()
/*
root
 |-- _1: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
*/

df.show()
/*
 --- ------ 
| _1|    _2|
 --- ------ 
|  a|[b, c]|
|  d|[e, f]|
 --- ------ 
*/

Instead of ("b", "c") one can also use "b" -> "c" to create a tuple of length 2.

Preferred method

Tuples can become difficult to manage when dealing with many fields and especially nested fields. Likely, you'll want to model your data using case class(s). This also allows to specify struct field names and types.

case class Person(name: String, age: Int)
case class Car(manufacturer: String, model: String, mileage: Double, owner: Person)

val df = spark.createDataFrame(Seq(
  Car("Toyota", "Camry", 81400.8, Person("John", 37)),
  Car("Honda", "Accord", 152090.2, Person("Jane", 25))
))

df.printSchema()
/*
root
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- mileage: double (nullable = false)
 |-- owner: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = false)
*/

df.show()
/*
 ------------ ------ -------- ---------- 
|manufacturer| model| mileage|     owner|
 ------------ ------ -------- ---------- 
|      Toyota| Camry| 81400.8|[John, 37]|
|       Honda|Accord|152090.2|[Jane, 25]|
 ------------ ------ -------- ---------- 
*/
  • Related