Is it possible for Spark to automatically infer the schema and convert a DataFrame to a Dataset without the programmer having to create a case class for each join?
import spark.implicits._

case class DfLeftClass(
  id: Long,
  name: String,
  age: Int
)

val dfLeft = Seq(
  (1, "Tim", 30),
  (2, "John", 15),
  (3, "Pens", 20)
).toDF("id", "name", "age").as[DfLeftClass]
case class DfRightClass(
  id: Long,
  name: String,
  age: Int,
  hobby: String
)
val dfRight = Seq(
  (1, "Tim", 30, "Swimming"),
  (2, "John", 15, "Reading"),
  (3, "Pens", 20, "Programming")
).toDF("id", "name", "age", "hobby").as[DfRightClass]
val joined: DataFrame = dfLeft.join(dfRight) // this results in a DataFrame instead of a Dataset
CodePudding user response:
To stay within the Dataset API, you can use joinWith instead of join. It returns a Dataset of tuples containing the matching rows from both sides, so no additional case class is needed for the join itself:
val joined: Dataset[(DfLeftClass, DfRightClass)] =
  dfLeft.joinWith(dfRight, dfLeft.col("id").eqNullSafe(dfRight.col("id")))
Result (as printed by joined.show(false)):
+-------------+--------------------------+
|_1           |_2                        |
+-------------+--------------------------+
|{1, Tim, 30} |{1, Tim, 30, Swimming}    |
|{2, John, 15}|{2, John, 15, Reading}    |
|{3, Pens, 20}|{3, Pens, 20, Programming}|
+-------------+--------------------------+
From here you can either keep working with the tuples directly or map them to a third case class that combines the fields of both sides.
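As a minimal sketch of both options (the Joined case class and the value names here are made up for illustration; this assumes the spark.implicits._ import from above is still in scope):

// Option 1: keep working with the tuples; field access stays fully typed.
val adults: Dataset[(DfLeftClass, DfRightClass)] =
  joined.filter(_._1.age >= 18)

// Option 2: map each tuple to a combined case class.
case class Joined(
  id: Long,
  name: String,
  age: Int,
  hobby: String
)

val flattened: Dataset[Joined] = joined.map { case (left, right) =>
  Joined(left.id, left.name, left.age, right.hobby)
}

Both versions stay in the Dataset API, so the compiler checks the field access on each side of the tuple rather than falling back to untyped Column expressions.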