I'm new to Spark and Scala, so sorry if this is a stupid question. I have a number of tables:
table_a, table_b, ...
and a number of corresponding types for these tables:
case class classA(...), case class classB(...), ...
Then I need to write methods that read data from these tables and create a Dataset:
def getDataFromSource: Dataset[classA] = {
val df: DataFrame = spark.sql("SELECT * FROM table_a")
df.as[classA]
}
The same goes for the other tables and types. Is there any way to avoid this boilerplate (I mean an individual function for each table) and get by with a single one? For example:
def getDataFromSource[T: Encoder](table_name: String): Dataset[T] = {
val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
df.as[T]
}
Then create a list of pairs (table_name, type_name):
val tableTypePairs = List(("table_a", classA), ("table_b", classB), ...)
Then call it using foreach:
tableTypePairs.foreach(tupl => getDataFromSource[what should I put here?](tupl._1))
Thanks in advance!
CodePudding user response:
Something like this should work:
def getDataFromSource[T](table_name: String, encoder: Encoder[T]): Dataset[T] =
spark.sql(s"SELECT * FROM $table_name").as(encoder)
val tableTypePairs = List(
"table_a" -> implicitly[Encoder[classA]],
"table_b" -> implicitly[Encoder[classB]]
)
tableTypePairs.foreach {
case (table, enc) =>
getDataFromSource(table, enc)
}
Note that this is a case of discarding a value, which is a bit of a code smell. Since Encoder is invariant, tableTypePairs isn't going to have a very useful type, and neither would the result of something like:
tableTypePairs.map {
case (table, enc) =>
getDataFromSource(table, enc)
}
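To make the point about discarded values concrete, here is a minimal runnable sketch of the same idea. The row types ClassA and ClassB, their fields, the explicit spark parameter, the rowCounts helper and the temp views are all assumptions added for illustration; only the explicit-encoder approach comes from the answer above. Because the element type of the list is just Encoder[_], the sketch does the per-table work inside the loop and keeps only a plain result:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

// Hypothetical row types standing in for classA and classB from the question.
case class ClassA(id: Long, name: String)
case class ClassB(id: Long, amount: Double)

object EncoderListSketch {
  // Same shape as the method above, with the session passed explicitly (an assumption).
  def getDataFromSource[T](spark: SparkSession, tableName: String, encoder: Encoder[T]): Dataset[T] =
    spark.sql(s"SELECT * FROM $tableName").as(encoder)

  // Encoder is invariant, so the common element type is only Encoder[_].
  val tableTypePairs: List[(String, Encoder[_])] = List(
    "table_a" -> Encoders.product[ClassA],
    "table_b" -> Encoders.product[ClassB]
  )

  // Use each Dataset inside the loop and return only a plain value per table.
  def rowCounts(spark: SparkSession): List[(String, Long)] =
    tableTypePairs.map { case (table, enc) =>
      table -> getDataFromSource(spark, table, enc).count()
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("encoder-list-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in data so the sketch runs without a real catalog.
    Seq(ClassA(1L, "a"), ClassA(2L, "b")).toDS().createOrReplaceTempView("table_a")
    Seq(ClassB(1L, 9.5)).toDS().createOrReplaceTempView("table_b")

    rowCounts(spark).foreach { case (table, n) => println(s"$table: $n rows") }
    spark.stop()
  }
}
```

Anything that needs the concrete row type still has to happen where that type is known, for example in a direct call such as getDataFromSource(spark, "table_a", Encoders.product[ClassA]).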
CodePudding user response:
One option is to pass the Class to the method; this way the generic type T will be inferred:
def getDataFromSource[T: Encoder](table_name: String, clazz: Class[T]): Dataset[T] = {
val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
df.as[T]
}
tableTypePairs.foreach { case (tableName, clazz) => getDataFromSource(tableName, clazz) }
But then I'm not sure how you'll be able to make use of this list of Datasets without .asInstanceOf.
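As a rough illustration of how T gets inferred from the Class argument, the sketch below calls the method directly with classOf. The row type ClassA, its fields, the explicit spark parameter and the temp view are assumptions made for the example. Note that the context bound T: Encoder is resolved at the call site, so the concrete type has to be known there:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Hypothetical row type standing in for classA from the question.
case class ClassA(id: Long, name: String)

object ClassArgSketch {
  // clazz is not used in the body; it only lets the compiler infer T.
  def getDataFromSource[T: Encoder](spark: SparkSession, tableName: String, clazz: Class[T]): Dataset[T] =
    spark.sql(s"SELECT * FROM $tableName").as[T]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("class-arg-sketch")
      .getOrCreate()
    import spark.implicits._ // supplies the implicit Encoder[ClassA]

    // Stand-in data so the sketch runs without a real catalog.
    Seq(ClassA(1L, "a")).toDS().createOrReplaceTempView("table_a")

    // T is inferred as ClassA from classOf[ClassA]; no explicit type argument needed.
    val dsA: Dataset[ClassA] = getDataFromSource(spark, "table_a", classOf[ClassA])
    dsA.show()
    spark.stop()
  }
}
```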