In Spark with Scala: when we have to convert an RDD[Row] to a DataFrame, why do we have to convert the RDD[Row] to an RDD of a case class or an RDD of tuples in order to use rdd.toDF()? Is there a specific reason this was not provided for RDD[Row]?
import org.apache.spark.sql.{Row, SparkSession}

object RDDParallelize {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("learn")
      .getOrCreate()
    val abc = Row("val1", "val2")
    val abc2 = Row("val1", "val2")
    val rdd1 = spark.sparkContext.parallelize(Seq(abc, abc2))
    import spark.implicits._
    rdd1.toDF() // doesn't compile: no implicit Encoder[Row] in scope
  }
}
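For contrast, here is a minimal sketch of the tuple-based version that does compile (the object name and column names are illustrative), since spark.implicits._ can derive an Encoder for tuples of primitive types:

import org.apache.spark.sql.SparkSession

object RDDParallelizeTuples {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("learn")
      .getOrCreate()
    import spark.implicits._

    // Tuples are Product types, so an implicit Encoder can be derived
    val rdd = spark.sparkContext.parallelize(Seq(("val1", "val2"), ("val1", "val2")))
    rdd.toDF("col1", "col2").show()
  }
}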
CodePudding user response:
It is confusing because toDF comes from an implicit conversion. As you may have seen, toDF is not a method of the RDD class; it is defined on DatasetHolder, and the rddToDatasetHolder conversion in SQLImplicits is what turns the RDD you created into a DatasetHolder. If you look at rddToDatasetHolder,
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
DatasetHolder(_sqlContext.createDataset(rdd))
}
you will see that it requires an implicit Encoder for T, which, per the Spark docs, is
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
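A quick way to observe this in the Scala REPL (a minimal sketch; newProductEncoder is the implicit that spark.implicits._ brings into scope for tuples and case classes):

import org.apache.spark.sql.{Encoder, Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()
import spark.implicits._

// Resolves: tuples are Product types, handled by newProductEncoder
implicitly[Encoder[(String, String)]]

// Does not compile: no implicit Encoder[Row] is in scope
// implicitly[Encoder[Row]]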
If you try to convert an RDD[Row] to a DatasetHolder, you therefore need an Encoder that tells Spark how to convert a Row object to the internal SQL representation. However,
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
Spark does not provide an implicit Encoder for the Row type (a Row carries no compile-time information about its fields), so the conversion can never resolve successfully.
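The usual workaround is to skip toDF entirely and supply the schema yourself through spark.createDataFrame(rdd, schema), which is a stable SparkSession method (the column names below are illustrative):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()
val rdd1 = spark.sparkContext.parallelize(Seq(Row("val1", "val2"), Row("val1", "val2")))

// Row carries no compile-time field information, so describe the layout by hand
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

val df = spark.createDataFrame(rdd1, schema)
df.show()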