Collect Dataset[Year] to List[Year] - Spark Scala

Time:03-24

I have a Dataset[Year] that uses the following schema:

  case class Year(
                  day: Int,
                  month: Int,
                  Year: Int
                )

Is there any way to make a collect maintain the schema?

I tried this:

println("Print -> " + ds.collect().toList)

But the result was: Print -> List([01,01,2022], [31,01,2022]). I expected something like: Print -> List(Year(01,01,2022), Year(31,01,2022))

I know that I can fix this with a map, but I am trying to create a generic method that accepts any schema, so I cannot hard-code a map that does the conversion.

That is my method:

class SchemeList[A]{

  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }

}

The method's return type appears to have the correct signature, but when I run it, I get an error:

val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)

Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year

CodePudding user response:

Based on your additional information in your comment:

I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performant way to pass values from a dataframe as parameters in a select?

Given your initial dataset:

val yearsDF: Dataset[Year] = ???

and that you want to do something like:

val desiredColumns: Array[String] = ???

spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)

You could find the column names of yearsDF by doing:

val desiredColumns: Array[String] = yearsDF.columns

Spark derives this from def schema, which is defined on Dataset; def columns is implemented in terms of it.
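Putting the pieces together, a minimal sketch of the suggested approach (the JDBC URL, table name, and connection properties below are placeholders for illustration, not from the original post):

```scala
import java.util.Properties
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val yearsDF: Dataset[Year] = ???  // your existing typed dataset

// Column names come straight from the Dataset's schema,
// e.g. Array("day", "month", "Year") for the Year case class.
val desiredColumns: Array[String] = yearsDF.columns

// Placeholders: the URL, table name, and Properties are assumptions.
val fromDb = spark.read
  .jdbc("jdbc:postgresql://host/db", "years_table", new Properties())
  .select(desiredColumns.head, desiredColumns.tail: _*)
```

The head/tail split is just how Scala passes an Array to the varargs overload of select.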

CodePudding user response:

You may have a DataFrame, not a Dataset. Try using "as" to convert the DataFrame to a Dataset, like this:

val year = Year(1, 1, 1)
val years = List(year, year)
import spark.implicits._
val ds = spark.sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(ds.collect().toList)
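For the generic helper itself, note that set[A] re-declares the class's type parameter, so the compiler infers a fresh A at the call site instead of using Year; combined with passing an untyped DataFrame, that leads to the GenericRowWithSchema cast error. A sketch of one way to restructure it, assuming an Encoder is available (setFromDF is a hypothetical helper name, not from the original post):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

class SchemeList[A] {
  // Use the class-level A; do not re-declare a type parameter on the method.
  def set(ds: Dataset[A]): List[A] = ds.collect().toList

  // If the input is an untyped DataFrame, re-type it first with as[A],
  // which requires an implicit Encoder[A] in scope
  // (import spark.implicits._ provides one for case classes).
  def setFromDF(df: DataFrame)(implicit enc: Encoder[A]): List[A] =
    df.as[A].collect().toList
}
```

Usage would then look like: val yearList: List[Year] = new SchemeList[Year].setFromDF(df)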