Find columns to select, for spark.read(), from another Dataset - Spark Scala


I have a Dataset[Year] that has the following schema:

case class Year(day: Int, month: Int, Year: Int)

Is there any way to collect this Dataset as a list of instances of the case class?

I have tried:

println("Print -> " + ds.collect().toList)

But the result was: Print -> List([01,01,2022], [31,01,2022])

I expected something like: Print -> List(Year(01,01,2022), Year(31,01,2022))

I know that I could adjust it with a map, but I am trying to create a generic method that accepts any schema, so I cannot add a map that does the conversion.

That is my method:

class SchemeList[A]{

  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }

}

The method appears to have the correct return signature, but when I run it, it fails with an error:

val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year

CodePudding user response:

Based on your additional information in your comment:

I need this list to use as variables when creating another DataFrame via JDBC (I need to make a specific select within PostgreSQL). Is there a more performant way to pass values from a DataFrame as parameters in a select?

Given your initial dataset:

val yearsDS: Dataset[Year] = ???

and that you want to do something like:

val desiredColumns: Array[String] = ???

spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)

You could find the column names of yearsDS by doing:

val desiredColumns: Array[String] = yearsDS.columns

Spark derives this from def schema, which is defined on Dataset; def columns simply returns the field names of that schema.
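
Putting the pieces together, here is a minimal sketch of the JDBC read, assuming spark is an active SparkSession, yearsDS is the Dataset[Year] from above, and the URL, credentials and table name are hypothetical placeholders:

import java.util.Properties
import org.apache.spark.sql.DataFrame

val jdbcUrl = "jdbc:postgresql://localhost:5432/mydb"    // hypothetical URL
val connectionProps = new Properties()
connectionProps.setProperty("user", "postgres")          // hypothetical credentials
connectionProps.setProperty("password", "secret")

// Column names come straight from the typed Dataset's schema
val desiredColumns: Array[String] = yearsDS.columns

// Read the table over JDBC and keep only those columns
val fromPostgres: DataFrame = spark.read
  .jdbc(jdbcUrl, "public.years", connectionProps)        // hypothetical table name
  .select(desiredColumns.head, desiredColumns.tail: _*)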

CodePudding user response:

Maybe you have a DataFrame, not a Dataset. Try using as to transform the DataFrame into a Dataset, like this:

import spark.implicits._

// Build a small Dataset[Year]: toDF produces an untyped DataFrame,
// and .as[Year] converts it back into a typed Dataset[Year]
val year = Year(1, 1, 1)
val years = List(year, year)

val df = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]

println(df.collect().toList)  // prints List(Year(1,1,1), Year(1,1,1))
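
Applied to the generic helper from the question, one hedged variant (the method name setFromDataFrame is made up here for illustration) is to require an implicit Encoder so that an untyped DataFrame can be converted with as[A] before collecting, which avoids the GenericRowWithSchema cast error:

import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

class SchemeList[A] {

  // Collect a Dataset that is already typed as A
  def set(ds: Dataset[A]): List[A] = ds.collect().toList

  // Convert an untyped DataFrame to Dataset[A] first, then collect
  def setFromDataFrame(df: DataFrame)(implicit enc: Encoder[A]): List[A] =
    df.as[A].collect().toList
}

// Usage, with import spark.implicits._ in scope to provide Encoder[Year]:
// val yearList: List[Year] = new SchemeList[Year].setFromDataFrame(df)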