How to manually create a Dataset with a Set column in Scala

I'm trying to manually create a Dataset with a column of type Set:

case class Files(Record: String, ids: Set)
val files = Seq(
  Files("202110260931", Set(770010, 770880)),
  Files("202110260640", Set(770010, 770880)),
  Files("202110260715", Set(770010, 770880))
).toDS()
files.show()

This gives me the error:

>command-1888379816641405:10: error: type Set takes type parameters
case class Files(s3path: String, ids: Set)

What am I doing wrong?

CodePudding user response:

Set is a parameterized type, so when you declare it in your Files case class you must specify the element type, for example Set[Int] for a set of integers. Your Files case class definition should therefore be:

case class Files(Record: String, ids: Set[Int])
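
As a quick sanity check, the corrected case class can be tried without Spark at all. The sketch below is only an illustration: the object name SetColumnCheck and the value names are hypothetical, not part of the original code. It shows that the literal Set(770010, 770880) is inferred as Set[Int], so it matches the declared field type.

```scala
// Standalone sketch, no Spark required. SetColumnCheck is a hypothetical name.
case class Files(Record: String, ids: Set[Int])

object SetColumnCheck {
  // Set(770010, 770880) is inferred as Set[Int], matching the field type.
  val file: Files = Files("202110260931", Set(770010, 770880))

  // A Set drops duplicates, so repeating an id does not change its contents.
  val deduplicated: Set[Int] = Set(770010, 770880, 770010)

  def main(args: Array[String]): Unit = {
    println(file.ids.size)             // number of distinct ids
    println(file.ids.contains(770880)) // membership test
    println(deduplicated == file.ids)  // equality ignores insertion order
  }
}
```

If the ids were strings instead, the field would be declared as Set[String]; the element type just has to match what you put in the set.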

The complete code to create a Dataset with a Set column is then:

import org.apache.spark.sql.SparkSession

object ToDataset {

  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("test-app")
    .config("spark.ui.enabled", "false")
    .config("spark.driver.host", "localhost")
    .getOrCreate()

  def main(args: Array[String]): Unit = {

    // Brings the encoders and the toDS() extension method into scope.
    import spark.implicits._
    val files = Seq(
      Files("202110260931", Set(770010, 770880)),
      Files("202110260640", Set(770010, 770880)),
      Files("202110260715", Set(770010, 770880))
    ).toDS()
    files.show()
  }

  case class Files(Record: String, ids: Set[Int])

}

This prints the following dataset:

+------------+----------------+
|      Record|             ids|
+------------+----------------+
|202110260931|[770010, 770880]|
|202110260640|[770010, 770880]|
|202110260715|[770010, 770880]|
+------------+----------------+