Spark dataset join as tuple of case classes

Time:11-01

I am joining two datasets where some of their columns share the same name. I would like the output to be a tuple of two case classes, each representing its respective dataset.

val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  // select doesn't work because of the columns which have duplicate names
  .select("ds1.*", "ds2.*")
  // skipping the select and going straight here fails for the same reason
  .as[(Caseclass1, Caseclass2)]

What code is needed to let Spark know to map ds1.* to type Caseclass1 and ds2.* to Caseclass2?

CodePudding user response:

You can leverage the `struct` function here as follows:

// create a wrapper case class
case class Outer(caseclass1: Caseclass1, caseclass2: Caseclass2)

// join and select the columns as structs; selectExpr (not select) is needed
// so that the struct(...) expressions are evaluated rather than treated as
// column names, and the aliases match Outer's field names
val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  .selectExpr("struct(ds1.*) as caseclass1", "struct(ds2.*) as caseclass2")
  .as[Outer]
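If you want the plain tuple the question asked for, rather than a wrapper class, the same struct trick works by aliasing the two structs as `_1` and `_2`, which is the field shape the `Tuple2` encoder expects. A minimal runnable sketch, where the `Emp`/`Dept` case classes and the sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Stand-ins for the asker's two datasets; both share a "key" column
case class Emp(key: Int, name: String)
case class Dept(key: Int, dept: String)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-as-tuple")
  .getOrCreate()
import spark.implicits._

val ds1 = Seq(Emp(1, "ann"), Emp(2, "bob")).toDS()
val ds2 = Seq(Dept(1, "eng"), Dept(2, "ops")).toDS()

// Alias the structs as _1/_2 so the row shape matches the Tuple2 encoder
val joined = ds1.as("ds1")
  .join(ds2.as("ds2"), ds1("key") === ds2("key"), "inner")
  .selectExpr("struct(ds1.*) as _1", "struct(ds2.*) as _2")
  .as[(Emp, Dept)]

joined.show()
```

Note that `Dataset.joinWith` produces a `Dataset[(Emp, Dept)]` directly without any struct plumbing: `ds1.joinWith(ds2, ds1("key") === ds2("key"), "inner")`.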
