I am joining two datasets where some of their columns share the same name. I would like the output to be tuples of two case classes, each representing their respective dataset.
val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  // select doesn't work because the flattened result keeps the duplicate column names
  .select("ds1.*", "ds2.*")
  // skipping the select and decoding directly fails for the same reason
  .as[(Caseclass1, Caseclass2)]
What code is needed to let Spark know to map ds1.* to type Caseclass1 and ds2.* to Caseclass2?
CodePudding user response:
You can leverage the struct function here: it packs each aliased dataset's columns into a single nested column, so the duplicate names no longer collide at the top level and the encoder can map each struct onto its case class:
// create a wrapper case class whose field names match the aliases below
case class Outer(caseclass1: Caseclass1, caseclass2: Caseclass2)

import spark.implicits._ // provides the encoder for .as[Outer]

// join, then pack each side's columns into a struct; note that selectExpr
// (not select) is required here, because these arguments are SQL expressions
val joined = dataset1.as("ds1")
  .join(dataset2.as("ds2"), dataset1("key") === dataset2("key"), "inner")
  .selectExpr("struct(ds1.*) as caseclass1", "struct(ds2.*) as caseclass2")
  .as[Outer]
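If you want the tuple output from the question without a wrapper class, the same struct trick works by naming the structs _1 and _2, which are the field names Spark's tuple encoder expects. Below is a minimal, self-contained sketch; the fields of Caseclass1/Caseclass2 and the sample rows are made up purely for illustration:

import org.apache.spark.sql.SparkSession

// hypothetical schemas, only to make the example runnable
case class Caseclass1(key: Int, a: String)
case class Caseclass2(key: Int, b: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val dataset1 = Seq(Caseclass1(1, "x"), Caseclass1(2, "y")).toDS()
val dataset2 = Seq(Caseclass2(1, 0.5)).toDS()

// the tuple encoder maps top-level fields _1 and _2 to the tuple slots
val joinedTuples = dataset1.as("ds1")
  .join(dataset2.as("ds2"), $"ds1.key" === $"ds2.key", "inner")
  .selectExpr("struct(ds1.*) as _1", "struct(ds2.*) as _2")
  .as[(Caseclass1, Caseclass2)]

joinedTuples.show(false)
// joinedTuples.collect() yields Array[(Caseclass1, Caseclass2)]

Either way, the key detail is selectExpr: plain select(String*) treats its string arguments as column names, while selectExpr parses them as SQL expressions, which "struct(ds1.*) as ..." requires.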