How to use separate key lists to perform a join between two DataFrames?


I want to join two different DataFrames (dfA and dfB) built as follows:

dfA.show()
+-----+-------+-------+
| id_A| name_A|address|
+-----+-------+-------+
|    1|   AAAA|  Paris|
|    4|   DDDD| Sydney|
+-----+-------+-------+

dfB.show()
+-----+-------+---------+
| id_B| name_B|      job|
+-----+-------+---------+
|    1|   AAAA|  Analyst|
|    2|   AERF| Engineer|
|    3|   UOPY| Gardener|
|    4|   DDDD|  Insurer|
+-----+-------+---------+
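
For reference, the two DataFrames can be reproduced with something like this (just a sketch, assuming a SparkSession named spark with its implicits in scope):

import spark.implicits._

val dfA = Seq(
  (1, "AAAA", "Paris"),
  (4, "DDDD", "Sydney")
).toDF("id_A", "name_A", "address")

val dfB = Seq(
  (1, "AAAA", "Analyst"),
  (2, "AERF", "Engineer"),
  (3, "UOPY", "Gardener"),
  (4, "DDDD", "Insurer")
).toDF("id_B", "name_B", "job")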

I need to use the following lists in order to do the join:

val keyListA = List("id_A", "name_A")
val keyListB = List("id_B", "name_B")

A simple solution would be:

val join = dfA.join(
  dfB,
  dfA("id_A") === dfB("id_B") &&
  dfA("name_A") === dfB("name_B"),
  "left_outer")

Is there a syntax that would allow doing this join using the keyListA and keyListB lists?

CodePudding user response:

If you really want to build your join expression from lists of column names:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

val dfA: DataFrame = ???
val dfB: DataFrame = ???

val keyListA = List("id_A", "name_A", "property1_A", "property2_A", "property3_A")
val keyListB = List("id_B", "name_B", "property1_B", "property2_B", "property3_B")


// Pair the key columns position by position and combine the equalities with &&
def joinExprsFrom(keyListA: List[String], keyListB: List[String]): Column =
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce((acc, expr) => acc && expr)

dfA.join(
  dfB,
  joinExprsFrom(keyListA, keyListB),
  "left_outer")

You need to make sure keyListA and keyListB have the same size and are non-empty: zip silently drops keys without a counterpart, and reduce throws an exception on an empty list.
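
If you want the helper to fail fast on mismatched or empty key lists, a simple variant (a sketch, not part of the original answer; the name joinExprsFromChecked is made up here) could add require checks before building the expression:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Sketch: same logic as joinExprsFrom, plus basic validation of the key lists
def joinExprsFromChecked(keyListA: List[String], keyListB: List[String]): Column = {
  require(keyListA.nonEmpty, "key lists must not be empty")
  require(keyListA.size == keyListB.size, "key lists must have the same size")
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce(_ && _)
}

// Usage with the lists from the question:
// dfA.join(dfB, joinExprsFromChecked(keyListA, keyListB), "left_outer")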
