Java Apache Spark inner join between two datasets


A small question regarding how to achieve an inner join using Apache Spark and Java, please.

I have this very simple piece of code.

final Dataset<Row> dataSetTableLeft = getDatasetForLeft();
final Dataset<Row> dataSetTableRight = getDatasetForRight();
final Dataset<Row> dataSetTableResult = dataSetTableLeft.join(dataSetTableRight);

dataSetTableLeft.show();
dataSetTableRight.show();
dataSetTableResult.show();

The first table, dataSetTableLeft, looks like this, very straightforward:

+----------+-----+
|      time|label|
+----------+-----+
|1637020800|    0|
|1637107200|    0|
|1637193600|    0|
|1637280000|    0|
|1637366400|    0|
|1637452800|    0|
+----------+-----+

The second table, dataSetTableRight, looks like this, also very straightforward:

+----------+-----+
|      time|label|
+----------+-----+
|1637193600|    1|
|1637280000|    2|
|1637366400|    1|
+----------+-----+

What I would like to achieve is a result table like this, which is a kind of inner join:

+----------+-----+
|      time|label|
+----------+-----+
|1637020800|    0|
|1637107200|    0|
|1637193600|    1|
|1637280000|    2|
|1637366400|    1|
|1637452800|    0|
+----------+-----+

Unfortunately, I am not seeing any function like innerJoin().

Therefore, I have been trying combinations such as

dataSetTableLeft.unionAll(dataSetTableRight);
dataSetTableLeft.crossJoin(dataSetTableRight);

without any luck so far.

What is the correct way to achieve the inner join?

CodePudding user response:

An inner join will only give you rows whose time exists in both datasets. What you want here is to keep all rows from dataSetTableLeft and take the label value from dataSetTableRight when the time matches.

For this, simply use a left outer join and the coalesce function, which picks the label from the second dataset when there is a match and falls back to the first dataset's label otherwise.

I am not using Spark's Java API to test it, but something like this should give you the expected result:

import static org.apache.spark.sql.functions.*;


dataSetTableLeft.join(
    dataSetTableRight,
    dataSetTableLeft.col("time").equalTo(dataSetTableRight.col("time")),
    "left_outer"
).select(
    dataSetTableLeft.col("time"),
    coalesce(dataSetTableRight.col("label"), dataSetTableLeft.col("label")).alias("label")
).show();
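The left-outer-join-plus-coalesce logic can be illustrated without a Spark runtime, as a minimal plain-Java sketch over maps keyed by time (the class and method names here are hypothetical, purely for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LeftJoinCoalesceDemo {
    // Simulates: left outer join on "time", then coalesce(right.label, left.label).
    // Keys are the epoch-second time values; values are the labels.
    static Map<Long, Integer> leftJoinCoalesce(Map<Long, Integer> left,
                                               Map<Long, Integer> right) {
        Map<Long, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<Long, Integer> e : left.entrySet()) {
            // If the right side has a matching time, its label wins (coalesce);
            // otherwise the left label is kept.
            result.put(e.getKey(), right.getOrDefault(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Integer> left = new LinkedHashMap<>();
        left.put(1637020800L, 0); left.put(1637107200L, 0); left.put(1637193600L, 0);
        left.put(1637280000L, 0); left.put(1637366400L, 0); left.put(1637452800L, 0);

        Map<Long, Integer> right = new LinkedHashMap<>();
        right.put(1637193600L, 1); right.put(1637280000L, 2); right.put(1637366400L, 1);

        System.out.println(leftJoinCoalesce(left, right));
    }
}
```

Every left-side time survives, and only the three matched times pick up the right-side labels, which is exactly the shape of the expected result table.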

CodePudding user response:

Execute an anti join first, and then union. In the Java API the join type is "left_anti":

dataSetTableLeft.join(
    dataSetTableRight,
    dataSetTableLeft.col("time").equalTo(dataSetTableRight.col("time")),
    "left_anti"
).union(dataSetTableRight);
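The anti-join-then-union idea can likewise be sketched in plain Java without Spark (hypothetical names, for illustration only): keep the left rows whose time has no match on the right, then append all right rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AntiJoinUnionDemo {
    // Simulates: left_anti join on "time" (keep only left rows with no match
    // in right), then union with all rows of the right dataset.
    static Map<Long, Integer> antiJoinThenUnion(Map<Long, Integer> left,
                                                Map<Long, Integer> right) {
        Map<Long, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<Long, Integer> e : left.entrySet()) {
            if (!right.containsKey(e.getKey())) { // anti join: no match on the right
                result.put(e.getKey(), e.getValue());
            }
        }
        result.putAll(right); // union with the right dataset
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Integer> left = new LinkedHashMap<>();
        left.put(1637020800L, 0); left.put(1637107200L, 0); left.put(1637193600L, 0);
        left.put(1637280000L, 0); left.put(1637366400L, 0); left.put(1637452800L, 0);

        Map<Long, Integer> right = new LinkedHashMap<>();
        right.put(1637193600L, 1); right.put(1637280000L, 2); right.put(1637366400L, 1);

        System.out.println(antiJoinThenUnion(left, right));
    }
}
```

Note that in Spark the union does not guarantee row order, so an orderBy("time") may be needed if the output order matters.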