Difference Between Cartesian Join and BroadcastNestedLoop Join in Spark

I went through several articles but still could not figure out the exact difference between them. Both of them scan the tables record by record in a cross-product manner. The articles say that in BroadcastNestedLoop the smaller table is broadcast to all worker nodes. How is the data shuffled in the case of a Cartesian join? Could you please explain what exactly the difference is between these two join strategies in Spark?

CodePudding user response：

Nested loop join

Nested loop join is the most basic joining algorithm in SQL (not only in Spark). In pseudo-code it can be described as:

O_SET     # outer query set
I_SET     # inner query set
PREDICATE # join predicate

foreach o_row in O_SET:
  foreach i_row in I_SET:
    if (o_row, i_row) matches PREDICATE:
      emit (o_row, i_row)
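
The pseudo-code above can be turned into a small runnable sketch. This is plain Python rather than Spark code, and the function name `nested_loop_join` and the sample `orders` and `tiers` data are made up for illustration. The predicate is evaluated on the pair of rows, and the loop keeps going after each match instead of stopping at the first one.

```python
def nested_loop_join(outer, inner, predicate):
    """Yield every (o_row, i_row) pair for which predicate(o_row, i_row) is true."""
    # Materialize the inner side so it can be rescanned for each outer row.
    inner_rows = list(inner)
    for o_row in outer:
        for i_row in inner_rows:
            if predicate(o_row, i_row):
                yield (o_row, i_row)


# Hypothetical sample data for a non-equi join: order amount >= tier threshold.
orders = [("o1", 50), ("o2", 150)]
tiers = [("bronze", 0), ("silver", 100)]

matches = list(nested_loop_join(orders, tiers, lambda o, t: o[1] >= t[1]))
for pair in matches:
    print(pair)
```

Every outer row is compared against every inner row, so the cost grows with the product of the two input sizes regardless of how many pairs match.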

Broadcast Nested Loop Join

In Broadcast Nested Loop Join, one of the input data sets is broadcast to all the executors. After this, each partition of the non-broadcast input data set is joined with the broadcast data set using the standard Nested Loop Join procedure to produce the joined output.

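So, unlike Cartesian Join, Broadcast Nested Loop Join does not pair every partition of one input with every partition of the other: only the broadcast side is replicated, and the non-broadcast side stays in its existing partitions.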
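
To make the data movement concrete, here is a minimal pure-Python simulation of the idea. It is a sketch under simplifying assumptions, not Spark's actual implementation: partitions are modelled as lists, "broadcasting" is modelled by handing the same list to every partition, and the names `broadcast_nested_loop_join`, `large` and `small` are hypothetical.

```python
def broadcast_nested_loop_join(streamed_partitions, broadcast_rows, predicate):
    """Join each partition of the non-broadcast side against the full broadcast side."""
    output_partitions = []
    for partition in streamed_partitions:
        # Each partition runs an ordinary nested loop against its copy
        # of the broadcast rows; partitions never exchange data.
        joined = [
            (s_row, b_row)
            for s_row in partition
            for b_row in broadcast_rows
            if predicate(s_row, b_row)
        ]
        output_partitions.append(joined)
    return output_partitions


# Hypothetical inputs: a "large" side split into 3 partitions, a "small" side broadcast whole.
large = [[1, 2], [3, 4], [5, 6]]
small = [2, 4]

result = broadcast_nested_loop_join(large, small, lambda s, b: s > b)
print(len(result))  # one output partition per partition of the non-broadcast side
```

In this sketch the number of output partitions follows the non-broadcast side, and the only data that is copied around is the small broadcast list.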

Cartesian Join

Cartesian Join is used to perform a cross join between two input data sets. Because a cross join matches "all-to-all" records of the two dataframes, every partition of one data set has to be paired with every partition of the other, so partition data is sent over or replicated to wherever each pairing is computed.

The number of output partitions is always equal to the product of the numbers of partitions of the two input data sets. Each output partition is computed by taking the cartesian product of the two input partitions that are mapped to it.
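
The partition arithmetic described above can be illustrated with another pure-Python sketch. As before, this only models the behaviour with lists and is not Spark code; `cartesian_join`, `left` and `right` are hypothetical names.

```python
from itertools import product


def cartesian_join(left_partitions, right_partitions):
    """Build one output partition for every (left partition, right partition) pair."""
    output_partitions = []
    for left_part in left_partitions:
        for right_part in right_partitions:
            # Each output partition holds the cartesian product of exactly
            # two input partitions, one from each side.
            output_partitions.append(list(product(left_part, right_part)))
    return output_partitions


# Hypothetical inputs: 2 partitions on the left, 3 partitions on the right.
left = [["a", "b"], ["c"]]
right = [[1], [2], [3]]

out = cartesian_join(left, right)
print(len(out))                        # 2 * 3 output partitions
print(sum(len(part) for part in out))  # 3 * 3 output rows in total
```

With 2 and 3 input partitions the sketch produces 6 output partitions, which is why a Cartesian join over heavily partitioned inputs can create a very large number of tasks.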