Spark DataFrame Cartesian product by columns


Is there any way to do a column-wise Cartesian join in Spark?

For example, given a DataFrame

+------+------+------+
|col_a |col_b |col_c |
+------+------+------+
|0     |10    |100   |
|0     |20    |200   |
|0     |30    |300   |
|0     |40    |400   |
+------+------+------+

the desired result is this group of DataFrames:

+------+------+
|col_a |col_b |
+------+------+
|0     |10    |
|0     |20    |
|0     |30    |
|0     |40    |
+------+------+

+------+------+
|col_a |col_c |
+------+------+
|0     |100   |
|0     |200   |
|0     |300   |
|0     |400   |
+------+------+

+------+------+
|col_b |col_c |
+------+------+
|10    |100   |
|20    |200   |
|30    |300   |
|40    |400   |
+------+------+

I'm aware that it can be done in code (by creating a list of column-name tuples and selecting through iteration), but I'd like to leverage Spark parallelism if possible by calling the same UDF on all of them, i.e. something similar to groupby().apply(). Is this possible?
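
To be concrete, the iterative version I have in mind is roughly this (just a sketch; process_pair stands in for whatever logic would run on each pair):

from itertools import combinations

def process_pair(pair_df):
    # placeholder for the real per-pair logic
    return pair_df

results = []
for cols in combinations(df.columns, 2):
    pair_df = df.select(*cols)   # one two-column DataFrame per pair
    results.append(process_pair(pair_df))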

I'm using Spark 3.1.1 with PySpark.

Thanks

CodePudding user response:

Your issue has no real connection to Spark parallelism. This is not a Cartesian product, it is simply a combination of columns.
A select is a simple Spark transformation; it is immediate to execute, and plain Python is all you need for that:

from itertools import combinations


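# one two-column DataFrame per unordered pair of columns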
df_list = [df.select(*cols) for cols in combinations(df.columns, 2)]

The result is:

df_list 
[DataFrame[col_a: bigint, col_b: bigint],
 DataFrame[col_a: bigint, col_c: bigint],
 DataFrame[col_b: bigint, col_c: bigint]]
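
If you still want the groupby().apply() pattern you mention, one possible variant (only a sketch; the pair/x/y column names and my_udf below are invented for illustration, not part of your question) is to give every pair generic column names, union the pairs into one long DataFrame, and run the same pandas UDF per pair with groupBy().applyInPandas(), which is available in Spark 3.1:

from itertools import combinations
import pandas as pd
from pyspark.sql import functions as F

# stack every column pair into one long DataFrame with generic column names
long_df = None
for c1, c2 in combinations(df.columns, 2):
    part = df.select(
        F.lit(f"{c1}_{c2}").alias("pair"),
        F.col(c1).alias("x"),
        F.col(c2).alias("y"),
    )
    long_df = part if long_df is None else long_df.unionByName(part)

def my_udf(pdf: pd.DataFrame) -> pd.DataFrame:
    # placeholder per-pair logic, e.g. the correlation of the two columns
    return pd.DataFrame({"pair": [pdf["pair"].iloc[0]],
                         "corr": [pdf["x"].corr(pdf["y"])]})

result = long_df.groupBy("pair").applyInPandas(my_udf, schema="pair string, corr double")

That way the per-pair work runs as one Spark job rather than a Python loop, but for simple selects like yours it buys you nothing.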