Is there any way to do a column-wise Cartesian join in Spark?
For example, given a dataframe
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|    0|   10|  100|
|    0|   20|  200|
|    0|   30|  300|
|    0|   40|  400|
+-----+-----+-----+
The result is a group of dataframes
+-----+-----+
|col_a|col_b|
+-----+-----+
|    0|   10|
|    0|   20|
|    0|   30|
|    0|   40|
+-----+-----+
+-----+-----+
|col_a|col_c|
+-----+-----+
|    0|  100|
|    0|  200|
|    0|  300|
|    0|  400|
+-----+-----+
+-----+-----+
|col_b|col_c|
+-----+-----+
|   10|  100|
|   20|  200|
|   30|  300|
|   40|  400|
+-----+-----+
I'm aware that it can be done in code (by creating a list of column-name tuples and selecting through iteration), but I'd like to leverage Spark parallelism if possible by calling the same UDF on all of them, i.e. something similar to groupby().apply(). Is this possible?
I'm using Spark 3.1.1 with PySpark.
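For reference, a minimal sketch of the example dataframe above (the integer column types are just an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 10, 100), (0, 20, 200), (0, 30, 300), (0, 40, 400)],
    ["col_a", "col_b", "col_c"],
)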
Thanks
CodePudding user response:
Your issue has nothing to do with Spark parallelism. This is not a Cartesian product, it is simply a combination of columns.
A select is a simple Spark transformation: it is evaluated lazily and costs almost nothing to set up, so plain Python is all you need here:
from itertools import combinations

# one dataframe per pair of columns
df_list = [df.select(*cols) for cols in combinations(df.columns, 2)]
The result is:
df_list
[DataFrame[col_a: bigint, col_b: bigint],
DataFrame[col_a: bigint, col_c: bigint],
DataFrame[col_b: bigint, col_c: bigint]]
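If you then want to run the same pandas UDF on every pair, you can loop over df_list as well: the Python loop only builds lazy query plans, and each resulting Spark job still runs in parallel on the cluster. A minimal sketch, where summarize is a hypothetical per-group aggregation (not something from the question):
import pandas as pd

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # hypothetical example logic: per-group sums, keeping the same two columns
    return pdf.groupby(pdf.columns[0], as_index=False).sum()

results = [
    pair_df.groupBy(pair_df.columns[0]).applyInPandas(summarize, schema=pair_df.schema)
    for pair_df in df_list
]
Each element of results is still a lazy DataFrame; nothing is computed until you call an action such as show() or write on it.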