split apache spark dataframe into multiple chunk dataframes for crossJoin acceleration

Time: 09-27

How to split a Spark dataframe into multiple dataframes? This can be helpful in the case of a crossJoin, to avoid overloading the cluster.

CodePudding user response:

I just developed a new algorithm that splits a whole dataframe into multiple dataframes; each chunk can then be processed on its own without overloading the cluster (the crossJoin case).

The full algorithm and code, with an example and explanation, are at this link:

https://medium.com/@djeddi.amin/split-spark-dataframe-into-multiple-small-dataframes-filter-approach-8f7ac36e12c5

Feel free to contact me: [email protected]

# First step: add an incremental id column
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn("idx", F.monotonically_increasing_id())
w = Window.orderBy("idx")  # global window to assign consecutive row numbers
df = df.withColumn("id", F.row_number().over(w)).drop("idx")
df.show()


# Second step: apply the algo!
desired_chunks = 3
cnt = df.count()
res = cnt // desired_chunks  # base size of each chunk
rem = cnt % desired_chunks   # leftover rows to distribute
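The snippet above stops after computing the quotient and remainder; the full approach is in the linked article. As a sketch of how those two numbers can turn into per-chunk filters (the helper name `chunk_bounds` is mine, not from the original post; it assumes the sequential `id` column built in the first step, starting at 1):

```python
def chunk_bounds(cnt, desired_chunks):
    # base chunk size and leftover rows
    res = cnt // desired_chunks
    rem = cnt % desired_chunks
    bounds = []
    start = 1  # row_number() ids start at 1
    for i in range(desired_chunks):
        # spread the remainder over the first `rem` chunks
        size = res + (1 if i < rem else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

# Each (lo, hi) pair can then be applied as a filter on the id column:
# chunks = [df.filter((F.col("id") >= lo) & (F.col("id") <= hi))
#           for lo, hi in chunk_bounds(cnt, desired_chunks)]
```

Each resulting dataframe covers a disjoint id range, so the chunks can be crossJoined one at a time instead of all at once.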