How to split a Spark DataFrame into multiple DataFrames? This can be helpful in the case of a crossJoin, to avoid getting the cluster stuck.
CodePudding user response:
I developed an algorithm that splits a whole DataFrame into multiple DataFrames; each chunk can then be processed on its own without overloading the cluster (useful in the crossJoin case).
# First step: add a contiguous incremental id column
from pyspark.sql import functions as F
from pyspark.sql import Window

# monotonically_increasing_id() is unique but not contiguous, so
# row_number() over an ordered window is used to get ids 1..cnt
df = df.withColumn("idx", F.monotonically_increasing_id())
w = Window.orderBy("idx")
df = df.withColumn("id", F.row_number().over(w)).drop("idx")
df.show()
# Second step: compute the chunk sizes
desired_chunks = 3
cnt = df.count()
res = cnt // desired_chunks  # base number of rows per chunk
rem = cnt % desired_chunks   # leftover rows, absorbed by the last chunk
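The answer stops before the splitting loop itself. As a minimal sketch of one way to finish the idea, assuming the cnt, res and rem values computed above and the contiguous id column from the first step, each chunk can be selected with a simple range filter on id (the chunks and start/end names below are illustrative, not from the original answer):

# Sketch: slice the DataFrame into desired_chunks pieces by id range.
# The first chunks get res rows each; the last chunk also takes the
# rem leftover rows, so all cnt rows are covered exactly once.
chunks = []
start = 1
for i in range(desired_chunks):
    size = res + (rem if i == desired_chunks - 1 else 0)
    end = start + size - 1
    chunk = df.filter((F.col("id") >= start) & (F.col("id") <= end))
    chunks.append(chunk)
    start = end + 1

# Each chunk can now be crossJoined (or otherwise processed) on its own:
# result_i = chunks[i].crossJoin(other_df)

If the chunks are reused across several actions, persisting each one (chunk.persist()) avoids recomputing the row-number window every time.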