Split a large dataframe into smaller ones in Spark


I have a DF with 200 million rows. I can't group this DF, and I have to split it into 8 smaller DFs (approximately 25 million rows each). I've already tried splitting it, but with no success: without caching the DF, the row counts of the split DFs don't match the larger DF, and if I use cache I run out of disk space (my config is 64 GB RAM and a 512 GB SSD).

Considering this, I thought about the following approach:

  1. Load the entire DF
  2. Assign one of 8 numbers to each row (see the sketch after this list)
  3. Distribute the numbers evenly across the DF

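For step 2, this is roughly the kind of thing I imagine (just a sketch, assuming the large DF is loaded as df; F.rand() is my guess at how to tag rows, and it only gives roughly even buckets, which is why I also looked at the window approach further down):

from pyspark.sql import functions as F

# Tag each row with a bucket number from 1 to 8.
# F.rand() is uniform, so the buckets are only approximately equal in size.
df_tagged = df.withColumn("sep", (F.floor(F.rand() * 8) + 1).cast("int"))
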
Consider the following DF as an example:

+------+--------+
| val1 |  val2  |
+------+--------+
|Paul  |    1.5 |
|Bostap|    1   |
|Anna  |    3   |
|Louis |    4   |
|Jack  |    2.5 |
|Rick  |    0   |
|Grimes|    null|
|Harv  |    2   |
|Johnny|    2   |
|John  |    1   |
|Neo   |    5   |
|Billy |    null|
|James |    2.5 |
|Euler |    null|
+------+--------+

This example DF has 14 rows; I thought of using a window function to create the following DF:

+------+--------+----+
| val1 |  val2  | sep|
+------+--------+----+
|Paul  |    1.5 |1   |
|Bostap|    1   |1   |
|Anna  |    3   |1   |
|Louis |    4   |1   |
|Jack  |    2.5 |1   |
|Rick  |    0   |1   |
|Grimes|    null|1   |
|Harv  |    2   |2   |
|Johnny|    2   |2   |
|John  |    1   |2   |
|Neo   |    5   |2   |
|Billy |    null|2   |
|James |    2.5 |2   |
|Euler |    null|2   |
+------+--------+----+

Given this last DF, I would then filter on sep to get the smaller DFs. My question is: how can I use a window function to generate the sep column of the last DF?
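
To make the idea concrete, this is roughly what I have in mind (a sketch only, assuming the large DF is loaded as df; ntile over a global window is my guess, and I'm aware that a window with no partitionBy pulls everything into a single partition, which may not scale to 200 million rows):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Global window ordered by a generated id (no partitionBy -> single partition).
w = Window.orderBy(F.monotonically_increasing_id())

# ntile(8) numbers the ordered rows 1..8 in (almost) equal-sized buckets;
# the 14-row example above would use ntile(2).
df_sep = df.withColumn("sep", F.ntile(8).over(w))

df_part1 = df_sep.filter(F.col("sep") == 1)   # one of the 8 smaller DFs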

CodePudding user response:

Does your data have a timestamp column? If not, it will be hard to generate the sep column.

CodePudding user response:

Since you are randomly splitting the dataframe into 8 parts, you could use randomSplit():

split_weights = [1.0] * 8
splits = df.randomSplit(split_weights)
for df_split in splits:
    # do what you want with the smaller df_split, for example:
    print(df_split.count())

Note that this will not guarantee the same number of records in each df_split. There may be some fluctuation, but with 200 million records it will be negligible.
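
As a side note (a sketch, assuming your DF is named df; the seed value 42 is arbitrary): randomSplit() also accepts a seed if you need the split to be reproducible, and you can sanity-check that the pieces add up to the original row count:

splits = df.randomSplit([1.0] * 8, seed=42)   # seed makes the split reproducible
counts = [s.count() for s in splits]
print(counts, sum(counts))   # sum(counts) should equal df.count() for a deterministic source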
