Does pyspark DataFrame.randomSplit() return a stable split?


I am working on a problem with a smallish dataset. Over time, new data is collected, and I would like to add it to my dataset. I have created a unique identifier in my current dataset, and I have used randomSplit to split it into a train and a test set:

train, test = unique_lifetimes_spark_df.select("lifetime_id").distinct().randomSplit(weights=[0.8, 0.2], seed=42)

If I now update my dataset and re-run the split, will it produce a stable split that only expands the existing groups, or do I run the risk of "polluting" my groups?

As an example, consider the dataset [A, B, C, D, E]. randomSplit splits this into two groups:

group 1: [A, B, C, D]

group 2: [E]

At some point I get an updated dataset [A, B, C, D, E, F, G] which I would like to incorporate into my modelling process. If I use randomSplit again, am I guaranteed that [A, B, C, D], which originally landed in group 1, will never appear in group 2, and that [E], which originally landed in group 2, will never appear in group 1?

So the updated split should randomly decide where to put [F, G] and the rest should stay in their previously assigned groups.

CodePudding user response:

No, you do not have the guarantee that the split will stay the same for the pre-existing elements if the original dataset grows. Roughly speaking, the seed fixes the sequence of random draws rather than a per-row assignment, so inserting new rows shifts which draw each existing row receives.

You can test it yourself:

scala> spark.range(5).randomSplit(Array(0.8, 0.2), seed=42).foreach(_.show)
+---+
| id|
+---+
|  1|
|  2|
+---+

+---+
| id|
+---+
|  0|
|  3|
|  4|
+---+

scala> spark.range(6).randomSplit(Array(0.8, 0.2), seed=42).foreach(_.show)
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+---+
| id|
+---+
|  5|
+---+
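For completeness, here is a minimal PySpark version of the same experiment, since the question uses the Python API while the test above is Scala. The exact group contents may vary across Spark versions, but the point is that the two runs need not agree on where ids 0-4 land:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Split the same range twice with the same seed, varying only the size,
# and compare where the original ids 0-4 end up in each run.
for n in (5, 6):
    train, test = spark.range(n).randomSplit([0.8, 0.2], seed=42)
    print(f"n={n}  train={sorted(r.id for r in train.collect())}"
          f"  test={sorted(r.id for r in test.collect())}")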
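If a stable assignment per lifetime_id is what you are after, randomSplit is the wrong tool; a common alternative is to bucket each id by a deterministic hash of its value. Below is a minimal sketch of that idea (the 80/100 cutoff stands in for the 0.8/0.2 weights, and the inline DataFrame is a hypothetical stand-in for unique_lifetimes_spark_df from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for unique_lifetimes_spark_df.select("lifetime_id").distinct()
ids = spark.createDataFrame(
    [("A",), ("B",), ("C",), ("D",), ("E",), ("F",), ("G",)],
    ["lifetime_id"])

# xxhash64 is a pure function of the value, so each lifetime_id always
# falls in the same bucket regardless of which other rows are present.
# The double modulo keeps the bucket in [0, 100) even for negative hashes.
bucket = (F.xxhash64("lifetime_id") % 100 + 100) % 100

train = ids.where(bucket < 80)   # ~80% of ids
test  = ids.where(bucket >= 80)  # ~20% of ids

The trade-off is that the proportions are only approximate, since they depend on how the hash values happen to distribute, but each lifetime_id keeps its group no matter how many new ids arrive later.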