I have a dataframe with millions of rows, like this:
CLI_ID  OCCUPA_ID  DIG_LABEL
125     2705       1
328     2708       7
400     2712       1
401     2705       2
525     2708       1
I want to take a random sample of 100k rows in which OCCUPA_ID is 70% 2705, 20% 2708, and 10% 2712, and DIG_LABEL is 50% 1, 20% 2, and 30% 7.
How can I get this in Spark, using pyspark?
CodePudding user response:
Use sampleBy instead of the sample function in PySpark, because sample only does simple random sampling without reference to any column, while sampleBy samples according to the values of a column.
sampleBy takes a column, a fractions dictionary, and an optional seed.
Consider:
df_sample = df.sampleBy(column, fractions, seed)
where column is the column you want to stratify the sampling on, fractions maps each value of that column to the sampling ratio for that value (e.g. 10% is written as 0.1), and seed makes the sample reproducible: if you don't pass a seed, you get different rows every time.
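For instance, a minimal sketch of the seed behaviour (assuming OCCUPA_ID is stored as a string, since the fraction keys must match the column's type):

s1 = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42)
s2 = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42)
# With the same seed both calls draw the same rows; without a seed each run differs.
print(s1.count(), s2.count())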
So the answer for your question is:
dfsample = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, 42).sampleBy("DIG_LABEL", {"1": 0.5, "2": 0.2, "7": 0.3}, 42)
It just samples twice, first on OCCUPA_ID and then on DIG_LABEL; 42 is the seed both times.
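As a quick check (just a sketch, assuming the chained result above is stored in dfsample), you can count how many rows of each value survive; note that sampleBy keeps each row independently with the given probability, so the fractions are approximate rather than exact:

dfsample.groupBy("OCCUPA_ID").count().show()  # rows kept per OCCUPA_ID value
dfsample.groupBy("DIG_LABEL").count().show()  # rows kept per DIG_LABEL value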
CodePudding user response:
You can use the sampleBy method on PySpark DataFrames to perform a stratified sample, passing the column name and a dictionary of fractions for the values within that column. For example:
spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()
+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
|     1|     2705|        7|
|     4|     2705|        1|
|     5|     2705|        7|
|     7|     2708|        2|
|    12|     2705|        1|
|    16|     2708|        2|
|    18|     2708|        2|
|    20|     2705|        7|
|    25|     2705|        2|
|    26|     2705|        2|
|    38|     2705|        7|
|    40|     2705|        1|
|    44|     2705|        2|
|    48|     2708|        7|
|    50|     2708|        2|
|    53|     2705|        1|
|    57|     2705|        1|
|    58|     2712|        1|
|    61|     2705|        2|
|    63|     2708|        7|
+------+---------+---------+
only showing top 20 rows
Since you want one pyspark DataFrame with two samplings performed on two different columns, you can chain the sampleBy calls together:
spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)
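Note that sampleBy takes per-value fractions, not absolute counts, so it will not return exactly 100k rows. If you need to land near a target size, one option (a sketch, not part of the original answer; the 70k/20k/10k targets come from applying the question's percentages to 100k, and the keys must match the actual type of OCCUPA_ID, integers here) is to derive the fractions from the current counts:

# target number of rows to keep for each OCCUPA_ID value
desired = {2705: 70_000, 2708: 20_000, 2712: 10_000}
# current number of rows for each OCCUPA_ID value
totals = {row["OCCUPA_ID"]: row["count"]
          for row in spark_df.groupBy("OCCUPA_ID").count().collect()}
# fraction = target / available, capped at 1.0 because sampleBy expects values in [0, 1]
fractions = {k: min(1.0, desired[k] / totals[k]) for k in desired}
sized_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions=fractions, seed=42)

The result size is still approximate, because each row is kept independently with the given probability.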