Stratified Sampling with PySpark on Multiple Columns


I have a dataframe with millions of records, like this:

CLI_ID OCCUPA_ID DIG_LABEL
125    2705      1
328    2708      7
400    2712      1
401    2705      2
525    2708      1

I want to take a random sample of 100k rows in which OCCUPA_ID is 70% 2705, 20% 2708, and 10% 2712, and DIG_LABEL is 50% 1, 20% 2, and 30% 7.

How can I get this in Spark, using pyspark?

CodePudding user response:

Use sampleBy instead of the sample function in PySpark: sample only draws rows at random, without regard to any column, whereas sampleBy stratifies the sample on a column. sampleBy takes a column, a dictionary of fractions, and an optional seed. Consider:

df_sample = df.sampleBy(col, fractions, seed)

where,

  • col selects the column you want to stratify the sample on
  • fractions maps each value of that column to its sampling ratio, e.g. 0.1 for 10%
  • seed pins the randomness so repeated runs return the same sample; without it, each run shows different data (a runnable sketch follows this list)
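
For instance, a minimal runnable sketch, assuming a local Spark session and the five rows from the question (with input this tiny, the fractions are illustrative only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(125, 2705, 1), (328, 2708, 7), (400, 2712, 1),
     (401, 2705, 2), (525, 2708, 1)],
    ["CLI_ID", "OCCUPA_ID", "DIG_LABEL"],
)

# Keep ~70% of the 2705 rows, ~20% of 2708, ~10% of 2712.
# The dictionary keys must match the column's type (integers here).
df_sample = df.sampleBy("OCCUPA_ID", fractions={2705: 0.7, 2708: 0.2, 2712: 0.1}, seed=42)
df_sample.show()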

So the answer for your question is:

dfsample = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, 42) \
             .sampleBy("DIG_LABEL", {"1": 0.5, "2": 0.2, "7": 0.3}, 42)

This just samples twice: first on OCCUPA_ID, then on DIG_LABEL. Note that the keys in each fractions dictionary must match the column's data type; if OCCUPA_ID and DIG_LABEL are numeric columns, use integer keys such as {2705: 0.7, ...}, otherwise no strata match and the sample comes back empty.

  • 42 is the seed in both calls
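
To sanity-check the composition of the result, you can count rows per stratum (a short sketch, reusing the dfsample from above):

dfsample.groupBy("OCCUPA_ID").count().show()
dfsample.groupBy("DIG_LABEL").count().show()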

CodePudding user response:

You can use the sampleBy method on PySpark DataFrames to perform a stratified sample, passing the column name and a dictionary of fractions for the values within that column. For example:

spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()

+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
|     1|     2705|        7|
|     4|     2705|        1|
|     5|     2705|        7|
|     7|     2708|        2|
|    12|     2705|        1|
|    16|     2708|        2|
|    18|     2708|        2|
|    20|     2705|        7|
|    25|     2705|        2|
|    26|     2705|        2|
|    38|     2705|        7|
|    40|     2705|        1|
|    44|     2705|        2|
|    48|     2708|        7|
|    50|     2708|        2|
|    53|     2705|        1|
|    57|     2705|        1|
|    58|     2712|        1|
|    61|     2705|        2|
|    63|     2708|        7|
+------+---------+---------+
only showing top 20 rows

Since you want one PySpark DataFrame with sampling performed on two different columns, you can chain the sampleBy calls together:

spark_stratified_sample_df = (
    spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42)
            .sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)
)
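
Keep in mind that sampleBy draws each row independently, so the requested fractions are only approximate, chained calls compound, and the final mix on each column depends on the joint distribution of the two columns; you will not get exactly 100k rows. If you need to land near a fixed sample size, one option is to derive each stratum's fraction from its row count. A sketch, assuming the spark_df from above (target_total and occupa_weights are illustrative names):

target_total = 100_000
occupa_weights = {2705: 0.7, 2708: 0.2, 2712: 0.1}

# Count the rows available in each stratum of the full DataFrame.
counts = {row["OCCUPA_ID"]: row["count"]
          for row in spark_df.groupBy("OCCUPA_ID").count().collect()}

# Desired rows per stratum divided by rows available, capped at 1.0.
fractions = {k: min(1.0, target_total * w / counts[k])
             for k, w in occupa_weights.items()}

spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions, seed=42)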