I have a dataframe with millions of rows, like this:
CLI_ID  OCCUPA_ID  DIG_LABEL
125     2705       1
328     2708       7
400     2712       1
401     2705       2
525     2708       1
I want to take a random sample of 100k rows in which OCCUPA_ID is 70% 2705, 20% 2708, and 10% 2712, and DIG_LABEL is 50% 1, 20% 2, and 30% 7.
How can I get this in Spark, using pyspark?
CodePudding user response:
Use sampleBy instead of the sample function in PySpark, because sample only does simple random sampling without reference to any column, while sampleBy samples according to the values of a column.
sampleBy takes a column, a fractions dictionary, and an optional seed.
Consider:
df_sample = df.sampleBy(column, fractions, seed)
where column is the column you want to stratify the sampling on, fractions maps each value of that column to the sampling ratio for that value (e.g. 10% is written as 0.1), and seed makes the sample reproducible: if you don't pass a seed, you get different rows every time.
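For instance, a minimal sketch of the seed behaviour (assuming OCCUPA_ID is stored as a string, since the fraction keys must match the column's type):

s1 = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42)
s2 = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42)
# With the same seed both calls draw the same rows; without a seed each run differs.
print(s1.count(), s2.count())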
So the answer for your question is:
dfsample = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, 42).sampleBy("DIG_LABEL", {"1": 0.5, "2": 0.2, "7": 0.3}, 42)
It just samples twice, first on OCCUPA_ID and then on DIG_LABEL; 42 is the seed both times.
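As a quick check (just a sketch, assuming the chained result above is stored in dfsample), you can count how many rows of each value survive; note that sampleBy keeps each row independently with the given probability, so the fractions are approximate rather than exact:

dfsample.groupBy("OCCUPA_ID").count().show()  # rows kept per OCCUPA_ID value
dfsample.groupBy("DIG_LABEL").count().show()  # rows kept per DIG_LABEL value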
CodePudding user response:
You can use the sampleBy method on PySpark DataFrames to perform a stratified sample, passing the column name and a dictionary of fractions for the values within that column. For example:
spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()
+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
|     1|     2705|        7|
|     4|     2705|        1|
|     5|     2705|        7|
|     7|     2708|        2|
|    12|     2705|        1|
|    16|     2708|        2|
|    18|     2708|        2|
|    20|     2705|        7|
|    25|     2705|        2|
|    26|     2705|        2|
|    38|     2705|        7|
|    40|     2705|        1|
|    44|     2705|        2|
|    48|     2708|        7|
|    50|     2708|        2|
|    53|     2705|        1|
|    57|     2705|        1|
|    58|     2712|        1|
|    61|     2705|        2|
|    63|     2708|        7|
+------+---------+---------+
only showing top 20 rows
Since you want one pyspark DataFrame with two samplings performed on two different columns, you can chain the sampleBy calls together:
spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)
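Note that sampleBy takes per-value fractions, not absolute counts, so it will not return exactly 100k rows. If you need to land near a target size, one option (a sketch, not part of the original answer; the 70k/20k/10k targets come from applying the question's percentages to 100k, and the keys must match the actual type of OCCUPA_ID, integers here) is to derive the fractions from the current counts:

# target number of rows to keep for each OCCUPA_ID value
desired = {2705: 70_000, 2708: 20_000, 2712: 10_000}
# current number of rows for each OCCUPA_ID value
totals = {row["OCCUPA_ID"]: row["count"]
          for row in spark_df.groupBy("OCCUPA_ID").count().collect()}
# fraction = target / available, capped at 1.0 because sampleBy expects values in [0, 1]
fractions = {k: min(1.0, desired[k] / totals[k]) for k in desired}
sized_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions=fractions, seed=42)

The result size is still approximate, because each row is kept independently with the given probability.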