Stratified Sampling in Python without scikit-learn

I have a vector that contains 10 values from sample 1 and 25 values from sample 2.

Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))

I want to create a stratified output vector where:

sample 1 is split 80/20: 8 values set to 1 and 2 values set to 0.

sample 2 is split 80/20: 20 values set to 1 and 5 values set to 0.

The expected output would look like this:

Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))

How can I automate this? I can’t use the sampling function from scikit-learn because this is not for a machine learning experiment.

CodePudding user response:

Here is one way to get your desired result, with a fixed seed so the output is reproducible. We draw random index values for each of the two groups in the input (fact) array, without replacement: 8 of the 10 indices where fact == 1 and 20 of the 25 indices where fact == 2. Then we create an output array of zeros and assign 1 at the locations of the drawn indices, leaving 0 everywhere else.

import numpy as np
from numpy.random import RandomState

rng = RandomState(123)

fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
    dtype='int8'
)

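# Draw 80% of the positions in each group, without replacement:
# 8 of the 10 indices where fact == 1, and 20 of the 25 where fact == 2.
idx_arr = np.hstack(
    (
        rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
        rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
    )
)

# Start from all zeros and set the drawn positions to 1.
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)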

print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
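
The counts 8 and 20 above are hard-coded for this particular vector. If the group sizes or the split fraction may change, the same idea can be wrapped in a small helper that computes the counts per group. This is only a sketch under some assumptions: the names stratified_mask, keep_frac and seed are made up for illustration, the input is assumed to be one-dimensional, and it uses NumPy's newer np.random.default_rng instead of RandomState, so the exact positions of the 0's will differ from the output shown above even with the same seed.

```python
import numpy as np


def stratified_mask(labels, keep_frac=0.8, seed=None):
    """Return an int8 array with 1 for kept items and 0 for the rest,
    keeping roughly keep_frac of the items in each distinct label group.

    Hypothetical helper; assumes labels is one-dimensional.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = np.zeros(labels.shape, dtype="int8")
    for value in np.unique(labels):
        # Positions of the current group in the input vector.
        idx = np.flatnonzero(labels == value)
        # Number of items to keep; round() rounds exact halves to even.
        n_keep = int(round(keep_frac * idx.size))
        # Draw that many positions without replacement and mark them with 1.
        out[rng.choice(idx, size=n_keep, replace=False)] = 1
    return out


fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2)
)
out = stratified_mask(fact, keep_frac=0.8, seed=123)

# Per-group number of 1's: 8 for sample 1 and 20 for sample 2.
print(out[fact == 1].sum(), out[fact == 2].sum())
```

The per-group sums are a quick sanity check that the split is stratified: whatever the seed, sample 1 gets eight 1's and sample 2 gets twenty, and only their positions change.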