I have a vector that contains 10 values from sample 1 and 25 values from sample 2.
Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
I want to create a stratified output vector where:
sample 1 is split 80% / 20%: 8 values set to 1 and 2 values set to 0.
sample 2 is split 80% / 20%: 20 values set to 1 and 5 values set to 0.
The expected output would be something like:
Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))
How can I automate this? I can't use the sampling function from scikit-learn because this is not for a machine learning experiment.
Answer:
Here is one way to get your desired result, with reproducible output. We draw random index values, without replacement, for each of the two groups in the input array (fact). Then we create a new output array in which we assign 1 at the locations corresponding to the drawn index values and 0 everywhere else.
import numpy as np
from numpy.random import RandomState

# Fixed seed so the result is reproducible
rng = RandomState(123)

fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
    dtype='int8'
)

# Draw 8 of the 10 indices where fact == 1 and 20 of the 25 where fact == 2
idx_arr = np.hstack(
    (
        rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
        rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
    )
)

# Start from all zeros, then set the drawn positions to 1
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
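If you would rather not hard-code the counts 8 and 20, the same idea can be wrapped in a small helper that works out the number of 1s per group from a fraction. This is only a sketch: the function name stratified_mask and its parameters frac and seed are made up for illustration, it assumes a 1-D label array, it uses Python's round() to turn the fraction into a count (an assumption that matters when the group size times frac is not a whole number), and it uses NumPy's newer default_rng generator instead of RandomState, so the exact positions of the 0s will differ from the output above.

```python
import numpy as np


def stratified_mask(labels, frac=0.8, seed=None):
    """Hypothetical helper: mark round(frac * n) random members of each group with 1."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = np.zeros(labels.shape, dtype=np.int8)
    for value in np.unique(labels):
        # Positions of the current group in the (1-D) label array
        idx = np.flatnonzero(labels == value)
        # Number of 1s for this group; the rounding rule is an assumption
        n_keep = int(round(frac * idx.size))
        # Pick n_keep distinct positions and set them to 1
        out[rng.choice(idx, size=n_keep, replace=False)] = 1
    return out


fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2)
)
mask = stratified_mask(fact, frac=0.8, seed=123)

# Check the split per group: label, number of 1s, group size
for value in np.unique(fact):
    print(value, mask[fact == value].sum(), (fact == value).sum())
# 1 8 10
# 2 20 25
```

Because the loop runs over np.unique(labels), this should also cope with more than two samples without any change.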