Balancing a NumPy array dataset: x and y matrices

I have two NumPy matrices, x and y. I want y to contain an equal number of ones and zeros, which means deleting some of its elements. The corresponding rows of x must be removed as well.

Any suggestions are appreciated. Thanks.

import numpy as np

x = np.arange(1, 25).reshape(8, 3)
y = np.random.choice([0, 1], size=(8, 1), p=[1./3, 2./3])
print(f'x = {x}')
print(f'y = {y}')

x = [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]
 [13 14 15]
 [16 17 18]
 [19 20 21]
 [22 23 24]]
y = [[1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]]

Desired output

x = [[ 1  2  3]
 [ 4  5  6]
 [10 11 12]
 [22 23 24]]
y = [[1]
 [1]
 [0]
 [0]]

CodePudding user response:

Count how many times 0 and 1 occur in y with np.unique, and take the smaller of the two counts. Then use np.where to get the row indices where y equals 0 and where y equals 1, and keep only the first count_min indices of each. Concatenating and sorting those indices gives the rows to keep in their original order, and the same index array selects from both x and y:

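# number of occurrences of each class (0 and 1) in y
counts = np.unique(y, return_counts=True)[1]
count_min = counts.min()  # size of the smaller class
# row indices of the first count_min zeros and ones
mask_zero = np.where(y == 0)[0][:count_min]
mask_one = np.where(y == 1)[0][:count_min]
# merge the indices and restore the original row order
ind = np.sort(np.concatenate((mask_zero, mask_one)))
x_result = x[ind]
y_result = y[ind]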
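
As a quick check, the steps above can be wrapped in a small function and run against the y printed in the question. This is only a minimal sketch: the name balance_first is hypothetical (it is not a NumPy function), and y is hard-coded here because the question generates it randomly.

```python
import numpy as np

def balance_first(x, y):
    """Keep the first k rows of each class, where k is the size of the smallest class."""
    labels = np.ravel(y)  # flatten (n, 1) labels to shape (n,)
    classes, counts = np.unique(labels, return_counts=True)
    k = counts.min()
    # first k row indices of every class, merged and put back in original order
    keep = np.sort(np.concatenate(
        [np.flatnonzero(labels == c)[:k] for c in classes]
    ))
    return x[keep], y[keep]

x = np.arange(1, 25).reshape(8, 3)
y = np.array([1, 1, 1, 0, 1, 1, 1, 0]).reshape(8, 1)  # the y shown in the question
x_result, y_result = balance_first(x, y)
print(x_result)
print(y_result.ravel())
```

For this input the kept row indices are 0, 1, 3 and 7, which matches the desired output in the question. Looping over the classes returned by np.unique also means the sketch is not limited to two labels.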

CodePudding user response:

I suggest using pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(y, columns=['y'])  # convert y to a DataFrame
minelem = df['y'].value_counts().min()  # size of the smallest class
# randomly sample minelem rows from each class (needs pandas >= 1.1);
# index_list holds the original row indices, sorted to keep the row order.
# The rows are picked at random, so the result can differ between runs.
index_list = np.sort(df.groupby('y').sample(n=minelem).index.to_numpy())

new_x = np.asarray(x)[index_list]
new_y = np.asarray(y)[index_list]
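
If pandas is not otherwise needed, the same kind of random undersampling can probably be done with NumPy alone. The sketch below is one way to do it under stated assumptions: balance_random and its seed parameter are hypothetical names introduced for illustration, and y is hard-coded to the values printed in the question.

```python
import numpy as np

def balance_random(x, y, seed=None):
    """Randomly keep k rows of each class, where k is the size of the smallest class."""
    rng = np.random.default_rng(seed)
    labels = np.ravel(y)  # flatten (n, 1) labels to shape (n,)
    classes, counts = np.unique(labels, return_counts=True)
    k = counts.min()
    # draw k distinct row indices per class, then restore the original row order
    keep = np.sort(np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=k, replace=False)
        for c in classes
    ]))
    return x[keep], y[keep]

x = np.arange(1, 25).reshape(8, 3)
y = np.array([1, 1, 1, 0, 1, 1, 1, 0]).reshape(8, 1)  # the y shown in the question
new_x, new_y = balance_random(x, y, seed=0)
print(new_x)
print(new_y.ravel())
```

Passing a seed makes the selection repeatable. With this y, both rows labelled 0 are always kept, and two of the six rows labelled 1 are drawn at random.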