I have two NumPy arrays, x and y. I want y to contain equal numbers of ones and zeros by deleting elements, and the corresponding rows of x need to be removed as well.
Any suggestions are appreciated. Thanks.
import numpy as np
x = np.arange(1, 25).reshape(8, 3)
y = np.random.choice([0, 1], size=(8,1), p=[1./3, 2./3])
print(f'x = {x}')
print(f'y = {y}')
x = [[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]
[13 14 15]
[16 17 18]
[19 20 21]
[22 23 24]]
y = [[1]
[1]
[1]
[0]
[1]
[1]
[1]
[0]]
Desired output
x = [[ 1 2 3]
[ 4 5 6]
[10 11 12]
[22 23 24]]
y = [[1]
[1]
[0]
[0]]
CodePudding user response:
After counting the occurrences of 0 and 1 in the y array with np.unique, we can determine the smaller of the two counts. We then take the row indices where y equals 0 and where y equals 1 (np.where turns the Boolean comparison into indices) and keep only the first count_min of each. Concatenating and sorting these indices gives the rows to keep, in their original order:
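# number of occurrences of each distinct value in y (sorted by value: 0, then 1)
counts = np.unique(y, return_counts=True)[1]
# size of the smaller class
count_min = counts.min()
# row indices of the first count_min zeros and the first count_min ones
mask_zero = np.where(y == 0)[0][:count_min]
mask_one = np.where(y == 1)[0][:count_min]
# merge both index sets and restore the original row order
ind = np.sort(np.concatenate((mask_zero, mask_one)))
x_result = x[ind]
y_result = y[ind]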
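For what it's worth, the same steps can be wrapped in a small helper and checked against the example from the question. The name balance_first is made up here, and this is only a sketch: it assumes y has shape (n, 1) or (n,) and that every class you care about actually occurs in y.

```python
import numpy as np

def balance_first(x, y):
    """Keep the first k rows of each class, k being the smallest class count.

    Hypothetical helper, not part of NumPy.
    """
    labels = np.asarray(y).ravel()
    classes, counts = np.unique(labels, return_counts=True)
    k = counts.min()
    # first k row indices of every class, put back into original row order
    ind = np.sort(np.concatenate(
        [np.flatnonzero(labels == c)[:k] for c in classes]
    ))
    return x[ind], y[ind]

# the arrays from the question, with y written out explicitly
x = np.arange(1, 25).reshape(8, 3)
y = np.array([1, 1, 1, 0, 1, 1, 1, 0]).reshape(8, 1)

x_bal, y_bal = balance_first(x, y)
print(x_bal)
print(y_bal.ravel())
```

Looping over whatever np.unique returns means it should also cope with more than two classes, though that goes beyond what the question asks.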
CodePudding user response:
I suggest using pandas:
import numpy as np
import pandas as pd
df = pd.DataFrame(y, columns=['y'])  # convert y (shape (n, 1)) to a dataframe
minelem = int(df['y'].value_counts().min())  # size of the class with the fewest elements
# index_list holds the indices of the rows to keep: minelem rows per class,
# picked at random, so the selection changes from run to run
index_list = np.sort(df.groupby('y').sample(n=minelem).index.to_numpy())
new_x = np.array(x)[index_list]
new_y = np.array(y)[index_list]
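If random selection is what you want but you would rather not pull in pandas, roughly the same thing can probably be done with NumPy's random Generator. This is a sketch under a few assumptions: the function name undersample and the seed argument are my own, and y is assumed to have shape (n, 1) or (n,).

```python
import numpy as np

def undersample(x, y, seed=None):
    """Randomly keep k rows per class, k being the smallest class count.

    Hypothetical helper, not part of NumPy.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(y).ravel()
    classes, counts = np.unique(labels, return_counts=True)
    k = counts.min()
    # draw k distinct row indices from each class
    picked = [
        rng.choice(np.flatnonzero(labels == c), size=k, replace=False)
        for c in classes
    ]
    # sort so the kept rows stay in their original order
    ind = np.sort(np.concatenate(picked))
    return x[ind], y[ind]

x = np.arange(1, 25).reshape(8, 3)
y = np.array([1, 1, 1, 0, 1, 1, 1, 0]).reshape(8, 1)

new_x, new_y = undersample(x, y, seed=0)
print(new_x)
print(new_y.ravel())
```

Passing a fixed seed makes the selection repeatable; leaving it as None gives a different subset on each run.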