I am initializing two multivariate Gaussian distributions like so, and trying to implement a machine learning algorithm to draw a decision boundary between the classes:
import numpy as np
import matplotlib.pyplot as plt
import torch
import random
mu0 = [-2,-2]
mu1 = [2, 2]
cov = np.array([[1, 0],[0, 1]])
X = np.random.randn(10,2)
L = np.linalg.cholesky(cov)
Y0 = mu0 + X @ L.T
Y1 = mu1 + X @ L.T
I have two separated circles and I am trying to stack Y0 and Y1, shuffle them, and then break them into training and testing splits. First I append the class labels to the data, and then stack.
n,m = Y1.shape
class0 = np.zeros((n,1))
class1 = np.ones((n,1))
Y_0 = np.hstack((Y0,class0))
Y_1 = np.hstack((Y1,class1))
data = np.vstack((Y_0,Y_1))
Now when I call random.shuffle(data), the zero class takes over and I get only a small number of class one instances.
random.shuffle(data)
Here is my data before shuffling:
print(data)
[[-3.16184428 -1.89491433 0. ]
[ 0.2710061 -1.41000924 0. ]
[-3.50742027 -2.04238337 0. ]
[-1.39966859 -1.57430259 0. ]
[-0.98356629 -3.02299622 0. ]
[-0.49583458 -1.64067853 0. ]
[-2.62577229 -2.32941225 0. ]
[-1.16005269 -2.76429318 0. ]
[-1.88618759 -2.79178253 0. ]
[-1.34790868 -2.10294791 0. ]
[ 0.83815572 2.10508567 1. ]
[ 4.2710061 2.58999076 1. ]
[ 0.49257973 1.95761663 1. ]
[ 2.60033141 2.42569741 1. ]
[ 3.01643371 0.97700378 1. ]
[ 3.50416542 2.35932147 1. ]
[ 1.37422771 1.67058775 1. ]
[ 2.83994731 1.23570682 1. ]
[ 2.11381241 1.20821747 1. ]
[ 2.65209132 1.89705209 1. ]]
and after shuffling:
data
array([[-0.335667 , -0.60826166, 0. ],
[-0.335667 , -0.60826166, 0. ],
[-0.335667 , -0.60826166, 0. ],
[-0.335667 , -0.60826166, 0. ],
[-2.22547604, -1.62833794, 0. ],
[-3.3287687 , -2.37694753, 0. ],
[-3.2915737 , -1.31558952, 0. ],
[-2.23912202, -1.54625136, 0. ],
[-0.335667 , -0.60826166, 0. ],
[-2.23912202, -1.54625136, 0. ],
[-2.11217077, -2.70157476, 0. ],
[-3.25714184, -2.7679462 , 0. ],
[-3.2915737 , -1.31558952, 0. ],
[-2.22547604, -1.62833794, 0. ],
[ 0.73756329, 1.46127708, 1. ],
[ 1.88782923, 1.29842524, 1. ],
[ 1.77452396, 2.37166206, 1. ],
[ 1.77452396, 2.37166206, 1. ],
[ 3.664333 , 3.39173834, 1. ],
[ 3.664333 , 3.39173834, 1. ]])
Why is random.shuffle deleting my data? I just need all twenty rows to be shuffled, but it is repeating rows and I am losing data. I'm not assigning the result of random.shuffle to a variable; I am simply calling random.shuffle(data). Are there any other ways to simply shuffle my data?
CodePudding user response:
Because the swap that random.shuffle uses does not work on an ndarray:
# Python 3.10.7 random.py
class Random(_random.Random):
    ...
    def shuffle(self, x, random=None):
        ...
        if random is None:
            randbelow = self._randbelow
            for i in reversed(range(1, len(x))):
                # pick an element in x[:i+1] with which to exchange x[i]
                j = randbelow(i + 1)
                x[i], x[j] = x[j], x[i]  # <----------------
        ...
    ...
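Indexing a multi-dimensional array with an integer returns a view of that row instead of a copy. In x[i], x[j] = x[j], x[i] the right-hand side therefore holds two views into the same array. Once x[i] has been overwritten with the contents of x[j], the view that was supposed to carry the old x[i] already shows the new values, so x[j] is assigned its own contents again. The row that was at x[i] is lost and the row at x[j] is duplicated. For more information, you can refer to this question.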
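The failure can be reproduced on a tiny array. The following is a minimal sketch, not taken from the original answer; the array values and the variable names (a, b, swapped_with_views, swapped_with_copy) are made up for illustration. It also shows that swapping through integer-array ("fancy") indexing works, because that kind of indexing copies the selected rows.

```python
import numpy as np

# Tuple swap on rows: the right-hand side holds two views, not copies.
a = np.array([[1, 1], [2, 2]])
a[0], a[1] = a[1], a[0]
# a[0] is overwritten first; the view standing in for the old a[0] now
# shows the new values, so a[1] is assigned [2, 2] again.
swapped_with_views = a.copy()  # both rows are [2, 2]; [1, 1] is gone

# Indexing with a list of row numbers copies the rows before assignment.
b = np.array([[1, 1], [2, 2]])
b[[0, 1]] = b[[1, 0]]
swapped_with_copy = b.copy()   # rows are [2, 2] then [1, 1]

print(swapped_with_views)
print(swapped_with_copy)
```

This is the same effect seen in the question: every swap inside random.shuffle can overwrite one row with a duplicate of another.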
A better choice is numpy.random.Generator.shuffle:
>>> data
array([[-1.88985877, -2.97312795, 0. ],
[-1.52352452, -2.19633099, 0. ],
[-2.06297352, -1.36627294, 0. ],
[-1.47460488, -2.09410403, 0. ],
[-1.18753167, -1.71069966, 0. ],
[-1.92878766, -1.19545861, 0. ],
[-2.4858627 , -2.66525855, 0. ],
[-2.97169999, -1.46985506, 0. ],
[-2.11395907, -2.19108576, 0. ],
[-2.63976951, -1.66742147, 0. ],
[ 2.11014123, 1.02687205, 1. ],
[ 2.47647548, 1.80366901, 1. ],
[ 1.93702648, 2.63372706, 1. ],
[ 2.52539512, 1.90589597, 1. ],
[ 2.81246833, 2.28930034, 1. ],
[ 2.07121234, 2.80454139, 1. ],
[ 1.5141373 , 1.33474145, 1. ],
[ 1.02830001, 2.53014494, 1. ],
[ 1.88604093, 1.80891424, 1. ],
[ 1.36023049, 2.33257853, 1. ]])
>>> rng = np.random.default_rng()
>>> rng.shuffle(data, 0)
>>> data
array([[-1.92878766, -1.19545861, 0. ],
[-2.97169999, -1.46985506, 0. ],
[ 2.07121234, 2.80454139, 1. ],
[ 1.36023049, 2.33257853, 1. ],
[ 1.93702648, 2.63372706, 1. ],
[-2.11395907, -2.19108576, 0. ],
[-2.63976951, -1.66742147, 0. ],
[ 1.02830001, 2.53014494, 1. ],
[ 2.11014123, 1.02687205, 1. ],
[ 1.88604093, 1.80891424, 1. ],
[-1.47460488, -2.09410403, 0. ],
[ 2.52539512, 1.90589597, 1. ],
[-1.18753167, -1.71069966, 0. ],
[-1.88985877, -2.97312795, 0. ],
[ 2.81246833, 2.28930034, 1. ],
[-2.06297352, -1.36627294, 0. ],
[ 1.5141373 , 1.33474145, 1. ],
[-2.4858627 , -2.66525855, 0. ],
[-1.52352452, -2.19633099, 0. ],
[ 2.47647548, 1.80366901, 1. ]])
In this example, numpy.random.shuffle also works correctly, because the OP only needs to shuffle along the first axis. However, numpy.random.Generator.shuffle is recommended for new code and also supports shuffling along other axes.
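Since the OP's end goal is a train/test split, here is one possible sketch of shuffling and splitting in one go. It uses numpy.random.Generator.permutation, which returns a shuffled copy instead of shuffling in place. The seed, the 80/20 ratio, and the stand-in data are assumptions for illustration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Stand-in for the stacked array from the question: 10 rows per class,
# two feature columns plus a label column.
features = rng.standard_normal((10, 2))
data = np.vstack((
    np.hstack((features - 2, np.zeros((10, 1)))),
    np.hstack((features + 2, np.ones((10, 1)))),
))

# permutation returns a row-shuffled copy and leaves data untouched.
shuffled = rng.permutation(data, axis=0)

# Assumed 80/20 split; integer arithmetic avoids float rounding.
n_train = len(shuffled) * 4 // 5
train, test = shuffled[:n_train], shuffled[n_train:]

print(train.shape, test.shape)
```

Because whole rows are moved, each label stays attached to its features, and every one of the twenty rows appears exactly once across train and test.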