I'm trying to work out why my dataframe changes its order once it's converted into an array. Below is my code:
import pandas as pd

# Column names: "output" followed by "1" through "30"
header_list = ["output"] + [str(i) for i in range(1, 31)]
df = pd.read_csv('data.csv', names=header_list)
#Splitting data 70/30 for training and testing sets
trainingdata = df.sample(frac=0.7)
#assigning Y to be the first column, and X as the rest
X = trainingdata.iloc[:,1:].to_numpy()
Y = trainingdata.iloc[:,0].to_numpy().reshape(-1, 1)
print(trainingdata)
output:
     output         1         2  ...        28        29        30
12        0  0.267358  0.373690  ...  0.379725  0.130298  0.195592
27        1  0.313739  0.506595  ...  0.456701  0.375517  0.157156
450       0  0.181693  0.490362  ...  0.112165  0.294500  0.139184
440       0  0.033603  0.531958  ...  0.171821  0.241474  0.338187
54        0  0.197312  0.113967  ...  0.189210  0.255076  0.083169
..      ...       ...       ...  ...       ...       ...       ...
20        1  0.519144  0.348326  ...  0.407216  0.653854  0.039814
231       1  0.428274  0.196145  ...  0.680756  0.286615  0.237439
55        0  0.291968  0.190396  ...  0.334089  0.450227  0.205234
159       1  0.410762  0.456206  ...  0.846048  0.337473  0.307359
117       0  0.232335  0.292188  ...  0.391065  0.361128  0.187656
You can see that my index column is in a random order, whereas my original dataframe is in numerical order. Have I made a mistake in my code that causes this?
CodePudding user response:
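This comes from the sample operation in pandas, not from the conversion to an array. By default it selects rows at random from your dataframe (or columns, with axis=1), and each selected row keeps its original index label, which is why the index looks shuffled.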
Read the pandas documentation for DataFrame.sample for the details.
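To see this for yourself, here is a minimal sketch. It assumes a small made-up dataframe in place of data.csv (the column values are invented for illustration). It suggests that the shuffling happens in sample, that to_numpy keeps whatever row order the sampled frame already has, and that sort_index can put the rows back in their original order if you need that:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: 10 rows, a label column and two features.
df = pd.DataFrame({
    "output": [0, 1] * 5,
    "1": np.arange(10) / 10,
    "2": np.arange(10, 20) / 10,
})

# sample() picks rows at random; the original index labels travel with them.
trainingdata = df.sample(frac=0.7)
print(trainingdata.index.tolist())  # a shuffled subset of 0..9

# to_numpy() keeps the rows in exactly the order they have in trainingdata.
X = trainingdata.iloc[:, 1:].to_numpy()
rows_match = np.array_equal(X[:, 0], trainingdata["1"].to_numpy())

# sort_index() restores the original row order, if that is what you want.
ordered = trainingdata.sort_index()
```

For training a model the shuffled order is usually harmless, so sorting is only needed if something downstream depends on the original order.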
If you want the same selection on every execution of your code (reproducibility), you can use the random_state option.
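As a rough sketch of how that could look, again with a made-up dataframe instead of data.csv and an arbitrary seed of 42 (both are assumptions for illustration): two calls with the same random_state should return the same rows, and dropping the sampled index labels is one way to get the remaining 30% as a test set.

```python
import pandas as pd

# Hypothetical stand-in for data.csv.
df = pd.DataFrame({"output": [0, 1] * 5, "1": range(10)})

# Same seed, same selection; the value 42 is arbitrary.
train_a = df.sample(frac=0.7, random_state=42)
train_b = df.sample(frac=0.7, random_state=42)
same_selection = train_a.index.equals(train_b.index)

# The rows that were not sampled form the remaining 30% (the test set).
testdata = df.drop(train_a.index)
```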