I'm working with a CNN and trying to split my data into training and testing sets. After splitting, I want to use sklearn.preprocessing.StandardScaler to scale my testing data with the parameters fitted on the training data. So before scaling, I need to split the data. I planned to use sklearn.model_selection.train_test_split, but I thought that to use that method I would have to convert my data into a pandas.DataFrame. Since my data are for a CNN, their shapes don't meet the requirements of a DataFrame:
print(x.shape, delta.shape, z.shape, y.shape, non_spatial_data.shape, p.shape, g.shape)
# (15000, 175) (15000, 175) (15000, 175) (15000, 1225) (15000, 264) (15000, 175) (15000, 175)
The above are the shapes of my data after flattening; 15000 is the sample size. As you can see, the different arrays have different lengths, so I can't combine them into a single DataFrame. How can I do the splitting using only numpy? Or is there another way to do the whole splitting and scaling process?
PS: The data I am feeding to the CNN are not actually images; they are data with spatial properties.
CodePudding user response:
Here's a working example:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
c = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

n = 0.2  # fraction of samples that goes into the first split
spl = None
for arr in [a, b, c]:
    if spl is None:
        # Shuffle the indices once, without replacement, so no index appears twice.
        rand_ind = np.random.choice(range(len(arr)), len(arr), replace=False)
        # spl holds the first n fraction of the shuffled indices, remaining holds the rest.
        spl, remaining = np.split(rand_ind, [int(n * len(rand_ind))])
    print([arr[i] for i in spl])
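Because the same index arrays (spl and remaining) are reused for every array, corresponding samples stay together across a, b and c. A minimal follow-up, assuming the variables from the snippet above:

test_parts = [arr[spl] for arr in [a, b, c]]         # the n fraction picked as the test split
train_parts = [arr[remaining] for arr in [a, b, c]]  # the rest as the training split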
CodePudding user response:
From Mr. svfat's answer, I realize I was overthinking, and my question made him overthink too. In summary, to split the data we just need to randomly pick a fixed fraction of the indices and use the same indices to slice each array.
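Building on that, here is a minimal sketch of the whole splitting-and-scaling process using only numpy and sklearn. The array names come from the question; the 20% test ratio, the fixed random seed, and fitting a separate StandardScaler per array are assumptions for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# x, delta, z, y, non_spatial_data, p, g are the flattened (15000, n_features) arrays.
arrays = {"x": x, "delta": delta, "z": z, "y": y,
          "non_spatial_data": non_spatial_data, "p": p, "g": g}

test_ratio = 0.2                    # assumed 80/20 split
rng = np.random.default_rng(42)     # fixed seed for reproducibility (assumption)
perm = rng.permutation(15000)       # one shared shuffle of the sample indices
test_idx, train_idx = np.split(perm, [int(test_ratio * len(perm))])

train, test = {}, {}
for name, arr in arrays.items():
    # The same index arrays slice every array, so samples stay aligned across them.
    scaler = StandardScaler().fit(arr[train_idx])    # fit on the training split only
    train[name] = scaler.transform(arr[train_idx])
    test[name] = scaler.transform(arr[test_idx])     # scale test data with training parameters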