So I thought I was smart (turns out I am not). I would like to use the train_test_split
function to write its output directly into a dictionary that I created beforehand:
{"train":{"x":None, "y":None},
"val":{"x":None, "y":None},
"test":{"x":None, "y":None}}
I would like to fill each "x" and "y" with the relevant data using the aforementioned split function. However, when I do this, all "x" entries in the dictionary receive the same data, so datasets["train"]["x"] has the same values as datasets["test"]["x"]. You can see that this is true with the example below.
from sklearn.model_selection import train_test_split
import numpy as np
x = np.random.rand(200,10)
y = np.random.rand(200)
splits = [0.8,0.1,0.1]
sets = ["train","val", "test"]
datasets = dict(zip(sets[:len(splits)], [{"x":None, "y":None}]*len(splits) ))
datasets["train"]["x"],_,_,_ = train_test_split(x, y, train_size=0.8, random_state=42)
print(datasets["train"]["x"].shape[0],datasets["test"]["x"].shape[0])
I guess my approach does not work. I would like to know why, for learning purposes, and also how to achieve what I set out to do.
datasets["train"]["x"],datasets["val"]["x"],datasets["train"]["y"],datasets["val"]["y"] = train_test_split(x, y, train_size=0.8, random_state=42)
print(datasets["train"]["x"].shape[0],datasets["test"]["x"].shape[0])
This should (in my mind) result in datasets["test"]["x"] still being None.
CodePudding user response:
Works for me:
from sklearn.model_selection import train_test_split
import numpy as np
x = np.random.rand(200,10)
y = np.random.rand(200)
q = {'a': {'a':None, 'b': None}, 'b': {'a':None, 'b': None}}
q['a']['a'], q['a']['b'], q['b']['a'], q['b']['b'] = train_test_split(x, y, train_size=0.8, random_state=42)
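For the three-way train/val/test split from the question, one possible sketch is to call train_test_split twice: once to carve off the training set, then again to halve what is left. The names x_rest and y_rest are placeholders chosen here for illustration, and the 0.5 in the second call assumes the 0.8/0.1/0.1 ratios from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.rand(200, 10)
y = np.random.rand(200)

# One separate inner dict per split name.
datasets = {name: {"x": None, "y": None} for name in ("train", "val", "test")}

# First call: 80% goes to train, the remaining 20% is held out.
(datasets["train"]["x"], x_rest,
 datasets["train"]["y"], y_rest) = train_test_split(
    x, y, train_size=0.8, random_state=42)

# Second call: halve the held-out part into val and test
# (each ends up with 10% of the original data).
(datasets["val"]["x"], datasets["test"]["x"],
 datasets["val"]["y"], datasets["test"]["y"]) = train_test_split(
    x_rest, y_rest, train_size=0.5, random_state=42)

# Number of rows that landed in each split.
sizes = {name: d["x"].shape[0] for name, d in datasets.items()}
print(sizes)
```

Unpacking straight into the dictionary entries works here because every key has its own inner dictionary.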
CodePudding user response:
The issue stems from how I initialize my dictionary:
splits = [0.8,0.1,0.1]
datasets = dict(zip(sets[:len(splits)], [{"x":None, "y":None}]*len(splits) ))
results in the behavior described above. To fix this, you can generate the dictionary with a comprehension instead:
splits = [0.8,0.1,0.1]
datasets = {sets[i]:{"x": None, "y":None} for i in range(len(splits) )}
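The first way does lead to shared memory: multiplying a list copies references, not the objects they point to, so [{"x":None, "y":None}]*len(splits) holds the same inner dictionary three times and every key in datasets ends up pointing at that one object. The comprehension evaluates the {"x": None, "y":None} literal once per key, so each key gets its own dictionary.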
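A small check of the difference between the two ways of building the dictionary, using only plain Python. The variable names shared and separate and the string "train data" are made up for this illustration.

```python
sets = ["train", "val", "test"]

# List multiplication repeats a reference to ONE inner dict.
shared = dict(zip(sets, [{"x": None, "y": None}] * len(sets)))
shared["train"]["x"] = "train data"
print(shared["test"]["x"])                    # train data
print(shared["train"] is shared["test"])      # True

# The comprehension builds a new inner dict for every key.
separate = {name: {"x": None, "y": None} for name in sets}
separate["train"]["x"] = "train data"
print(separate["test"]["x"])                  # None
print(separate["train"] is separate["test"])  # False
```

The `is` comparison tests object identity, so it shows directly whether two keys refer to the same dictionary.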