Use train_test_split to output into dictionaries


So I thought I was smart (turns out I am not): I would like to use the train_test_split function to output its data directly into a dictionary that I created beforehand.

{"train":{"x":None, "y":None},
 "val":{"x":None, "y":None},
 "test":{"x":None, "y":None}}

I would like to feed the relevant data into each "x" and "y" using the aforementioned split function. However, when I do this, all "x" entries in the dictionary receive the same data, so datasets["train"]["x"] ends up with the same values as datasets["test"]["x"]. You can see that this is true by running the example below.

from sklearn.model_selection import train_test_split
import numpy as np
x = np.random.rand(200,10)
y = np.random.rand(200)

splits =  [0.8,0.1,0.1]
sets = ["train","val", "test"]
datasets = dict(zip(sets[:len(splits)], [{"x":None, "y":None}]*len(splits) ))

datasets["train"]["x"],_,_,_  = train_test_split(x, y, train_size=0.8, random_state=42)
print(datasets["train"]["x"].shape[0], datasets["test"]["x"].shape[0])

I guess my approach does not work, so I would like to know why, for learning purposes, and also how to achieve what I set out to achieve.

datasets["train"]["x"],datasets["val"]["x"],datasets["train"]["y"],datasets["val"]["y"]  = train_test_split(x, y, train_size=0.8, random_state=42)
print(datasets["train"]["x"].shape[0], datasets["test"]["x"].shape[0])

This should (in my mind) result in datasets["test"]["x"] still being None.

CodePudding user response:

This works in my hands:

from sklearn.model_selection import train_test_split
import numpy as np
x = np.random.rand(200,10)
y = np.random.rand(200)

q = {'a': {'a':None, 'b': None}, 'b': {'a':None, 'b': None}}
q['a']['a'], q['a']['b'], q['b']['a'], q['b']['b'] = train_test_split(x, y, train_size=0.8, random_state=42)
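
A sketch extending this to the three-way split asked about in the question, assuming the same 200x10 toy data and an 0.8/0.1/0.1 ratio: chain two calls to train_test_split, carving "val" and "test" out of the held-out 20% (variable names like x_rest are just illustrative).

from sklearn.model_selection import train_test_split
import numpy as np

x = np.random.rand(200, 10)
y = np.random.rand(200)

sets = ["train", "val", "test"]
datasets = {s: {"x": None, "y": None} for s in sets}  # independent inner dicts

# First split: 80% train, 20% held out for val/test.
datasets["train"]["x"], x_rest, datasets["train"]["y"], y_rest = train_test_split(
    x, y, train_size=0.8, random_state=42)

# Second split: divide the held-out 20% evenly into val and test.
(datasets["val"]["x"], datasets["test"]["x"],
 datasets["val"]["y"], datasets["test"]["y"]) = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42)

print(datasets["train"]["x"].shape[0],
      datasets["val"]["x"].shape[0],
      datasets["test"]["x"].shape[0])  # 160 20 20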

CodePudding user response:

The issue stems from how I initialize my dictionary:

splits = [0.8,0.1,0.1]
datasets = dict(zip(sets[:len(splits)], [{"x":None, "y":None}]*len(splits) ))

will result in the behavior described above. To fix this, you can simply generate the dictionary with:

splits = [0.8,0.1,0.1]
datasets = {sets[i]: {"x": None, "y": None} for i in range(len(splits))}

I assume the first way leads to some kind of shared memory, but I am too much of a CS noob to trust my own reasoning on that.
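
That intuition is essentially right: in Python, multiplying a list like [{"x": None, "y": None}] * 3 repeats a reference to the same dict object rather than creating copies, so every split name ends up pointing at one shared inner dictionary. A minimal sketch (the names shared, buggy, and fixed are just illustrative) makes this visible:

# [dict] * 3 repeats a reference to one object, not three copies.
shared = [{"x": None, "y": None}] * 3
print(shared[0] is shared[1])      # True: same object

buggy = dict(zip(["train", "val", "test"], shared))
buggy["train"]["x"] = "assigned"
print(buggy["test"]["x"])          # "assigned": the write shows up under every key

# The dict comprehension evaluates {"x": None, "y": None} once per key,
# so each split name gets its own inner dict.
fixed = {s: {"x": None, "y": None} for s in ["train", "val", "test"]}
fixed["train"]["x"] = "assigned"
print(fixed["test"]["x"])          # None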
