I am working on a supervised learning problem and want to build a complete Pipeline from beginning to end, starting with the train-test split. I wrote a custom class that wraps sklearn's train_test_split so it can be used as a step in an sklearn pipeline. Its fit_transform returns the training set. I still want to access the test set later, so I stored it as an instance variable in the custom transformer class, like this:
self.test_set = test_set
from sklearn.model_selection import train_test_split

class train_test_splitter([...]):
    [...
    ...]
    def transform(self, X):
        # Split on every call and keep the held-out part on the instance
        train_set, test_set = train_test_split(X, test_size=0.2)
        self.test_set = test_set
        return train_set
from sklearn.pipeline import Pipeline

split_pipeline = Pipeline([
    ('splitter', train_test_splitter()),
])
df_train = split_pipeline.fit_transform(df)
Now I want to get the test set like this:
df_test = splitter.test_set
It's not working. How do I access the attributes of the "splitter" instance? Where is it stored?
CodePudding user response:
You can access the steps of a pipeline in a number of ways. For example,
split_pipeline['splitter'].test_set
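As a rough sketch of those access routes, the snippet below rebuilds a minimal version of the transformer and fetches the same step object in four ways. The class name TrainTestSplitter, its fit method, the is_fitted_ marker, the fixed random_state and the toy NumPy array are all assumptions made for illustration; the original class body is elided in the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class TrainTestSplitter(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the asker's train_test_splitter class."""

    def fit(self, X, y=None):
        # Nothing to learn; the trailing-underscore attribute only marks
        # the estimator as fitted for sklearn's fitted-state checks.
        self.is_fitted_ = True
        return self

    def transform(self, X):
        # random_state is an assumption, added so the sketch is reproducible
        train_set, test_set = train_test_split(X, test_size=0.2, random_state=0)
        self.test_set = test_set
        return train_set


X = np.arange(20).reshape(10, 2)  # toy data: 10 rows, 2 columns
split_pipeline = Pipeline([("splitter", TrainTestSplitter())])
X_train = split_pipeline.fit_transform(X)

# Four ways to reach the same step object
by_name = split_pipeline["splitter"]
by_named_steps = split_pipeline.named_steps["splitter"]
by_position = split_pipeline.steps[0][1]
by_index = split_pipeline[0]

# The attribute set inside transform lives on that step object
X_test = by_name.test_set
```

The bare name splitter in the question fails because it is only the string label of the step, not a Python variable; the instance itself is held in the pipeline's steps list.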
That said, I don't think this is a good approach. Once you add more steps to the pipeline, everything will work as you want at fit time. But when you predict on or transform other data, the pipeline still calls your transform method, which generates a new train-test split, overwrites the stored test set, and sends only the new training portion down the pipe to the remaining steps.
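The problem can be reproduced with a small sketch. It again uses a hypothetical TrainTestSplitter (the fit method, the is_fitted_ marker, random_state and the toy arrays are assumptions, not the asker's actual code) and calls transform on new data after fitting.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class TrainTestSplitter(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the asker's train_test_splitter class."""

    def fit(self, X, y=None):
        self.is_fitted_ = True  # marks the estimator as fitted
        return self

    def transform(self, X):
        # Runs on EVERY transform call, not only during fitting
        train_set, test_set = train_test_split(X, test_size=0.2, random_state=0)
        self.test_set = test_set
        return train_set


split_pipeline = Pipeline([("splitter", TrainTestSplitter())])

# Fit time: 10 rows go in, 8 come out, 2 are stored as the test set
X = np.arange(20).reshape(10, 2)
X_train = split_pipeline.fit_transform(X)
first_test_set = split_pipeline["splitter"].test_set.copy()

# Later: transforming unseen data splits again
X_new = np.arange(100, 110).reshape(5, 2)
X_new_out = split_pipeline.transform(X_new)
second_test_set = split_pipeline["splitter"].test_set

# Part of the new data never reaches the later steps...
rows_dropped = X_new.shape[0] - X_new_out.shape[0]
# ...and the test set saved at fit time has been replaced
test_set_overwritten = not np.array_equal(first_test_set, second_test_set)
```

A common alternative is to call train_test_split once, outside the pipeline, and then fit the pipeline on the training part only.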