How to create a Pipeline for preprocessing Y_train?

Time:09-01

Generally, Pipelines are used like this: pipe.fit(X_train, y_train). All transformers are fitted on and applied to X_train; y_train is only used for fitting the model. How can I construct a pipeline that transforms y_train? My y contains the values ">=50k" and "<50k", and I want to use LabelEncoder as the transformer.

X = df.drop('income', axis=1)
y = df[['income']]

y_preprocessing = Pipeline([
    ("labelencoder", LabelEncoder())
])

preprocessing = ColumnTransformer([
    ("y_preprocessing", y_preprocessing, ['income'])
])

When using

y_preprocessing.fit(y)

It gives a TypeError:

TypeError: fit() takes 2 positional arguments but 3 were given

When using

preprocessing.fit(y)

It also gives a TypeError:

TypeError: fit_transform() takes 2 positional arguments but 3 were given

CodePudding user response:

Conceptually, you don't need to have your labels/targets in the pipeline. Yes, you may need to apply LabelEncoder to y_train. But consider what happens after training, when you want to predict: pipe.predict(X) receives only the features, so a step that transforms y would have nothing to work on.

Also, the sklearn pipeline is quite often used for hyperparameter tuning, which likewise does not apply to targets.
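There is also a mechanical reason behind the two TypeErrors in the question. As far as I can tell, a Pipeline calls each step as fit(X, y) or fit_transform(X, y), while LabelEncoder's methods accept only y. A minimal sketch of that mismatch, using a made-up three-value target rather than the real 'income' column:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values standing in for df['income'].
y = ["<50k", ">=50k", "<50k"]

# Used on its own, LabelEncoder takes just the target and works fine.
encoded = LabelEncoder().fit_transform(y)

# Inside a Pipeline, the step is called as fit(X, y): one argument too many.
try:
    Pipeline([("labelencoder", LabelEncoder())]).fit(y)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The same thing happens through ColumnTransformer, which calls fit_transform(X, y) on its transformers.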

The following approach should be suitable for most cases:

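from sklearn.preprocessing import LabelEncoder

X = df.drop('income', axis=1)
y = df['income']  # 1-D Series; LabelEncoder expects a 1-D array of labels

le = LabelEncoder()
y = le.fit_transform(y)  # encode the target once, outside the pipeline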
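To cover the prediction case mentioned above, you can keep the fitted encoder and use inverse_transform to turn predictions back into the original labels. Below is a rough sketch of the whole flow. The toy arrays, StandardScaler and LogisticRegression are placeholders I picked for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data standing in for the question's df.
X_train = np.array([[25.0, 20.0], [30.0, 35.0], [45.0, 50.0], [52.0, 60.0]])
y_train = np.array(["<50k", "<50k", ">=50k", ">=50k"])

# Encode the target once, outside the pipeline.
le = LabelEncoder()
y_encoded = le.fit_transform(y_train)

# The pipeline only preprocesses X and fits the model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_encoded)

# At prediction time only X is available; map the integer
# predictions back to the original string labels.
X_new = np.array([[24.0, 18.0], [55.0, 65.0]])
pred_labels = le.inverse_transform(pipe.predict(X_new))
print(pred_labels)
```

Note that scikit-learn classifiers also accept string labels directly, so for a plain classification task the LabelEncoder step may not be needed at all.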