Pipelines are generally used like this: pipe.fit(X_train, y_train). All transformer steps are fitted on, and applied to, X_train; y_train is only used for fitting the model. How can I construct a pipeline that transforms y_train? My y contains the values ">=50k" and "<50k", and I want to use LabelEncoder as the transformer.
X = df.drop('income', axis=1)
y = df[['income']]
y_preprocessing = Pipeline([
("labelencoder", LabelEncoder())
])
preprocessing = ColumnTransformer([
("y_preprocessing", y_preprocessing, ['income'])
])
When I call
y_preprocessing.fit(y)
it raises a TypeError:
TypeError: fit() takes 2 positional arguments but 3 were given
When I call
preprocessing.fit(y)
it also raises a TypeError:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
CodePudding user response:
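Conceptually, you don't need your labels/targets in the pipeline. Yes, you may need to apply LabelEncoder to y_train. But consider what happens after training, when you want to predict: pipe.predict(X) receives only features, so there is no y for a target-transforming step to work on. This is also why your code fails: Pipeline and ColumnTransformer call each step as fit(X, y) or fit_transform(X, y), while LabelEncoder's methods accept only y.
The sklearn Pipeline is also often used for hyperparameter tuning, which likewise does not apply to targets.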
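To illustrate the tuning point, here is a minimal sketch in which only the feature steps and the model sit in the pipeline and the target is passed alongside. The toy data, the column names (age, hours_per_week), the step names and the parameter grid are all made up for the example. It also relies on the fact that scikit-learn classifiers accept string class labels directly, so for a classifier the encoding step is optional.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the real df from the question is not available here
X = pd.DataFrame({
    "age": [22, 25, 29, 31, 45, 50, 55, 60],
    "hours_per_week": [20, 30, 35, 25, 45, 50, 60, 40],
})
y = pd.Series(["<50k"] * 4 + [">=50k"] * 4, name="income")

# The pipeline only contains steps that operate on X
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Tuned parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X, y)  # y is passed as-is, string labels included

best_c = search.best_params_["model__C"]
predictions = search.predict(X)  # predictions come back as the original strings
```

Every tunable parameter here belongs to a step that sees X; nothing in the grid refers to the target.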
This approach should be suitable for most cases:
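from sklearn.preprocessing import LabelEncoder

X = df.drop('income', axis=1)
y = df['income']  # 1-D Series; LabelEncoder expects a 1-D array of labels
le = LabelEncoder()
y = le.fit_transform(y)  # the string labels become integer codes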
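Putting the pieces together, the following is a rough sketch of the whole workflow: encode the target outside the pipeline, fit the pipeline on X, and map predictions back to the original labels with inverse_transform. The toy DataFrame, the workclass column, the step names and the choice of LogisticRegression are assumptions for illustration, not part of the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the asker's df
df = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61],
    "workclass": ["Private", "Gov", "Private", "Self", "Gov", "Private"],
    "income": ["<50k", ">=50k", ">=50k", ">=50k", "<50k", "<50k"],
})

X = df.drop("income", axis=1)
y = df["income"]  # 1-D Series

# Encode the target outside the pipeline and keep the fitted encoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Feature preprocessing lives in the pipeline, together with the model
features = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
pipe = Pipeline([
    ("features", features),
    ("model", LogisticRegression()),
])
pipe.fit(X, y_encoded)

# At prediction time only X is needed; decode with the same encoder
pred_encoded = pipe.predict(X)
pred_labels = le.inverse_transform(pred_encoded)
```

Keeping le around next to the fitted pipeline is what makes the integer predictions readable again later.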