Using different features for the same estimator in the pipeline

Time:05-18

I have a nice pipeline that does the following:

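from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# ct and OHE are transformers defined elsewhere (not shown)
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ("standard_scaler", MinMaxScaler()),
    ("logistic regression", estimator)
])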

The estimator part is this:

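from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

estimator = MultiOutputClassifier(
    estimator=LogisticRegression(penalty="l2", C=2)
)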

The label DataFrame has shape (1000, 2), and everything works so far.

To tune the model, I am now trying to add SelectKBest to limit the number of features used. Unfortunately, adding this step to the pipeline:

('feature_selection', SelectKBest(score_func=f_regression, k=9))

returns this error:

ValueError: y should be a 1d array, got an array of shape (20030, 2) instead.

I understand where it comes from, and using only one label (1000, 1) solves the issue, but that means I would need to create two separate pipelines, one for each label.
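
A minimal sketch of where the error comes from, using synthetic data (the array shapes and random labels are assumptions, not the data from this question). f_regression scores every feature against a single target vector, so SelectKBest fails as soon as it receives the two-column label array directly:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.random((100, 20))               # synthetic features (assumed shape)
Y = rng.integers(0, 2, size=(100, 2))   # two synthetic binary label columns

selector = SelectKBest(score_func=f_regression, k=9)

try:
    selector.fit(X, Y)                  # the 2-D y is passed on to f_regression
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# Fitting against one label column at a time works.
X_reduced = selector.fit(X, Y[:, 0]).transform(X)
print(X_reduced.shape)
```

The selection is computed per label column, which is why each label may end up with a different set of features.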

Is there any way of including feature selection in this pipeline without resorting to that?

Answer:

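Since you (potentially) want a different subset of features for each output, put the SelectKBest in a pipeline with the LogisticRegression and wrap that pipeline in the MultiOutputClassifier. MultiOutputClassifier fits one copy of the wrapped estimator per label column, so SelectKBest only ever sees a 1d y.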

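from sklearn.feature_selection import SelectKBest, f_regression

# Selection + model, fitted once per label column.
# f_regression is kept from the question; f_classif is scikit-learn's
# ANOVA score function for classification targets and may fit better here.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_regression, k=9)),
    ("logistic regression", LogisticRegression(penalty="l2", C=2)),
])
estimator = MultiOutputClassifier(clf)

# Shared preprocessing runs once, before the per-label estimators.
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ("standard_scaler", MinMaxScaler()),
    ("select_and_model", estimator),
])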
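
For a self-contained check of this layout, here is a rough sketch on synthetic data. Several things in it are assumptions rather than part of the original setup: the data is random, the ct and OHE steps are left out because their definitions are not shown, f_classif is swapped in for f_regression because the targets are class labels, and the step names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((200, 20))
# Each synthetic label depends on a different feature, so the two
# selections have a reason to differ.
Y = np.column_stack([X[:, 0] > 0.5, X[:, 5] > 0.5]).astype(int)

# Selection + model; MultiOutputClassifier clones this once per label.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif, k=9)),
    ("logistic_regression", LogisticRegression(C=2)),  # L2 penalty is the default
])

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select_and_model", MultiOutputClassifier(clf)),
])
pipeline.fit(X, Y)

predictions = pipeline.predict(X)
print(predictions.shape)

# estimators_ holds one fitted clone of clf per label column.
selected = [
    est.named_steps["feature_selection"].get_support(indices=True)
    for est in pipeline.named_steps["select_and_model"].estimators_
]
for i, idx in enumerate(selected):
    print(f"label {i}: selected feature indices {idx}")
```

Reading get_support from each entry of estimators_ shows which features were kept for each label, which is a quick way to confirm that the two outputs really use different subsets.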