I have a nice pipeline that does the following:
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ('standard_scaler', MinMaxScaler()),
    ("logistic regression", estimator)
])
The estimator part is this:
estimator = MultiOutputClassifier(
    estimator=LogisticRegression(penalty="l2", C=2)
)
The label DataFrame has shape (1000, 2), and everything works so far.
To tweak the model, I am now trying to add SelectKBest to limit the number of features the model uses. Unfortunately, adding this step to the pipeline:
('feature_selection', SelectKBest(score_func=f_regression, k=9))
returns this error:
ValueError: y should be a 1d array, got an array of shape (20030, 2) instead.
I understand where it comes from: SelectKBest expects a 1d target. Using only one label (1000, 1) solves the issue, but that means I would need to create a separate pipeline for each of the two labels.
Is there any way of including feature selection in this pipeline without resorting to that?
CodePudding user response:
Since you (potentially) want to use a different subset of features for each output, you should just put the SelectKBest in a pipeline with the LogisticRegression inside the MultiOutputClassifier.
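from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Selection and classifier are cloned and fitted together, once per label column,
# so SelectKBest only ever sees a 1d y.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_regression, k=9)),
    ("logistic regression", LogisticRegression(penalty="l2", C=2)),
])
estimator = MultiOutputClassifier(clf)
# ct and OHE are the transformers defined in the question.
pipeline = Pipeline([
    ("first transformer", ct),
    ("second transformer", OHE),
    ('standard_scaler', MinMaxScaler()),
    ("select_and_model", estimator),
])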
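To check that each label really gets its own feature subset, here is a minimal self-contained sketch of the same nesting on synthetic data. Several things in it are assumptions and not part of the original setup: the data comes from make_classification plus a made-up second label, the ct and OHE steps are left out, and f_classif is swapped in for f_regression because the targets here are class labels. LogisticRegression is left on its default L2 penalty, with max_iter raised only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 1000 samples, 20 features, two binary labels.
X, y1 = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
rng = np.random.default_rng(0)
y2 = (X[:, 0] + X[:, 1] + rng.normal(size=1000) > 0).astype(int)
Y = np.column_stack([y1, y2])  # shape (1000, 2), like the label DataFrame

# Inner pipeline: feature selection followed by the classifier.
clf = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif, k=9)),
    ("logistic_regression", LogisticRegression(C=2, max_iter=1000)),
])

# Outer pipeline: shared preprocessing, then one clone of clf per label column.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("select_and_model", MultiOutputClassifier(clf)),
])
pipeline.fit(X, Y)

predictions = pipeline.predict(X)  # one column of predictions per label

# estimators_ holds the fitted inner pipelines, in label-column order.
selected = [
    est.named_steps["feature_selection"].get_support(indices=True)
    for est in pipeline.named_steps["select_and_model"].estimators_
]
for label_idx, cols in enumerate(selected):
    print(f"label {label_idx}: selected feature indices {cols}")
```

The indices refer to the columns that reach the inner pipeline, i.e. after all earlier transformers. With a ColumnTransformer and one-hot encoding in front, they would have to be mapped back to the original column names separately.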