Can a pipeline work for more than one class?


My dataset consists of 10 features (10 columns) as input, and the last 3 columns are 3 different outputs. If I use one column as the output, for example y = newDf.iloc[:, 10].values, it works; but if I use all 3 columns, pipe_lr.fit raises an error: "y should be a 1d array, got an array of shape (852, 3) instead." How can I pass y?

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = newDf.iloc[:, 0:10].values
y = newDf.iloc[:, 10:13].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))

pipe_lr.fit(X_train, y_train)

CodePudding user response:

The pipeline itself does not care about the format of y; it just hands it over to each step. In your case the failing step is the LogisticRegression, which is not set up for multi-output classification. You can handle it with the MultiOutputClassifier wrapper, which fits one classifier per target column:

from sklearn.multioutput import MultiOutputClassifier

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    MultiOutputClassifier(LogisticRegression(random_state=1, solver='lbfgs'))
)
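With the wrapper in place, fit accepts a 2-D y and predict returns one column per target. A minimal usage sketch, reusing the variable names from the question:

pipe_lr.fit(X_train, y_train)        # y_train may now have shape (n_samples, 3)
y_pred = pipe_lr.predict(X_test)     # one prediction column per target
print(y_pred.shape)                  # -> (n_test_samples, 3)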

(There is also a MultiOutputRegressor, and more complex options such as ClassifierChain and RegressorChain; see the User Guide. However, there is, to my knowledge, no built-in way to mix and match regression and classification tasks.)
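For completeness, a minimal sketch of ClassifierChain in the same pipeline. It feeds each estimator's prediction to the next one in the chain; note that it is aimed at multilabel problems, so this sketch assumes each of your three output columns is binary:

from sklearn.multioutput import ClassifierChain

pipe_chain = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    ClassifierChain(LogisticRegression(random_state=1, solver='lbfgs'),
                    order='random', random_state=1)
)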

CodePudding user response:

Simply put, no: LogisticRegression itself does not accept a 2-D y. What you want is multi-output classification, which the estimator does not support out of the box.

You can instead train three separate models, one for each output column.
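A minimal sketch of that approach, assuming pipe_lr is the original pipeline from the question (with a bare LogisticRegression) and that the three targets are the columns of y_train:

from sklearn.base import clone

# Train an independent copy of the pipeline for each output column.
models = []
for i in range(y_train.shape[1]):
    model = clone(pipe_lr)                # fresh, unfitted copy of the pipeline
    model.fit(X_train, y_train[:, i])     # one 1-D target at a time
    models.append(model)

# Predictions per output column, in the same order as the columns of y.
preds = [m.predict(X_test) for m in models]

(This is essentially what the MultiOutputClassifier from the first answer does internally.)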
