Simultaneously predict fields with sklearn

I want to simultaneously predict the total volume and the volumes of classes A, B, C, D, and 0. Is it possible to do this with the code below? Do I need to add a field to the dataset for what I'm predicting?

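import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('dataAnalysis/data/featureEngineering/data.csv')

# Features: columns 0 through 19
X = dataset.iloc[:, 0:20].values
# Targets: assuming the numbers below are 0-based column positions, the slice
# must end at 27 (the end is exclusive) so that TOTAL_VOLUME is included
y = dataset.iloc[:, 21:27].values
# 21 - 'VOLUME_CLASSE_A'
# 22 - 'VOLUME_CLASSE_B'
# 23 - 'VOLUME_CLASSE_C'
# 24 - 'VOLUME_CLASSE_D'
# 25 - 'VOLUME_CLASSE_0'
# 26 - 'TOTAL_VOLUME'

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# y has one column per target, so the forest is fitted on all targets at once
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X_train, y_train)
test_predictions = regressor.predict(X_test)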

CodePudding user response：

Yes, this is possible with the code in the question, since RandomForestRegressor supports multioutput natively. If you want to try models that don't support multioutput, use either MultiOutputRegressor or RegressorChain from sklearn.multioutput.
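As a quick check of the native multioutput support, here is a minimal sketch on synthetic data. The make_regression call is only a stand-in for the CSV in the question (its 20 features and 6 targets are assumptions chosen to mirror the five class volumes plus the total), so the numbers themselves mean nothing.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 20 features, 6 targets (assumed sizes)
X, y = make_regression(n_samples=200, n_features=20, n_targets=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# A 2-D y is accepted directly; no wrapper is needed for this estimator
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X_train, y_train)

# Predictions come back with one column per target
test_predictions = regressor.predict(X_test)
print(test_predictions.shape)
```

Each row of test_predictions holds the predicted values for all targets of one test sample, in the same column order as y.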

In a nutshell, MultiOutputRegressor fits one independent model per output variable. RegressorChain also fits one model per output variable, but each model additionally uses the outputs earlier in the chain as inputs, which is why it's called a chain. Here is a quick demonstration of both strategies with Ridge on the Linnerud dataset:

from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.linear_model import Ridge

X, y = load_linnerud(return_X_y=True)

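# MultiOutputRegressor: one independent Ridge model per output column
multi = MultiOutputRegressor(Ridge(random_state=123))
multi.fit(X, y)

# RegressorChain
# Starting with the second output variable, then the first, then the third
# (the estimator is passed positionally because its keyword name differs across scikit-learn versions)
chain = RegressorChain(Ridge(random_state=123), order=[1, 0, 2])
chain.fit(X, y)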
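To make the difference between the two strategies visible, the sketch below refits both wrappers and counts how many input columns each underlying Ridge model was trained on. The variable names (multi_n_inputs, chain_n_inputs) are illustrative only, and reading the count from coef_ is just one convenient way to inspect it for a linear model.

```python
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

# Linnerud: 20 samples, 3 features, 3 targets
X, y = load_linnerud(return_X_y=True)

multi = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)
chain = RegressorChain(Ridge(random_state=123), order=[1, 0, 2]).fit(X, y)

# Independent models: every model is trained on the 3 original features only
multi_n_inputs = [est.coef_.shape[0] for est in multi.estimators_]

# Chained models: each later model gets one extra input column
# per target that comes earlier in the chain
chain_n_inputs = [est.coef_.shape[0] for est in chain.estimators_]

print(multi_n_inputs)
print(chain_n_inputs)

# Both wrappers return predictions in the original column order of y
multi_pred = multi.predict(X)
chain_pred = chain.predict(X)
print(multi_pred.shape, chain_pred.shape)
```

The growing input count in the chain is what lets later models exploit correlations between targets, at the cost of making the result depend on the chosen order.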

You can find a list of the models that support multioutput, and more information, in the scikit-learn user guide section on multiclass and multioutput algorithms.