Multi-Label Classification
gender age weight height vitamin_A vitamin_C vitamin_D
0 55 64 128 0 1 0
0 54 72 135 0 1 0
0 82 70 150 1 1 1
0 82 70 150 1 1 1
0 59 64 107 0 1 0
The features are gender, age, weight, and height; the labels are vitamin_A, vitamin_C, and vitamin_D.
X = df[['gender', 'age', 'weight', 'height']]
y = df[['vitamin_A', 'vitamin_C', 'vitamin_D']]
I built a simple multi-label classification model with OneVsRestClassifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
LR_pipeline = Pipeline([('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1))])
labels = ['vitamin_A', 'vitamin_C', 'vitamin_D']
acc_classifier = []
for label in labels:
    LR_pipeline.fit(X_train, y_train[label])
    prediction = LR_pipeline.predict(X_test)
    acc = accuracy_score(y_test[label], prediction)
    acc_classifier.append(acc)
df_ = pd.DataFrame({'Label': labels, 'Accuracy': acc_classifier})
df_
Label Accuracy
0 vitamin_A 0.75
1 vitamin_C 0.65
2 vitamin_D 1.00
The original code is in code link. The data is in data link.
But I do not know how to use the model on new inputs. I tried, but the result seems wrong: every time I run it, I get the same output, 1, 1, 1.
data_test = [[0, 82, 70, 150]]
for label in labels:
    y_predict = LR_pipeline.predict(data_test)
    print(y_predict)
The result is [1] [1] [1] every time, even when I change the input numbers.
My expected result is:
Input: gender=0, age=55, weight=64, height=128
Result 1 (labels): vitamin A is 0, vitamin C is 1, vitamin D is 0
Result 2 (probabilities): vitamin A is 0.64, vitamin C is 0.82, vitamin D is 0.34
vitamin_A vitamin_C vitamin_D vitamin_A_prob vitamin_C_prob vitamin_D_prob
0 1 0 0.64 0.82 0.34
CodePudding user response:
First, you're fitting the same model multiple times. Each call to fit reinitializes the model and discards any previously trained parameters, so after your loop the pipeline only holds the classifier trained on the last label ('vitamin_D'). That is also why your prediction loop prints the same output three times.
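If you really wanted one independently fitted model per label, you would have to keep a separate copy per label, for example with sklearn.base.clone. A minimal sketch (the models dict is my own naming, not from your code):

from sklearn.base import clone

# Keep one fresh, independently fitted pipeline per label instead of
# overwriting a single pipeline on every loop iteration.
models = {}
for label in labels:
    model = clone(LR_pipeline)          # unfitted copy with the same parameters
    model.fit(X_train, y_train[label])
    models[label] = model
# models['vitamin_A'].predict(data_test) now uses the vitamin_A classifier.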
Second, the provided dataset is not strictly multi-label, because one row of column "vitamin_A" contains a "2". Assuming this is just a typo, you can call OneVsRestClassifier's fit on the whole label matrix directly; there is no need to fit once per label. Just run:
LR_pipeline.fit(X_train, y_train)
prediction = LR_pipeline.predict(X_test)
# Subset accuracy: a sample counts as correct only if all labels match.
subset_accuracy = accuracy_score(y_test, prediction)
# Per-label accuracy, one score per vitamin column.
accuracy_per_label = [accuracy_score(y_test[l], prediction[:, i]) for i, l in enumerate(labels)]
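Since you also want per-label probabilities (your "Result 2"), note that OneVsRestClassifier exposes predict_proba whenever the base estimator supports it, and LogisticRegression does. A minimal sketch on top of the fitted pipeline above:

# Shape (n_samples, n_labels); column i holds the probability that label i is 1.
probabilities = LR_pipeline.predict_proba(X_test)
# Per-label probabilities for the first test sample:
print(dict(zip(labels, probabilities[0].round(2))))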
OneVsRestClassifier does exactly what your training loop is doing: it trains one binary classifier per label, treating each label as an independent binary classification problem. In the multi-label setting this strategy is more commonly called the Binary Relevance method.
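To make that equivalence concrete, here is a sketch of Binary Relevance by hand next to OneVsRestClassifier; both end up with one LogisticRegression per label column (variable names are illustrative):

# Binary Relevance by hand: one independent LogisticRegression per label.
manual_models = [LogisticRegression(solver='sag').fit(X_train, y_train[l]) for l in labels]

# OneVsRestClassifier builds the same thing internally:
ovr = OneVsRestClassifier(LogisticRegression(solver='sag')).fit(X_train, y_train)
# ovr.estimators_ is a list of three fitted LogisticRegression models, one per label.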
CodePudding user response:
Convert the pandas DataFrames to NumPy arrays with .to_numpy() (the label array then has shape (n_samples, n_classes)) and fit once on the whole label matrix; there is no need to loop over each category. This works for me, and different test inputs give different predictions:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

df = pd.read_csv("vitamin.csv")
X = df[['gender', 'age', 'weight', 'height']]
y = df[['vitamin_A', 'vitamin_C', 'vitamin_D']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

# Fit on the training split with the full (n_samples, n_classes) label matrix.
clf = OneVsRestClassifier(LogisticRegression(solver='sag')).fit(X_train.to_numpy(), y_train.to_numpy())

data_test1 = [[0, 57, 79, 145]]
data_test2 = [[0, 59, 64, 107]]
data_test3 = [[0, 89, 74, 107]]
y_predict1 = clf.predict(data_test1)
y_predict2 = clf.predict(data_test2)
y_predict3 = clf.predict(data_test3)
print(*y_predict1)
print(*y_predict2)
print(*y_predict3)
[1 1 0]
[0 0 0]
[0 0 0]
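To also get the table format from the question (hard 0/1 predictions plus a probability per label), combine predict with predict_proba. A minimal sketch using the clf fitted above; the result layout simply mirrors the question's expected output:

labels = ['vitamin_A', 'vitamin_C', 'vitamin_D']
pred = clf.predict(data_test1)           # hard 0/1 labels, shape (1, 3)
proba = clf.predict_proba(data_test1)    # per-label probabilities, shape (1, 3)

result = pd.DataFrame(pred, columns=labels)
for i, label in enumerate(labels):
    result[label + '_prob'] = proba[:, i].round(2)
print(result)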