Multi-Label Classification
gender age weight height vitamin_A vitamin_C vitamin_D
0 55 64 128 0 1 0
0 54 72 135 0 1 0
0 82 70 150 1 1 1
0 82 70 150 1 1 1
0 59 64 107 0 1 0
The features are gender, age, weight, and height; the labels are vitamin_A, vitamin_C, and vitamin_D.
X = df[['gender', 'age', 'weight', 'height']]
y = df[['vitamin_A', 'vitamin_C', 'vitamin_D']]
I built a simple multi-label classification model with OneVsRestClassifier.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
LR_pipeline = Pipeline([('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1))])
labels = ['vitamin_A', 'vitamin_C', 'vitamin_D']
acc_classifier = []
for label in labels:
    LR_pipeline.fit(X_train, y_train[label])
    prediction = LR_pipeline.predict(X_test)
    acc = accuracy_score(y_test[label], prediction)
    acc_classifier.append(acc)
df_ = pd.DataFrame({'Label': labels, 'Accuracy': acc_classifier})
df_
Label Accuracy
0 vitamin_A 0.75
1 vitamin_C 0.65
2 vitamin_D 1.00
The original code is in code link. The data is in data link.
But I do not know how to use the model on new inputs. I tried, but the result seems wrong: every time I run it, I get the same output, 1, 1, 1.
data_test = [[0, 82, 70, 150]]
for label in labels:
    y_predict = LR_pipeline.predict(data_test)
    print(y_predict)
The result is [1] [1] [1] every time, even when I change the input numbers.
My expected result is:
Input: gender=0, age=55, weight=64, height=128
Result 1 (labels): vitamin A is 0, vitamin C is 1, vitamin D is 0
Result 2 (probabilities): vitamin A is 0.64, vitamin C is 0.82, vitamin D is 0.34
vitamin_A vitamin_C vitamin_D vitamin_A_prob vitamin_C_prob vitamin_D_prob
0 1 0 0.64 0.82 0.34
CodePudding user response:
First, you're fitting the same model multiple times. Each call to fit reinitializes the model and discards any previously trained parameters, so after your loop the pipeline only holds the classifier trained on the last label ('vitamin_D'). That is also why your prediction loop prints the same output three times.
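If you really wanted one independently fitted model per label, you would have to keep a separate copy per label, for example with sklearn.base.clone. A minimal sketch (the models dict is my own naming, not from your code):

from sklearn.base import clone

# Keep one fresh, independently fitted pipeline per label instead of
# overwriting a single pipeline on every loop iteration.
models = {}
for label in labels:
    model = clone(LR_pipeline)          # unfitted copy with the same parameters
    model.fit(X_train, y_train[label])
    models[label] = model
# models['vitamin_A'].predict(data_test) now uses the vitamin_A classifier.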
Second, the provided dataset is not strictly multi-label, because one row of column "vitamin_A" contains a "2". Assuming this is just a typo, you can call OneVsRestClassifier's fit on the whole label matrix directly; there is no need to fit once per label. Just run:
LR_pipeline.fit(X_train, y_train)
prediction = LR_pipeline.predict(X_test)
# Subset accuracy: a sample counts as correct only if all labels match.
subset_accuracy = accuracy_score(y_test, prediction)
# Per-label accuracy, one score per vitamin column.
accuracy_per_label = [accuracy_score(y_test[l], prediction[:, i]) for i, l in enumerate(labels)]
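Since you also want per-label probabilities (your "Result 2"), note that OneVsRestClassifier exposes predict_proba whenever the base estimator supports it, and LogisticRegression does. A minimal sketch on top of the fitted pipeline above:

# Shape (n_samples, n_labels); column i holds the probability that label i is 1.
probabilities = LR_pipeline.predict_proba(X_test)
# Per-label probabilities for the first test sample:
print(dict(zip(labels, probabilities[0].round(2))))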
OneVsRestClassifier does exactly what your training loop is doing: it trains one binary classifier per label, treating each label as an independent binary classification problem. In the multi-label setting this strategy is more commonly called the Binary Relevance method.
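To make that equivalence concrete, here is a sketch of Binary Relevance by hand next to OneVsRestClassifier; both end up with one LogisticRegression per label column (variable names are illustrative):

# Binary Relevance by hand: one independent LogisticRegression per label.
manual_models = [LogisticRegression(solver='sag').fit(X_train, y_train[l]) for l in labels]

# OneVsRestClassifier builds the same thing internally:
ovr = OneVsRestClassifier(LogisticRegression(solver='sag')).fit(X_train, y_train)
# ovr.estimators_ is a list of three fitted LogisticRegression models, one per label.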
CodePudding user response:
Convert the pandas DataFrames to NumPy arrays with .to_numpy() (the label array then has shape (n_samples, n_classes)) and fit once on the whole label matrix; there is no need to loop over each category. This works for me, and different test inputs give different predictions:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

df = pd.read_csv("vitamin.csv")
X = df[['gender', 'age', 'weight', 'height']]
y = df[['vitamin_A', 'vitamin_C', 'vitamin_D']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=42)

# Fit on the training split with the full (n_samples, n_classes) label matrix.
clf = OneVsRestClassifier(LogisticRegression(solver='sag')).fit(X_train.to_numpy(), y_train.to_numpy())

data_test1 = [[0, 57, 79, 145]]
data_test2 = [[0, 59, 64, 107]]
data_test3 = [[0, 89, 74, 107]]
y_predict1 = clf.predict(data_test1)
y_predict2 = clf.predict(data_test2)
y_predict3 = clf.predict(data_test3)
print(*y_predict1)
print(*y_predict2)
print(*y_predict3)
[1 1 0]
[0 0 0]
[0 0 0]
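To also get the table format from the question (hard 0/1 predictions plus a probability per label), combine predict with predict_proba. A minimal sketch using the clf fitted above; the result layout simply mirrors the question's expected output:

labels = ['vitamin_A', 'vitamin_C', 'vitamin_D']
pred = clf.predict(data_test1)           # hard 0/1 labels, shape (1, 3)
proba = clf.predict_proba(data_test1)    # per-label probabilities, shape (1, 3)

result = pd.DataFrame(pred, columns=labels)
for i, label in enumerate(labels):
    result[label + '_prob'] = proba[:, i].round(2)
print(result)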