Why am I getting nan values in prediction with one-vs-rest decision tree, sklearn


I want to get an array with the probability of every class in my prediction so that I can plot a ROC curve, but for some samples I get nan for every class.

lettr is the column that I want to predict.

Here is an example dataset:


    import pandas as pd

    df = pd.DataFrame({'lettr': ['T','I','D','N','G','S','B','A','J','M','X','O','G','M','R','F','O','C','T', 'J'],
            'x-box': [2, 5, 4, 7, 2, 4, 4, 1, 2, 11, 3, 6, 4, 6, 5, 6, 3, 7, 6, 2],
            'y-box': [8, 12, 11, 11, 1, 11, 2, 1, 2, 15, 9, 13, 9, 9, 9, 9, 4, 10, 11, 2],
            'width': [3, 3, 6, 6, 3, 5, 5, 3, 4, 13, 5, 4, 6, 8, 5, 5, 4, 5, 6, 3],
            'high': [5, 7, 8, 6, 1, 8, 4, 2, 4, 9, 7, 7, 7, 6, 7, 4, 3, 5, 8, 3],
            'onpix':[1, 2, 6, 3, 1, 3, 4, 1, 2, 7, 4, 4, 6, 9, 6, 3, 2, 2, 5, 1],
            'x-bar':[8, 10, 10, 5, 8, 8, 8, 8, 10, 13, 8, 6, 7, 7, 6, 10, 8, 6, 6, 10],
            'y-bar':[13, 5, 6, 9, 6, 8, 7, 2, 6, 2, 7, 7, 8, 8, 11, 6, 7, 8, 11, 6],
            'x2bar':[0, 5, 2, 4, 6, 6, 6, 2, 2, 6, 3, 6, 6, 6, 7, 3, 7, 6, 5, 3],
            'y2bar':[6, 4, 6, 6, 6, 9, 6, 2, 6, 2, 8, 3, 2, 5, 3, 5, 5, 8, 6, 6],
            'xybar':[6, 13, 10, 4, 6, 5, 7, 8, 12, 12, 5, 10, 6, 7, 7, 10, 7, 11, 11, 12],
            'x2ybr':[10, 3, 3, 4, 5, 6, 6, 2, 4, 1, 6, 7, 5, 5, 3, 5, 6, 7, 9, 4],
            'xy2br':[8, 9, 7, 10, 9, 6, 6, 8, 8, 9, 8, 9, 11, 8, 9, 7, 8, 11, 4, 9],
            'x-ege':[0, 2, 3, 6, 1, 0, 2, 1, 1, 8, 2, 5, 4, 8, 2, 3, 2, 2, 3, 0],
            'xegvy':[8, 8, 7, 10, 7, 8, 8, 6, 6, 1, 8, 9, 8, 9, 7, 9, 8, 8, 12, 7],
            'y-ege':[0, 4, 3, 2, 5, 9, 7, 2, 1, 1, 6, 5, 7, 8, 5, 6, 3, 5, 2, 1],
            'yegvx':[8, 10, 9, 8, 10, 7, 10, 7, 7, 8, 7, 8, 8, 6, 11, 9, 8, 9, 4, 7],
            })

I split my data:


    from sklearn.model_selection import train_test_split

    y = df.iloc[:, 0]
    X = df.iloc[:, 1:]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then I create a prediction:


    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.tree import DecisionTreeClassifier

    RF = OneVsRestClassifier(DecisionTreeClassifier())
    RF.fit(X_train,y_train)
    y_pred = RF.predict(X_test)
    pred_prob = RF.predict_proba(X_test)

y_pred always works fine and gives me an array of predicted classes:


    print(y_pred)

Out:
array(['G', 'T', 'B', 'T'], dtype='<U1')

But pred_prob returns some arrays filled with nan values:


    print(pred_prob)

Out:
array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])

My original training data contains 4k rows, and usually about 500 of them come out as nan. What's more, I do not have this problem with a KNN classifier. There are no missing values in the dataset, and all of my columns are int except for "lettr".

CodePudding user response:

From my perspective, the train/test split of your data does not contain all the target labels in both parts. That is why the classifier gives a "nan" probability: the model was trained on a different set of target labels.

Ex.

  1. We have 4 target labels -> ["A", "B", "C", "D"]
  2. The train data has target labels -> ["A", "B", "C"]
  3. The test data has target labels -> ["A", "B", "D"]
  4. The model was trained to predict classes "A", "B" and "C", but the test data also contains class "D"; the model does not know how to handle that class, so it returns "nan" as the predicted probability.

One way to deal with this is to pass stratify=y to train_test_split so that each class shows up in both the train and the test split (note that stratify expects the label array itself, not True).
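
A minimal sketch, assuming X and y are defined as in the question (random_state=0 is only there to make the split reproducible; stratification also needs at least two samples per class, so it works on the full 4k-row dataset but not on the 20-row example frame):

    # Pass the label array to stratify so class proportions are preserved
    # in both splits and no class is missing from the training data.
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )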

CodePudding user response:

The OneVsRestClassifier fits a separate tree for each class, and normalizes the probabilities given by each tree by dividing by their sum. When all those probabilities are zero, that results in NaN for all of them.
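
For illustration, here is a simplified sketch of that failure mode in plain NumPy (not sklearn's actual internals):

    # If every per-class tree assigns probability 0 to the same sample,
    # normalizing by the row sum divides 0 by 0 and yields nan everywhere.
    import numpy as np

    per_class_scores = np.array([0.0, 0.0, 0.0])             # each tree says "not my class"
    normalized = per_class_scores / per_class_scores.sum()   # 0 / 0 -> nan
    print(normalized)                                        # [nan nan nan]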

It's fairly rare to get predicted probabilities of exactly zero or one, but a fully grown decision tree will get there. Given that your other examples give predicted probabilities of 0 or 1, this seems the likely culprit.

You could reduce overfitting of your trees by setting some hyperparameters, especially max_depth. Or, since decision trees can handle multiclass natively, just drop the OneVsRestClassifier.
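
Both options, sketched under the assumption that X_train, X_test, y_train and y_test come from the split above (max_depth=5 is only an example value):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.multiclass import OneVsRestClassifier

    # Option 1: drop the wrapper; a single decision tree handles a multiclass
    # target natively, and its predict_proba rows always sum to 1.
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    pred_prob = tree.predict_proba(X_test)

    # Option 2: keep one-vs-rest but limit the depth, so the per-class trees
    # are less likely to all output a hard 0 for the same sample.
    ovr = OneVsRestClassifier(DecisionTreeClassifier(max_depth=5))
    ovr.fit(X_train, y_train)
    pred_prob_ovr = ovr.predict_proba(X_test)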
