How to resolve: ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.metrics import fbeta_score, make_scorer
import keras.backend as K
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
import pandas as pd

class CustomThreshold(BaseEstimator, ClassifierMixin):
    """ Custom threshold wrapper for binary classification"""
    def __init__(self, base, threshold=0.5):
        self.base = base
        self.threshold = threshold
    def fit(self, *args, **kwargs):
        self.base.fit(*args, **kwargs)
        return self
    def predict(self, X):
        return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

dataset_clinical = np.genfromtxt("/content/drive/MyDrive/Colab Notebooks/BreastCancer-master/Data/stacked_metadata.csv",delimiter=",")
X = dataset_clinical[:,0:450]
Y = dataset_clinical[:,450]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)
rf = RandomForestClassifier(n_estimators=10).fit(X,Y) 
clf = [CustomThreshold(rf, threshold) for threshold in [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]]

for model in clf:
    print(confusion_matrix(y_test, model.predict(X_test)))
for model in clf:
    print(confusion_matrix(Y, model.predict(X)))

The traceback displays the following:

Traceback (most recent call last):
  File "RF.py", line 33, in <module>
    rf = RandomForestClassifier(n_estimators=10).fit(X,Y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 328, in fit
    X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 576, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 968, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 792, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 116, in _assert_all_finite
    type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

CodePudding user response:

This error is raised inside scikit-learn, and the cause depends on what you are doing. I recommend reading the documentation for the functions you are using. One of them might, for example, require a positive definite matrix, and your matrix might not meet that criterion.

Start by checking whether your data contains unexpected values. The first expression should be False and the second should be True:

np.any(np.isnan(your_matrix))
np.all(np.isfinite(your_matrix))
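
Those two checks only tell you that a problem exists. As a rough sketch of the next step, the snippet below locates the offending entries and shows two common ways of handling them. The array X here is a made-up stand-in for your matrix, and whether dropping rows or filling with the column mean is appropriate depends on your data.

```python
import numpy as np

# Hypothetical stand-in for the real feature matrix (not the asker's data)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, np.inf],
              [10.0, 11.0, 12.0]])

# True wherever a value is NaN, +inf or -inf
bad_mask = ~np.isfinite(X)

# (row, column) index of every non-finite entry
bad_positions = np.argwhere(bad_mask)
print(bad_positions)

# Option 1: drop every row that contains a non-finite value
keep_rows = ~bad_mask.any(axis=1)
X_dropped = X[keep_rows]

# Option 2: treat infinities as missing, then fill with the column mean
X_nan = np.where(bad_mask, np.nan, X)
col_means = np.nanmean(X_nan, axis=0)   # mean of each column, ignoring NaN
X_imputed = np.where(np.isnan(X_nan), col_means, X_nan)
```

If you drop rows, apply the same keep_rows mask to the label vector so that X and Y stay aligned.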

CodePudding user response:

At first glance, I would check your dataset for missing values, outliers, and so on.

Data exploration and preprocessing are a big part of building any ML model. Here is a beginner's guide to doing that with pandas: https://towardsdatascience.com/data-visualization-exploration-using-pandas-only-beginner-a0a52eb723d5
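
To make the missing-value check concrete, here is a small sketch using pandas. The CSV text and the column names a, b and label are invented for illustration. It also shows one possible source of NaN values in a setup like the one in the question: np.genfromtxt turns any field it cannot parse as a number, including a header row, into NaN. Whether that applies to your file is an assumption you would need to verify.

```python
import io

import numpy as np
import pandas as pd

# Made-up CSV with a header row and one empty cell (illustration only)
csv_text = "a,b,label\n1.0,2.0,0\n3.0,,1\n"

# np.genfromtxt parses the header as data, so the first row becomes all NaN
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(np.isnan(raw[0]).all())

# pandas treats the first line as column names instead
df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Treat infinities as missing, then drop incomplete rows
clean = df.replace([np.inf, -np.inf], np.nan).dropna()

# Split into features and labels for scikit-learn
X = clean.drop(columns="label").to_numpy()
y = clean["label"].to_numpy()
```

If the file does have a header and you prefer to keep using np.genfromtxt, passing skip_header=1 skips that first line.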