How to fix ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

So I'm trying to write a piece of code that can predict the "pr10tournaments" from the csv data. I am running into an error that says

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Here is the code

from os import sep
import sklearn
import tensorflow
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import pickle
from matplotlib import style
from sklearn import linear_model
from sklearn.utils import shuffle 



data = pd.read_csv("pr.csv", sep=";")

data = data[["pr10tournaments", "pr10tournamentsv2", "pr10tournamentsv3", "pr10tournamentsv4", "pr10tournamentsv5"]]



predict = "pr10tournaments"

x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])



x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

linear = linear_model.LinearRegression()

linear.fit(x_train, y_train)
acc = linear.score(x_test,y_test)
print(acc)

Here is my csv file

playername;pr10tournaments;pr10tournamentsv2;pr10tournamentsv3;pr10tournamentsv4;pr10tournamentsv5;
"REET";"11410";"4680";"18482";"2345";7175
"Cented";"16225";"8122";"16445";"12740";5897
"Deyy";"10995";"9187";"21375";"6180";13862
"Edgey";"22150";"7087";"17612";5792

I am new to machine learning. I'm not certain, but I believe the error is caused by the numbers being too large. If that is the case, is there a way around it? Thanks.

CodePudding user response：

The issue is because of the first line of your csv file. It is trying to process the strings as floats. I am assuming those are just identifying each column, so I would just remove it.

CodePudding user response：

None of the values in your example data are too large for float64. What you need to do is handle the missing values, for example by imputing them, because LinearRegression cannot work with NaN. Line 5 of your example data has only five values, while the first row defines six columns, so that row ends up containing NaN when pandas reads it.
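
Here is a minimal sketch of how the missing value could be located and handled. It assumes the sample rows from the question are inlined as a string rather than read from pr.csv, and the names csv_text, nan_counts, dropped and imputed are only illustrative. Dropping rows and filling with the column mean are two common options, not the only ones; which is appropriate depends on your data.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch runs on its own.
csv_text = '''playername;pr10tournaments;pr10tournamentsv2;pr10tournamentsv3;pr10tournamentsv4;pr10tournamentsv5;
"REET";"11410";"4680";"18482";"2345";7175
"Cented";"16225";"8122";"16445";"12740";5897
"Deyy";"10995";"9187";"21375";"6180";13862
"Edgey";"22150";"7087";"17612";5792
'''

columns = [
    "pr10tournaments",
    "pr10tournamentsv2",
    "pr10tournamentsv3",
    "pr10tournamentsv4",
    "pr10tournamentsv5",
]

data = pd.read_csv(io.StringIO(csv_text), sep=";")[columns]

# Count missing values per column. The short "Edgey" row leaves
# pr10tournamentsv5 empty, so pandas stores NaN there.
nan_counts = data.isna().sum()
print(nan_counts)

# Option 1: drop every row that contains a missing value.
dropped = data.dropna()

# Option 2: fill missing values with the column mean (one possible choice).
imputed = data.fillna(data.mean())
print(imputed)
```

Either dropped or imputed can then be split into x and y and passed to linear.fit in place of the original data, since neither contains NaN anymore.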