I'm trying to write code that predicts the "pr10tournaments" column from my CSV data, but I am running into an error that says
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Here is the code
from os import sep
import sklearn
import tensorflow
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from matplotlib import style
from sklearn import linear_model
from sklearn.utils import shuffle
data = pd.read_csv("pr.csv", sep=";")
data = data[["pr10tournaments", "pr10tournamentsv2", "pr10tournamentsv3", "pr10tournamentsv4", "pr10tournamentsv5"]]
predict = "pr10tournaments"
x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test,y_test)
print(acc)
Here is my csv file
playername;pr10tournaments;pr10tournamentsv2;pr10tournamentsv3;pr10tournamentsv4;pr10tournamentsv5;
"REET";"11410";"4680";"18482";"2345";7175
"Cented";"16225";"8122";"16445";"12740";5897
"Deyy";"10995";"9187";"21375";"6180";13862
"Edgey";"22150";"7087";"17612";5792
I am new to machine learning. I'm not certain, but I believe the error is caused by the numbers being too large. If that is the case, is there some way around it? Thanks.
CodePudding user response:
The issue is caused by the first line of your CSV file: it is trying to process the strings as floats. I am assuming those strings just identify each column, so I would remove that line.
CodePudding user response:
None of the values in your example data is too large for float64. You just need to handle the missing values, for example by imputing them, because LinearRegression cannot deal with NaN. For example, line 5 of your example data has only five values, while the first row defines six columns. As a result, that line will contain a NaN when it is read by pandas.
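As a rough sketch of that idea, the snippet below loads the sample rows from the question, counts the missing values per column, and shows two common ways to deal with them: dropping incomplete rows or filling them with the column mean. Reading from an in-memory string and the variable names (csv_text, nan_counts, dropped, X_imputed) are my own assumptions for illustration; mean imputation is just one possible strategy, not necessarily the right one for your data.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Sample rows copied from the question; the last row is one value short.
csv_text = """playername;pr10tournaments;pr10tournamentsv2;pr10tournamentsv3;pr10tournamentsv4;pr10tournamentsv5;
"REET";"11410";"4680";"18482";"2345";7175
"Cented";"16225";"8122";"16445";"12740";5897
"Deyy";"10995";"9187";"21375";"6180";13862
"Edgey";"22150";"7087";"17612";5792
"""

cols = ["pr10tournaments", "pr10tournamentsv2", "pr10tournamentsv3",
        "pr10tournamentsv4", "pr10tournamentsv5"]
data = pd.read_csv(io.StringIO(csv_text), sep=";")[cols]

# Count missing values per column to locate the problem.
nan_counts = data.isna().sum()
print(nan_counts)

# Option 1: drop every row that contains a NaN.
dropped = data.dropna()

# Option 2: keep all rows and fill each NaN with its column mean.
predict = "pr10tournaments"
X = data.drop(columns=[predict]).to_numpy(dtype=float)
y = data[predict].to_numpy(dtype=float)
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# With no NaN left, fitting should no longer raise the ValueError.
model = LinearRegression().fit(X_imputed, y)
```

With a real train/test split, you would normally fit the imputer on the training rows only and then apply it to the test rows, so that the test data does not influence the fill values.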