I was trying to build a machine learning following this tutorial which worked fine with the iris dataset, however, when I tried to use my own CSV (for a project), it gave me an error. When I tried to use a different, unrelated method the same thing occurred. (the rest of the details are at the bottom) here is my code:
# Python version
import sys
from sklearn.metrics import make_scorer
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
# Load dataset
url = "energy.csv"
#url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:8]
y = array[:, 8]
print(y)
my csv:
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070
when I print y in the last line tho I get this:
[ 19.9948 0. 0. 0. 0. 0. 0.
0. 0. 260.2 1326.9 nan nan nan
nan nan 723.18 2070. ]
which I don't believe is supposed to happen (the 'nan' thing). I'm not super experienced in this area so any direction as to what is going on would really be appreciated, thanks in advance.
CodePudding user response:
Not all rows in your CSV have data on all columns (10). When there's missing data, it is represented as NaN (short for not-a-number).
In [42]: df = pd.read_csv("/tmp/energy.csv", header=None)
In [43]: df
Out[43]:
0 1 2 3 4 5 6 7 8 9
0 18 28 564.00 0.00 6284.08 1713.84 19.9948 19994.80 19.9948 19.9948
1 17 28 411.00 0.00 6250.42 852.33 0.0000 20740.03 568.2200 0.0000
2 16 27 515.00 0.00 6053.30 550.30 0.0000 20361.10 550.3000 0.0000
3 15 24 586.00 491.72 5408.92 245.86 0.0000 17947.78 491.7200 0.0000
4 14 26 653.00 533.06 6130.19 0.00 0.0000 18923.63 1066.1200 0.0000
5 13 26 836.00 805.08 6172.28 0.00 0.0000 18785.20 1073.4400 0.0000
6 12 26 73.00 1303.65 5736.06 0.00 0.0000 17990.37 1042.9200 0.0000
7 11 27 55.00 1352.75 6222.65 0.00 0.0000 18397.40 1082.2000 0.0000
8 10 26 236.00 1311.80 6034.28 0.00 0.0000 17578.12 1311.8000 0.0000
9 9 26 20.00 1821.40 3903.00 0.00 0.0000 18994.60 1040.8000 260.2000
10 8 26 538.00 0.00 4246.08 265.38 13799.7600 6369.12 0.0000 1326.9000
11 7 25800 3354.00 5160.00 0.00 0.00 14964.0000 1290.00 1032.0000 NaN
12 6 26682 3468.66 5603.22 0.00 0.00 14941.9200 1600.92 1067.2800 NaN
13 5 24997 3499.58 5499.34 0.00 0.00 13248.4100 1499.82 1249.8500 NaN
14 4 25100 3765.00 4769.00 0.00 0.00 13052.0000 1506.00 2008.0000 NaN
15 3 24651 4190.67 4930.20 0.00 0.00 12325.5000 1232.55 1972.0800 NaN
16 2 12 53.00 0.00 1084.77 0.00 3133.7800 6508.62 0.0000 723.1800
17 1 11 500.00 2070.00 2415.00 0.00 0.0000 4255.00 690.0000 2070.0000
See rows 11-15. You can find more about NaNs in pandas in this FAQ.
Edit: Maybe it's difficult to see with the whole dataframe. Below I only show the 10 onwards, and column 5 onwards.
In [57]: df.iloc[10:, 5:]
Out[57]:
5 6 7 8 9
10 265.38 13799.76 6369.12 0.00 1326.90
11 0.00 14964.00 1290.00 1032.00 NaN
12 0.00 14941.92 1600.92 1067.28 NaN
13 0.00 13248.41 1499.82 1249.85 NaN
14 0.00 13052.00 1506.00 2008.00 NaN
15 0.00 12325.50 1232.55 1972.08 NaN
16 0.00 3133.78 6508.62 0.00 723.18
17 0.00 0.00 4255.00 690.00 2070.00
In [58]: df.iloc[11:16, 9]
Out[58]:
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
Name: 9, dtype: float64
In [59]: df.iloc[11:16, 9].isna()
Out[59]:
11 True
12 True
13 True
14 True
15 True
Name: 9, dtype: bool