CSV causing problems with 'nan' when formatting. What should I do?

I was trying to build a machine learning model by following this tutorial, which worked fine with the iris dataset. However, when I tried to use my own CSV (for a project), it gave me an error. When I tried a different, unrelated method, the same thing occurred. (The rest of the details are at the bottom.) Here is my code:

# Python version
import sys

from sklearn.metrics import make_scorer

print('Python: {}'.format(sys.version))
# scipy
import scipy

print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy

print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib

print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas

print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn

print('sklearn: {}'.format(sklearn.__version__))

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Load dataset
url = "energy.csv"
#url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)

# Split-out validation dataset
array = dataset.values
X = array[:, 0:8]
y = array[:, 8]


print(y)

My CSV:

18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
17,28,411,0,6250.42,852.33,0,20740.03,568.22,0
16,27,515,0,6053.3,550.3,0,20361.1,550.3,0
15,24,586,491.72,5408.92,245.86,0,17947.78,491.72,0
14,26,653,533.06,6130.19,0,0,18923.63,1066.12,0
13,26,836,805.08,6172.28,0,0,18785.2,1073.44,0
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
11,27,055,1352.75,6222.65,0,0,18397.4,1082.2,0
10,26,236,1311.8,6034.28,0,0,17578.12,1311.8,0
9,26,020,1821.4,3903,0,0,18994.6,1040.8,260.2
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
5,24997,3499.58,5499.34,0,0,13248.41,1499.82,1249.85
4,25100,3765,4769,0,0,13052,1506,2008
3,24651,4190.67,4930.2,0,0,12325.5,1232.55,1972.08
2,12,053,0,1084.77,0,3133.78,6508.62,0,723.18
1,11,500,2070,2415,0,0,4255,690,2070

When I print y in the last line, though, I get this:

[  19.9948    0.        0.        0.        0.        0.        0.
    0.        0.      260.2    1326.9          nan       nan       nan
       nan       nan  723.18   2070.    ]

I don't believe this is supposed to happen (the 'nan' values). I'm not very experienced in this area, so any direction as to what is going on would be really appreciated. Thanks in advance.

CodePudding user response:

Not all rows in your CSV have the same number of fields: most have 10, but five of them have only 9. When a row is shorter than the others, pandas fills the missing trailing values with NaN (short for not-a-number).

In [42]: df = pd.read_csv("/tmp/energy.csv", header=None)
In [43]: df
Out[43]: 
     0      1        2        3        4        5           6         7          8          9
0   18     28   564.00     0.00  6284.08  1713.84     19.9948  19994.80    19.9948    19.9948
1   17     28   411.00     0.00  6250.42   852.33      0.0000  20740.03   568.2200     0.0000
2   16     27   515.00     0.00  6053.30   550.30      0.0000  20361.10   550.3000     0.0000
3   15     24   586.00   491.72  5408.92   245.86      0.0000  17947.78   491.7200     0.0000
4   14     26   653.00   533.06  6130.19     0.00      0.0000  18923.63  1066.1200     0.0000
5   13     26   836.00   805.08  6172.28     0.00      0.0000  18785.20  1073.4400     0.0000
6   12     26    73.00  1303.65  5736.06     0.00      0.0000  17990.37  1042.9200     0.0000
7   11     27    55.00  1352.75  6222.65     0.00      0.0000  18397.40  1082.2000     0.0000
8   10     26   236.00  1311.80  6034.28     0.00      0.0000  17578.12  1311.8000     0.0000
9    9     26    20.00  1821.40  3903.00     0.00      0.0000  18994.60  1040.8000   260.2000
10   8     26   538.00     0.00  4246.08   265.38  13799.7600   6369.12     0.0000  1326.9000
11   7  25800  3354.00  5160.00     0.00     0.00  14964.0000   1290.00  1032.0000        NaN
12   6  26682  3468.66  5603.22     0.00     0.00  14941.9200   1600.92  1067.2800        NaN
13   5  24997  3499.58  5499.34     0.00     0.00  13248.4100   1499.82  1249.8500        NaN
14   4  25100  3765.00  4769.00     0.00     0.00  13052.0000   1506.00  2008.0000        NaN
15   3  24651  4190.67  4930.20     0.00     0.00  12325.5000   1232.55  1972.0800        NaN
16   2     12    53.00     0.00  1084.77     0.00   3133.7800   6508.62     0.0000   723.1800
17   1     11   500.00  2070.00  2415.00     0.00      0.0000   4255.00   690.0000  2070.0000

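See rows 11-15: column 9 is NaN there because those lines contain only 9 fields. You can find more about NaNs in the pandas documentation on working with missing data.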
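If you want to confirm the uneven rows without pandas, one option is to count the fields on each line. The snippet below is only a minimal sketch: `sample` holds four lines copied from the question's CSV, and `field_counts` is a hypothetical helper name, not something from pandas or the tutorial.

```python
import csv
import io
from collections import Counter

# Four lines copied from the question's CSV; with the real file you would
# pass the text of energy.csv instead.
sample = """\
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
8,26,538,0,4246.08,265.38,13799.76,6369.12,0,1326.9
7,25800,3354,5160,0,0,14964,1290,1032
6,26682,3468.66,5603.22,0,0,14941.92,1600.92,1067.28
"""


def field_counts(text):
    """Map each field count to the number of rows that have that many fields.

    Hypothetical helper written for this illustration.
    """
    reader = csv.reader(io.StringIO(text))
    # Skip completely empty lines so they do not show up as 0-field rows.
    return Counter(len(row) for row in reader if row)


counts = field_counts(sample)
print(counts)
```

More than one key in the result means the rows are not all the same width, which is exactly the situation that produces the NaNs.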

Edit: Maybe it's difficult to see with the whole dataframe. Below I show only rows 10 onwards and columns 5 onwards.

In [57]: df.iloc[10:, 5:]
Out[57]: 
         5         6        7        8        9
10  265.38  13799.76  6369.12     0.00  1326.90
11    0.00  14964.00  1290.00  1032.00      NaN
12    0.00  14941.92  1600.92  1067.28      NaN
13    0.00  13248.41  1499.82  1249.85      NaN
14    0.00  13052.00  1506.00  2008.00      NaN
15    0.00  12325.50  1232.55  1972.08      NaN
16    0.00   3133.78  6508.62     0.00   723.18
17    0.00      0.00  4255.00   690.00  2070.00
In [58]: df.iloc[11:16, 9]
Out[58]: 
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
Name: 9, dtype: float64

In [59]: df.iloc[11:16, 9].isna()
Out[59]: 
11    True
12    True
13    True
14    True
15    True
Name: 9, dtype: bool
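As for what to do about it, that depends on why the rows differ. Looking at the numbers, my guess (and it is only a guess) is that the 10-field rows contain a TOTAL written with a thousands comma, so `28,564` was meant to be `28564` and got split into two fields. Under that assumption, a rough sketch of a repair step could look like this. `repair_rows` is a hypothetical helper, and `sample` is a few lines copied from the question.

```python
import csv
import io

import pandas as pd

names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND',
         'NATURAL_GAS', 'COAL', 'OIL']

# A few lines copied from the question's CSV.
sample = """\
18,28,564,0,6284.08,1713.84,19.9948,19994.8,19.9948,19.9948
12,26,073,1303.65,5736.06,0,0,17990.37,1042.92,0
7,25800,3354,5160,0,0,14964,1290,1032
1,11,500,2070,2415,0,0,4255,690,2070
"""


def repair_rows(text, n_fields=len(names)):
    """Return rows with exactly n_fields entries each.

    ASSUMPTION (not confirmed by the asker): a row with one extra field has a
    TOTAL split by a thousands comma, so fields 1 and 2 are glued back together.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        if len(row) == n_fields + 1:
            # "28" + "564" -> "28564"
            row = [row[0], row[1] + row[2]] + row[3:]
        if len(row) != n_fields:
            raise ValueError(f"unexpected number of fields: {row}")
        rows.append(row)
    return rows


# All fields are still strings at this point, so convert them to floats.
df = pd.DataFrame(repair_rows(sample), columns=names).astype(float)
print(df)
print(df.isna().sum().sum())
```

If the guess about the thousands comma is wrong, this would silently produce wrong totals, so check a few rows against the original source of the data. If the missing values really are missing, dropping those rows with `dataset.dropna()` or fixing the file by hand are the simpler options.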