I'm learning pandas in Python.
I'm trying to remove the rows that contain NaN from my data.csv file with data.dropna(), but nothing is removed.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is the content of data.csv:
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is formatted incorrectly. Is that the problem?
CodePudding user response:
The data.csv file is malformed: the fields need to be separated by commas, and a missing value should be left empty.
Corrected data.csv:
Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,2020/12/26,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
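A minimal sketch of why the separators matter, assuming a shortened three-row excerpt of the question's data that is loaded from a string with io.StringIO instead of from data.csv (the variable names are made up for illustration). With the default comma separator, each space-separated line ends up as one string in a single column, so the text NaN is never parsed as a missing value. As an alternative to editing the file, passing sep=r"\s+" tells read_csv to split on whitespace.

```python
import io
import pandas as pd

# Hypothetical excerpt of the original, space-separated file.
space_separated = """Duration Date Pulse Maxpulse Calories
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
45 NaN 100 119 282.0
"""

# Default separator is a comma: each line becomes one string in one column,
# so "NaN" is only part of a longer string and dropna() finds nothing to drop.
one_column = pd.read_csv(io.StringIO(space_separated))
print(one_column.shape)
print(len(one_column.dropna()))

# Splitting on whitespace yields five columns, and NaN is parsed as missing.
parsed = pd.read_csv(io.StringIO(space_separated), sep=r"\s+")
cleaned = parsed.dropna()
print(cleaned)
```

Printing data.shape or data.columns right after read_csv is a quick way to check whether the file was split into the expected columns.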
CodePudding user response:
TL;DR: Try this:
new_data = df.fillna(pd.NA).dropna()
or
import numpy as np
new_data = df.fillna(np.nan).dropna()  # np.NaN was removed in NumPy 2.0; use np.nan
Is that the real CSV file? I don't think so.
The CSV specification [1] does not define how missing values are represented. In my experience, a missing value in a CSV file is represented by nothing between two separators (with a comma as the separator, it looks like ,,).
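As a small illustration of the empty-field convention, here is a sketch that assumes a made-up two-row sample read from a string with io.StringIO. The empty Date field in the second row is read by pandas as a missing value.

```python
import io
import pandas as pd

# Hypothetical two-row sample: the Date field of the second row is empty.
csv_text = """Duration,Date,Pulse
60,'2020/12/21',108
45,,100
"""

df = pd.read_csv(io.StringIO(csv_text))
date_missing = df["Date"].isna().tolist()  # which rows have a missing Date
print(date_missing)
print(len(df.dropna()))                    # rows left after dropping missing values
```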
According to the pandas docs [2], pandas.read_csv accepts an argument na_values:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
If a field in your CSV file contains exactly NaN, pandas recognizes it and reads it as a missing value by default. If your file uses a different marker, pass it through na_values.
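A minimal sketch of na_values, assuming a made-up sample in which missing values are written as the word "missing" (that marker and the variable names are hypothetical, not taken from the question's file).

```python
import io
import pandas as pd

# Hypothetical sample where a missing value is written as "missing".
csv_text = """Duration,Pulse,Calories
60,110,409.1
45,90,missing
"""

# Without na_values, "missing" is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "missing" is read as NaN and Calories becomes a float column.
df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
new_data = df.dropna()
print(new_data)
```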
You can also check the type of an individual cell (where i is the row position and j is the column position):
type(df.iloc[i,j])
Compare with:
type(np.nan) # NumPy NaN
float
type(pd.NA) # pandas NA
pandas._libs.missing.NAType
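The type check above can be sketched as follows, assuming a made-up one-row sample read from a string. pd.isna is used here as a convenience because it returns True for NaN, None, and pd.NA, but False for a string that merely contains the text NaN.

```python
import io
import pandas as pd

# Hypothetical one-row sample whose Calories field is empty.
df = pd.read_csv(io.StringIO("Pulse,Calories\n90,\n"))

cell = df.iloc[0, 1]   # row 0, column 1 (Calories)
print(type(cell))      # a float type when the field was read as NaN
print(pd.isna(cell))   # True: the cell is a real missing value

# A whole unparsed line is just a string, so it does not count as missing.
unparsed = "45 NaN 100 119 282.0"
print(pd.isna(unparsed))
```

If type(df.iloc[i, j]) reports str for a cell that should be numeric, the file was most likely not split into columns as intended.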
[1] https://datatracker.ietf.org/doc/html/rfc4180
[2] https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html