data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm learning pandas in Python.

I'm trying to remove the NaN elements from my data.csv file with data.dropna(), but it isn't removing them.

import pandas as pd

data = pd.read_csv('data.csv')

new_data = data.dropna()

print(new_data)

This is data.csv content.

Duration          Date  Pulse  Maxpulse  Calories
      60  '2020/12/01'    110       130     409.1
      60  '2020/12/02'    117       145     479.0
      60  '2020/12/03'    103       135     340.0
      45  '2020/12/04'    109       175     282.4
      45  '2020/12/05'    117       148     406.0
      60  '2020/12/06'    102       127     300.0
      60  '2020/12/07'    110       136     374.0
     450  '2020/12/08'    104       134     253.3
      30  '2020/12/09'    109       133     195.1
      60  '2020/12/10'     98       124     269.0
      60  '2020/12/11'    103       147     329.3
      60  '2020/12/12'    100       120     250.7
      60  '2020/12/12'    100       120     250.7
      60  '2020/12/13'    106       128     345.3
      60  '2020/12/14'    104       132     379.3
      60  '2020/12/15'     98       123     275.0
      60  '2020/12/16'     98       120     215.2
      60  '2020/12/17'    100       120     300.0
      45  '2020/12/18'     90       112       NaN
      60  '2020/12/19'    103       123     323.0
      45  '2020/12/20'     97       125     243.0
      60  '2020/12/21'    108       131     364.2
      45           NaN    100       119     282.0
      60  '2020/12/23'    130       101     300.0
      45  '2020/12/24'    105       132     246.0
      60  '2020/12/25'    102       126     334.5
      60    2020/12/26    100       120     250.0
      60  '2020/12/27'     92       118     241.0
      60  '2020/12/28'    103       132       NaN
      60  '2020/12/29'    100       132     280.0
      60  '2020/12/30'    102       129     380.3
      60  '2020/12/31'     92       115     243.0

My guess is that data.csv is written incorrectly?

CodePudding user response：

The data.csv file is written incorrectly. To fix it, separate the fields with commas.

Corrected format: data.csv

Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,20201226,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
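To see why the separator matters, here is a minimal sketch. It assumes the original file really was separated by spaces rather than commas, which the question doesn't confirm. The two in-memory excerpts (spaced_text and comma_text) are hypothetical stand-ins for data.csv.

```python
import io

import pandas as pd

# Hypothetical excerpt: fields separated by spaces, as in the question's listing.
spaced_text = (
    "Duration Date Pulse Maxpulse Calories\n"
    "45 '2020/12/18' 90 112 NaN\n"
    "60 '2020/12/19' 103 123 323.0\n"
)
# With the default comma separator every line is read as one string column,
# so "NaN" is only part of a longer string and dropna() removes nothing.
spaced = pd.read_csv(io.StringIO(spaced_text))
print(spaced.shape)
print(len(spaced.dropna()))

# Hypothetical excerpt of the comma-separated version; empty fields mark missing values.
comma_text = (
    "Duration,Date,Pulse,Maxpulse,Calories\n"
    "60,'2020/12/17',100,120,300.0\n"
    "45,'2020/12/18',90,112,\n"
    "45,,100,119,282.0\n"
    "60,'2020/12/19',103,123,323.0\n"
)
fixed = pd.read_csv(io.StringIO(comma_text))
cleaned = fixed.dropna()  # drops every row that has at least one missing value
print(fixed.shape)
print(len(cleaned))
```

If the file can't be edited, passing sep=r"\s+" to pd.read_csv is another option for whitespace-separated data, as long as no field contains spaces itself.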

CodePudding user response：

TL;DR: Try this:

new_data = df.fillna(pd.NA).dropna()

or, with NumPy:

import numpy as np
new_data = df.fillna(np.nan).dropna()

Is that the real CSV file? I don't think so.

The CSV document [1] (RFC 4180) doesn't specify how missing values are represented. In my experience, a missing value in a CSV file is an empty field, i.e. nothing between two separators (with a comma separator it looks like ,,).

According to the pandas docs [2], pandas.read_csv takes an argument "na_values":

na_values : scalar, str, list-like, or dict, optional

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

If your CSV file contains the text NaN, pandas can recognize it and read it as NaN, but you can pass this parameter to recognize other markers as needed.

Also, you can check the type of a single cell (where i is the row number and j is the column number):
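As a rough illustration of na_values, the sketch below assumes a file that marks missing values with the strings "missing" and "?". Both markers and the sample rows are made up for the example and don't come from the question's data.

```python
import io

import pandas as pd

# Hypothetical data: "missing" and "?" are custom markers, not pandas defaults.
csv_text = (
    "Duration,Pulse,Calories\n"
    "60,110,409.1\n"
    "45,90,missing\n"
    "60,103,?\n"
)

# Without na_values the markers stay ordinary strings, so dropna() keeps all rows.
default_df = pd.read_csv(io.StringIO(csv_text))
print(len(default_df.dropna()))

# With na_values the markers are parsed as NaN and dropna() can remove those rows.
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["missing", "?"])
print(custom_df["Calories"].isna().sum())
print(len(custom_df.dropna()))
```

The values in na_values are added to the default list rather than replacing it, so an empty field or the text NaN is still recognized.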

type(df.iloc[i,j])

Compare with:

type(np.nan) # NumPy NaN

float

type(pd.NA) # pandas NA

pandas._libs.missing.NAType
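The type check above can be combined with pd.isna to tell a real missing value from the text "NaN". This is only a sketch: the two small frames (df and text_df) are hypothetical and stand in for a correctly parsed file and a file whose values were read as strings.

```python
import io

import pandas as pd

# Hypothetical two-row file with one empty Calories field.
csv_text = "Duration,Calories\n60,300.0\n45,\n"
df = pd.read_csv(io.StringIO(csv_text))

cell = df.iloc[1, 1]  # row 1, column 1: the empty field
print(type(cell))
print(pd.isna(cell))  # a real missing value is detected here

# Count missing values per column to see what dropna() would act on.
print(df.isna().sum())

# A cell holding the text "NaN" is a str, and pandas does not treat it as missing.
text_df = pd.DataFrame({"Calories": ["300.0", "NaN"]})
text_cell = text_df.iloc[1, 0]
print(type(text_cell))
print(pd.isna(text_cell))
```

If df.isna().sum() reports zero everywhere while NaN still shows up in the printed frame, the values are most likely strings, which points back at how the file is being parsed.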

[1] https://datatracker.ietf.org/doc/html/rfc4180

[2] https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html