Method to manage "NAN" (in capital letters) with Pandas?

Do you know if there is a way to handle "NAN" written in all capital letters in a data file with Pandas?

I have some data files in this format:

"2020-08-14 14:00:00",10,154.9554,153.6879,154.3988,158.5282,"NAN","NAN",158.43,"NAN",155.2103

The .isnull() and .isna() functions don't detect "NAN" when it is capitalized, but they do when it is written as "NaN" or "nan".

Thank you in advance. I looked at other topics but found nothing on this specific subject.

CodePudding user response：

isnull and isna do NOT return True for strings, no matter the case.

Most likely you have a mix of real NaN and of strings:

import pandas as pd
s = pd.Series([float('nan'), 'NAN', 'nan', 'NaN'])  # one real NaN, three strings
df = pd.concat({'s': s, 'isnull': s.isnull(), 'isna': s.isna()}, axis=1)
print(df)

output:

     s  isnull   isna
0  NaN    True   True
1  NAN   False  False
2  nan   False  False
3  NaN   False  False

Now, by default, read_csv recognizes the following strings as NaN:

'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
'-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A',
'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'

You can add "NAN" with the na_values option:

df = pd.read_csv(..., na_values=['NAN'])
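As a minimal sketch of that option, the row from the question can be parsed from an in-memory buffer. The use of io.StringIO and header=None here are assumptions made for illustration (the question does not say whether the files have a header line); with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Sample row copied from the question; assumed to have no header line.
raw = '"2020-08-14 14:00:00",10,154.9554,153.6879,154.3988,158.5282,"NAN","NAN",158.43,"NAN",155.2103\n'

# na_values adds "NAN" to the strings that read_csv already treats as missing.
df = pd.read_csv(io.StringIO(raw), header=None, na_values=["NAN"])

# Count the cells that were parsed as missing values.
na_count = int(df.isna().sum().sum())
print(na_count)
```

After loading this way, isna() and isnull() report the former "NAN" cells as missing, and the affected columns are numeric rather than strings.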

CodePudding user response：

You might simply use .replace as follows:

import pandas as pd
df = pd.DataFrame({"x":[1.5,2.5,"NAN",3.5,4.5,"NAN",6.5,7.5]})
print(df.x.mean())  # TypeError: unsupported operand type(s) for +: 'float' and 'str'
df.replace("NAN",float("nan"),inplace=True)
print(df.x.mean())  # 4.333333333333333

Or, if you want a new pandas.DataFrame with the "NAN" strings replaced by "true" NaNs:

df2 = df.replace("NAN",float("nan"))

CodePudding user response：

Try .replace() to swap in real NaN values, and assign the result back so you can then use them as proper NaN:

import numpy as np
df["column"] = df["column"].replace({"NAN": np.nan})  # replace returns a new Series
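One thing to watch with the .replace approach: depending on the pandas version, the column may keep the object dtype even after the strings are gone. A small sketch that makes the conversion explicit with astype(float) follows; the DataFrame and its "column" name are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing floats with the "NAN" string.
df = pd.DataFrame({"column": [1.5, "NAN", 3.5]})

# Replace the string with a real NaN, then force a float dtype explicitly.
df["column"] = df["column"].replace({"NAN": np.nan}).astype(float)

print(df["column"].dtype)
print(df["column"].isna().tolist())
print(df["column"].mean())
```

Note that astype(float) raises an error if any other non-numeric string remains in the column, which can be a useful signal that the data needs more cleaning.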