I am learning about missing values in Python and came across an article that mentioned NaN.
This is the data, with five columns and six rows (I don't know how to attach the data file here, sorry). The five columns are name, age, state, point, and other.
name     age   state  point  other
Alice    24.0  NY     NaN    NaN
NaN      NaN   NaN    NaN    NaN
Charlie  NaN   CA     NaN    NaN
Dave     68.0  TX     70.0   NaN
Ellen    NaN   CA     88.0   NaN
Frank    30.0  NaN    NaN    NaN
Here are the two lines of code in the article.
print(df == float('nan'))
print(df = float('nan'))
It says that NaN always returns False for ==, and True for !=. What is this code about? How should I understand "float"? Could you please explain this to me?
Thank you very much.
Best,
Sagum
CodePudding user response:
A float is a 64-bit* IEEE-754 floating-point number.
Most values that can be represented by floats can be written as literals (e.g. 1.5 or 52.0). However, the values for infinity (inf) and "Not-a-Number" (NaN), which the IEEE-754 standard requires, have no literal syntax and can only be reliably created by converting those strings into floats. That's what float('nan') does: it creates a float whose value is NaN.
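As a quick sketch of the point above (the variable names are only for illustration), the special values can be built from strings and inspected with the standard math module. Since Python 3.5 the same values are also available as the constants math.nan and math.inf.

```python
import math

# float() parses the strings 'nan' and 'inf' into the special IEEE-754 values
not_a_number = float('nan')
infinity = float('inf')

print(type(not_a_number))        # <class 'float'>
print(math.isnan(not_a_number))  # True
print(math.isinf(infinity))      # True
print(infinity > 1e308)          # True: inf is larger than any finite float

# The math module exposes the same values as constants (Python 3.5+)
print(math.isnan(math.nan), math.inf == infinity)  # True True
```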
The article you read is misleading. According to the IEEE-754 standard, the value NaN works differently from any other float value: any comparison involving it evaluates to False, unless that comparison is !=, in which case it always evaluates to True. This is even the case when comparing it to itself:
>>> var = float('nan')
>>> var != var
True
As such, using NaN in comparisons is not very useful.
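To make the rule concrete, here is a small sketch that tries each comparison operator on NaN (the list name comparisons is just for illustration):

```python
nan = float('nan')

# Every equality and ordering comparison involving NaN is False...
comparisons = [
    nan == nan, nan == 1.0,
    nan < 1.0, nan > 1.0,
    nan <= nan, nan >= nan,
]
print(any(comparisons))  # False

# ...except !=, which is always True, even against itself
print(nan != nan, nan != 1.0)  # True True
```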
Instead, to check if a variable contains the value NaN, do
import math
...
if math.isnan(some_variable):
...
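or just compare the variable with itself:
if some_variable == some_variable:
...
since that comparison is always true if some_variable is not NaN, and always false if it is NaN (so this branch runs only for non-NaN values; use != instead to detect NaN itself).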
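The two checks can be wrapped in small helpers to compare them side by side. This is only a sketch: is_nan and is_nan_by_self_comparison are hypothetical names, not part of any library.

```python
import math

def is_nan(value):
    """Return True only when value is a float NaN (hypothetical helper)."""
    # math.isnan() raises TypeError for non-numeric input such as strings,
    # so guard with an isinstance check first.
    return isinstance(value, float) and math.isnan(value)

def is_nan_by_self_comparison(value):
    """Same idea, relying on the fact that NaN is not equal to itself."""
    return value != value

for sample in (float('nan'), 1.5, float('inf')):
    print(sample, is_nan(sample), is_nan_by_self_comparison(sample))
```

The math.isnan version states the intent more clearly, while the self-comparison trick works without an import.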
The article is probably guiding you to check whether the values in a column of the dataframe are NaN and use different behavior accordingly.
*In Python, at least. In most languages float is 32 bits and double is the 64-bit equivalent.
CodePudding user response:
NaN (Not a Number) is classically a way to identify missing values. It is a float type object and can be defined using float('nan'), the same way that 1.23 can be defined using float('1.23'). You can also define infinity with float('inf').
Some modules directly define a NaN object, which is the case of numpy:
from numpy import nan
If you are working with integers, strings, etc. in pandas, you can also use the type-aware NA with pandas.NA/pd.NA.
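A minimal sketch of both options, assuming numpy is installed along with a pandas version that provides pd.NA (1.0 or later). The ages values are made up for illustration.

```python
import numpy as np
import pandas as pd

# numpy's nan is an ordinary Python float holding the NaN value
print(isinstance(np.nan, float))  # True
print(np.nan == np.nan)           # False

# With the nullable "Int64" dtype, missing integers are stored as pd.NA
# instead of forcing the whole column to float
ages = pd.Series([24, None, 68], dtype="Int64")
print(ages.dtype)                 # Int64
print(ages.isna().tolist())       # [False, True, False]
print(ages.iloc[1] is pd.NA)      # True
```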
Regarding your question, it is difficult to answer without reading the article, but what you described is incorrect.
print(df == float('nan'))
will always print a DataFrame filled with False, as float('nan')==float('nan') is False.
print(df = float('nan'))
will raise an error: in Python 3 it is a TypeError, because df= is parsed as a keyword argument to print.
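Both statements can be checked on a small frame modelled on part of the question's data. This is a sketch assuming Python 3 and pandas; there, the single-= call fails at run time with a TypeError, since df= is treated as a keyword argument to print.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "name": ["Alice", nan, "Charlie"],
    "age": [24.0, nan, nan],
})

# Element-wise == against NaN is False everywhere, even where the cell is NaN
equal_to_nan = df == nan
print(equal_to_nan)
print(equal_to_nan.any().any())        # False

# Conversely, != against NaN is True everywhere
print((df != nan).all().all())         # True

# With a single "=", Python treats df as a keyword argument to print()
raised_type_error = False
try:
    print(df=nan)
except TypeError as error:
    raised_type_error = True
    print("TypeError:", error)
```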
The correct way to check for NaN values is isna:
df.isna()
which returns one True or False per cell.
To get an aggregated value, you need to combine it with, for example, any:
# is there at least one NaN in the whole DataFrame?
df.isna().any().any()
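Applied to the data from the question, this could look like the sketch below. The frame is rebuilt by hand from the table shown there, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The six rows from the question, with NaN marking the missing cells
df = pd.DataFrame({
    "name": ["Alice", np.nan, "Charlie", "Dave", "Ellen", "Frank"],
    "age": [24.0, np.nan, np.nan, 68.0, np.nan, 30.0],
    "state": ["NY", np.nan, "CA", "TX", "CA", np.nan],
    "point": [np.nan, np.nan, np.nan, 70.0, 88.0, np.nan],
    "other": [np.nan] * 6,
})

missing = df.isna()                       # one True/False per cell

# Is there at least one NaN in the whole DataFrame?
print(missing.any().any())                # True

# How many NaN values does each column contain?
missing_per_column = missing.sum()
print(missing_per_column)

# Which rows are missing in every column?
fully_missing_rows = missing.all(axis=1)
print(df[~fully_missing_rows])            # drops the all-NaN second row
```

Summing the boolean frame works because True counts as 1, which gives a per-column tally of missing cells.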