How to understand float('nan')? What is the use of float?

I am learning about missing values in Python and came across an article that mentioned NaN.

The data has five columns (name, age, state, point, and other) and six rows. I don't know how to attach the data file here, sorry.

     name   age state  point  other
    Alice  24.0    NY    NaN    NaN
      NaN   NaN   NaN    NaN    NaN
  Charlie   NaN    CA    NaN    NaN
     Dave  68.0    TX   70.0    NaN
    Ellen   NaN    CA   88.0    NaN
    Frank  30.0   NaN    NaN    NaN

Here are the two lines of code in the article.

print(df == float('nan'))
print(df = float('nan'))

It says that NaN always returns False for ==, and True for !=. What is this code about? How should I understand "float"? Could you please explain this to me?

Thank you very much.

Best,

Sagum

CodePudding user response：

A float is a 64-bit* IEEE-754 Floating-Point Number.

Most values that a float can represent can be written as literals (e.g. 1.5 or 52.0). However, Infinity (inf) and "Not-a-Number" (NaN), which the IEEE-754 standard requires, have no dedicated literal, so the usual way to get them is to convert the strings 'inf' and 'nan' into floats. That's what float('nan') does: it creates a float whose value is NaN.
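As a quick illustration of the conversion described above, here is a minimal sketch that builds the special values from strings. The variable names are only for illustration, and math.nan and math.inf are standard-library constants the answer does not mention; they hold the same values.

```python
import math

# Special float values built from strings, as described above.
nan_value = float('nan')
pos_inf = float('inf')
neg_inf = float('-inf')

# They are ordinary float objects.
print(type(nan_value))        # <class 'float'>

# Infinity compares greater (or smaller) than any finite float.
print(pos_inf > 1e308)        # True
print(neg_inf < -1e308)       # True

# The math module also exposes these values as constants (Python 3.5+).
print(math.isnan(math.nan))   # True
print(math.inf == pos_inf)    # True
```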

The article you read is misleading. According to the IEEE-754 standard, the value NaN works differently from any other float value - it will always return false when used in any comparison with another number, unless that comparison is !=, in which case it will always return true. This is even the case when comparing it to itself:

>>> var = float('nan')
>>> var != var
True

As such, using NaN in comparisons is not very useful.
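To make the rule concrete, the sketch below evaluates each comparison operator against NaN. The dictionary and its labels are just a convenient way to print the results, not something from the article.

```python
nan = float('nan')

# Every equality or ordering comparison involving NaN is False...
results = {
    'nan == nan': nan == nan,
    'nan < 1.0': nan < 1.0,
    'nan > 1.0': nan > 1.0,
    'nan <= nan': nan <= nan,
    'nan >= nan': nan >= nan,
    # ...except !=, which is always True.
    'nan != nan': nan != nan,
    'nan != 1.0': nan != 1.0,
}

for expression, value in results.items():
    print(f'{expression}: {value}')
```

Only the two != lines print True; all the others print False.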

Instead, to check if a variable contains the value NaN, do

import math
...
if math.isnan(some_variable):
    ...

or just

if some_variable != some_variable:
    ...

since that will always be false if some_variable is not NaN, and will always be true if it is NaN.
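Here is a small sketch of both checks applied to a plain Python list. The values are taken from the age column in the question, and the variable names are made up for the example.

```python
import math

# Ages from the question's table; NaN marks a missing value.
ages = [24.0, float('nan'), float('nan'), 68.0, float('nan'), 30.0]

# math.isnan is the explicit check.
missing_positions = [i for i, age in enumerate(ages) if math.isnan(age)]
print(missing_positions)   # [1, 2, 4]

# A value differs from itself only when it is NaN, so this finds the same spots.
same_positions = [i for i, age in enumerate(ages) if age != age]
print(same_positions == missing_positions)   # True

# Keep only the values that are present.
present = [age for age in ages if not math.isnan(age)]
print(present)             # [24.0, 68.0, 30.0]
```

math.isnan is usually the more readable choice, since the self-comparison trick can look like a mistake to someone who does not know the NaN rule.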

The article is probably guiding you to check whether the values in a column of the dataframe are NaN and use different behavior accordingly.

*In Python, at least. In many other languages (Java and C#, for example), float is 32 bits and double is the 64-bit equivalent.

CodePudding user response：

NaN (Not a Number) is classically a way to identify missing values. It is a float object and can be defined using float('nan'), the same way that 1.23 can be defined using float('1.23'). You can also define infinity with float('inf').

Some modules directly define a NaN object, which is the case of numpy:

from numpy import nan

If you are working with integers, strings, etc. in pandas, you can also use the type-aware NA with pandas.NA/pd.NA.

Regarding your question, it is difficult to answer without reading the article, but what you described is incorrect.

print(df == float('nan')) will print a DataFrame in which every cell is False, as float('nan') == float('nan') is False.

print(df = float('nan')) will raise a TypeError, because Python reads df = ... as a keyword argument and print() does not accept a keyword named df.

The correct way to check for NaN values is isna:

df.isna()

which returns one True or False per cell.

To get an aggregated value, you need to combine it with, for example, any:

# is there at least one NaN in the whole DataFrame?
df.isna().any().any()
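Putting this together, here is a rough sketch that rebuilds the table from the question by hand and compares the two approaches. It assumes pandas is installed; how the DataFrame is constructed is my own guess, since the original data file was not attached.

```python
import pandas as pd

nan = float('nan')

# The table from the question, rebuilt by hand.
df = pd.DataFrame({
    'name': ['Alice', nan, 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24.0, nan, nan, 68.0, nan, 30.0],
    'state': ['NY', nan, 'CA', 'TX', 'CA', nan],
    'point': [nan, nan, nan, 70.0, 88.0, nan],
    'other': [nan, nan, nan, nan, nan, nan],
})

# Comparing with == never matches, because NaN is not equal to anything.
equal_mask = df == nan
print(equal_mask.any().any())   # False

# isna() flags the missing cells correctly.
missing = df.isna()
print(missing.sum())            # number of NaN cells per column
print(missing.any().any())      # True

# Rows where every column is missing (only the second row here).
print(missing.all(axis=1).tolist())
```

The per-column counts come out as 1, 3, 2, 4 and 6 for name, age, state, point and other.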