Why pandas prefers float64 over Int64?

Time:01-17

I have a pandas DataFrame with a column that has 3 unique values: [0, None, 1]. When I run this line:

test_data = test_data.apply(pd.to_numeric, errors='ignore')

the data type of that column is converted to float64.

Why not int64? Technically, an integer type can handle None values, so I'm confused about why it didn't pick int64.

Thanks for help,

Edit: Having read about the difference between int64 and Int64, why doesn't pandas choose Int64 then?

CodePudding user response:

The question is really about pandas, not Python itself. By default, pd.to_numeric returns either int64 or float64, depending on the data. It only looks for the smallest numerical dtype when you pass the optional downcast argument, as described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

downcast: str, default None. Can be ‘integer’, ‘signed’, ‘unsigned’, or ‘float’. If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:

‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)

‘unsigned’: smallest unsigned int dtype (min.: np.uint8)

‘float’: smallest float dtype (min.: np.float32)

As this behaviour is separate from the core conversion to numeric values, any errors raised during the downcasting will be surfaced regardless of the value of the ‘errors’ input.

In addition, downcasting will only occur if the size of the resulting data’s dtype is strictly larger than the dtype it is to be cast to, so if none of the dtypes checked satisfy that specification, no downcasting will be performed on the data.
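As a rough illustration of the quoted rules, the sketch below calls pd.to_numeric with and without downcast. The series names (ints, floats) are made up for this example and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.0])

# Without downcast, the 64-bit dtype is kept.
default_dtype = pd.to_numeric(ints).dtype

# With downcast, pandas picks the smallest dtype that can hold the values.
small_int_dtype = pd.to_numeric(ints, downcast="integer").dtype
small_float_dtype = pd.to_numeric(floats, downcast="float").dtype

print(default_dtype, small_int_dtype, small_float_dtype)
# int64 int8 float32
```

So downcasting only shrinks the dtype when asked to. It is not what turns a column with missing values into float64.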

Take this code:

import pandas as pd
df = pd.DataFrame(
    {
        'int':[1,2],
        'float':[1.0,2.0],
        'none': [1, None]
    }
)

You might notice that:

print(df.loc[1, 'none'])
# nan

This prints nan, not None, because pandas builds on the NumPy library. NumPy offers a special value that is treated like a number but isn't actually one, called "not a number" (NaN), and this value is of type float.

import numpy as np
print(type(np.nan))
# <class 'float'>
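To see why a plain integer array cannot carry that value, here is a small NumPy-only sketch. The variable names are illustrative and not from the question.

```python
import numpy as np

# Mixing integers with NaN forces a float array.
mixed = np.array([0, np.nan, 1])
print(mixed.dtype)
# float64

# An int64 array has no slot for NaN, so the assignment fails.
ints = np.array([0, 1], dtype=np.int64)
try:
    ints[0] = np.nan
    assigned = True
except ValueError as exc:
    assigned = False
    print("ValueError:", exc)
```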

Because the column contains None, pandas has to represent a missing value. The default NumPy-backed int64 dtype has no way to store one, so pandas casts the column to float64, where it can insert np.nan for the missing values.
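As for the edit about Int64: the nullable Int64 extension dtype is opt-in, so pandas does not choose it during default inference. A minimal sketch of requesting it explicitly, assuming a reasonably recent pandas version (the column name col is made up here):

```python
import pandas as pd

s = pd.Series([0, None, 1])
print(s.dtype)
# float64

# Ask for the nullable integer dtype explicitly.
nullable = s.astype("Int64")
print(nullable.dtype)
# Int64
print(nullable.isna().tolist())
# [False, True, False]

# convert_dtypes() makes the same choice for a whole DataFrame.
converted = pd.DataFrame({"col": [0, None, 1]}).convert_dtypes()
print(converted["col"].dtype)
# Int64
```

With Int64 the missing entry is stored as pd.NA, and the remaining values stay integers instead of becoming 0.0 and 1.0.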
