Pandas loss of precision when loading data with nan values-CodePudding

The issue

I have a csv file containing large integer values that I want to perform some arithmetic operations on, these fields might contain nan values, now when I load these into a df using the pandas to_csv method, when there are no nan values present, these are loaded as 'int' and the precision seems to be correct, but when nan values are present, these get converted to 'float' and I see a precision loss.

The sample csv file ->

,epoch_1,epoch_2
0,1665045912937687151,1665045912937689151
1,,

After loading ->

[1] df = pd.read_csv('sample.csv', index_col=0)

[2] df
        epoch_1       epoch_2
0  1.665046e 18  1.665046e 18
1           NaN           NaN

[3] df['diff'] = df['epoch_2'] - df['epoch_1']

[4] df
        epoch_1       epoch_2    diff
0  1.665046e 18  1.665046e 18  2048.0
1           NaN           NaN     NaN

As you can see the 3rd column has an incorrect value, the correct value should be 2000.

If there are no nan values the result computed is correct.

What I've tried

I've tried specifying the dtype as Int64 while loading the data

[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': pd.Int64Dtype(), 'epoch_2': pd.Int64Dtype()})

[2] df
               epoch_1              epoch_2
0  1665045912937687296  1665045912937689088
1                 <NA>                 <NA>

[3] df['diff'] = df['epoch_2'] - df['epoch_1']

[4] df
               epoch_1              epoch_2  diff
0  1665045912937687296  1665045912937689088  1792
1                 <NA>                 <NA>  <NA>

As you can see this also results in precision loss and inturn an incorrect result.

Workaround I don't want to use

What I can do is load the data as str, remove the NaN columns, and then convert these fields to 'int64' and calculate the result, this gives correct result :

[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': str, 'epoch_2': str})

[2] df
               epoch_1              epoch_2
0  1665045912937687151  1665045912937689151
1                  NaN                  NaN

[3] df = df[~df['epoch_1'].isna()]

[4] df['diff'] = df['epoch_2'].astype(int) - df['epoch_1'].astype(int)

[5] df
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000

But I need to keep the entries with nan values in the final df, so will have to add those entries back, this method spends a lot of computation just between doing conversions and will be a bottleneck when the size of df & number of fields to compute increase, it also isn't very elegant, so I'm looking for a better way to achieve this.

UPDATE

Another thing that seems to work :-

[1] df = pd.read_csv('sample.csv', index_col=0, dtype=str)

[2] df
               epoch_1              epoch_2
0  1665045912937687151  1665045912937689151
1                  NaN                  NaN

[3] df['diff'] = df['epoch_2'].astype('Int64') - df['epoch_1'].astype('Int64')

[4] df
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000
1                  NaN                  NaN  <NA>

It seems better than dropping na values and adding them again, though this too requires type conversion before operations which I would like to avoid if possible.

This also raises another doubt that why does it lose precision when specifying the dtype for the columns as Int64 in read_csv, but works correctly when loaded as str and then converting to Int64, does read_csv internally load data as float64 and then converts it to the specified dtype?

CodePudding user response：

yes, it is unfortunate that pandas is not natively supporting its new extension dtypes (like the nullable integer arrays), yet. The work to be done is tracked in https://github.com/pandas-dev/pandas/issues/29752 .

The relevant update for pd.read_csv has just landed in main, ie refer to https://github.com/pandas-dev/pandas/pull/48776 and is scheduled for the next pandas release 1.6.0. (EDIT: The new version due in December has been renamed to 2.0.0, recently).

You can already test it with the nightly scipy wheels.

mamba create -n test_pandas -c conda-forge python pandas pip
mamba activate test_pandas
pip install --pre --upgrade --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas

In [5]: pd.__version__
Out[5]: '1.6.0.dev0 350.g2f7dce4e6e'

In [6]: pd.read_csv("sample.csv", use_nullable_dtypes=True, index_col=0).assign(diff=lambda df: df.epoch_2 - df.epoch_1)
Out[6]:
               epoch_1              epoch_2  diff
0  1665045912937687151  1665045912937689151  2000
1                 <NA>                 <NA>  <NA>

CodePudding user response：

Very interesting, and also very strange. What I came up with is a turnaround that preserves NaN values

def diff(x,y):
    if math.isnan(float(x)) or math.isnan(float(y)):
        return np.nan
    else:
        z = np.int64(y)- np.int64(x)
        return z

df['diff'] = df.apply(lambda x: diff(x['epoch_1'],x['epoch_2']), axis=1)

CodePudding user response：

Interesting that df = pd.read_csv('./file.csv', dtype='Int64') does not work in this case. It is an experimental feature and seems to break here. There seems to be a lot of work ongoing regarding pd.NA and np.nan (e.g. here), so it is likely a bug.

Notice that t = pd.array([1665045912937689151, np.nan], dtype='Int64') fails as well, as it ends up with [1665045912937689088, <NA>]. The problem seems to be the difference between np.nan and pd.NA as s = pd.array([1665045912937689151, pd.NA], dtype='Int64') yields the correct [1665045912937689151, <NA>]. Probably you will have to wait till np.nan is switched to pd.NA in pd.read_csv.

CodePudding user response：

It's a display problem, see docs. Below solution solves both of your problem:

import pandas
df = pandas.read_csv('sample.csv', index_col=0)

print(df) # print with precision 10 by default
print("-----------------------------------------")
with pandas.option_context('display.precision', 20):
     print(df)

Output:

        epoch_1       epoch_2
0  1.665046e 18  1.665046e 18
1           NaN           NaN
-----------------------------------------
                 epoch_1                epoch_2
0  1665045912937687040.0  1665045912937689088.0
1                    NaN                    NaN