The issue
I have a csv file containing large integer values that I want to perform some arithmetic operations on, these fields might contain nan values, now when I load these into a df using the pandas to_csv method, when there are no nan values present, these are loaded as 'int' and the precision seems to be correct, but when nan values are present, these get converted to 'float' and I see a precision loss.
The sample csv file ->
,epoch_1,epoch_2
0,1665045912937687151,1665045912937689151
1,,
After loading ->
[1] df = pd.read_csv('sample.csv', index_col=0)
[2] df
epoch_1 epoch_2
0 1.665046e 18 1.665046e 18
1 NaN NaN
[3] df['diff'] = df['epoch_2'] - df['epoch_1']
[4] df
epoch_1 epoch_2 diff
0 1.665046e 18 1.665046e 18 2048.0
1 NaN NaN NaN
As you can see the 3rd column has an incorrect value, the correct value should be 2000.
If there are no nan values the result computed is correct.
What I've tried
I've tried specifying the dtype as Int64 while loading the data
[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': pd.Int64Dtype(), 'epoch_2': pd.Int64Dtype()})
[2] df
epoch_1 epoch_2
0 1665045912937687296 1665045912937689088
1 <NA> <NA>
[3] df['diff'] = df['epoch_2'] - df['epoch_1']
[4] df
epoch_1 epoch_2 diff
0 1665045912937687296 1665045912937689088 1792
1 <NA> <NA> <NA>
As you can see this also results in precision loss and inturn an incorrect result.
Workaround I don't want to use
What I can do is load the data as str, remove the NaN columns, and then convert these fields to 'int64' and calculate the result, this gives correct result :
[1] df = pd.read_csv('sample.csv', index_col=0, dtype={'epoch_1': str, 'epoch_2': str})
[2] df
epoch_1 epoch_2
0 1665045912937687151 1665045912937689151
1 NaN NaN
[3] df = df[~df['epoch_1'].isna()]
[4] df['diff'] = df['epoch_2'].astype(int) - df['epoch_1'].astype(int)
[5] df
epoch_1 epoch_2 diff
0 1665045912937687151 1665045912937689151 2000
But I need to keep the entries with nan values in the final df, so will have to add those entries back, this method spends a lot of computation just between doing conversions and will be a bottleneck when the size of df & number of fields to compute increase, it also isn't very elegant, so I'm looking for a better way to achieve this.
UPDATE
Another thing that seems to work :-
[1] df = pd.read_csv('sample.csv', index_col=0, dtype=str)
[2] df
epoch_1 epoch_2
0 1665045912937687151 1665045912937689151
1 NaN NaN
[3] df['diff'] = df['epoch_2'].astype('Int64') - df['epoch_1'].astype('Int64')
[4] df
epoch_1 epoch_2 diff
0 1665045912937687151 1665045912937689151 2000
1 NaN NaN <NA>
It seems better than dropping na values and adding them again, though this too requires type conversion before operations which I would like to avoid if possible.
This also raises another doubt that why does it lose precision when specifying the dtype for the columns as Int64 in read_csv, but works correctly when loaded as str and then converting to Int64, does read_csv internally load data as float64 and then converts it to the specified dtype?
CodePudding user response:
yes, it is unfortunate that pandas is not natively supporting its new extension dtypes (like the nullable integer arrays), yet. The work to be done is tracked in https://github.com/pandas-dev/pandas/issues/29752 .
The relevant update for pd.read_csv
has just landed in main
, ie refer to https://github.com/pandas-dev/pandas/pull/48776 and is scheduled for the next pandas release 1.6.0
. (EDIT: The new version due in December has been renamed to 2.0.0
, recently).
You can already test it with the nightly scipy wheels.
mamba create -n test_pandas -c conda-forge python pandas pip
mamba activate test_pandas
pip install --pre --upgrade --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple pandas
In [5]: pd.__version__
Out[5]: '1.6.0.dev0 350.g2f7dce4e6e'
In [6]: pd.read_csv("sample.csv", use_nullable_dtypes=True, index_col=0).assign(diff=lambda df: df.epoch_2 - df.epoch_1)
Out[6]:
epoch_1 epoch_2 diff
0 1665045912937687151 1665045912937689151 2000
1 <NA> <NA> <NA>
CodePudding user response:
Very interesting, and also very strange. What I came up with is a turnaround that preserves NaN
values
def diff(x,y):
if math.isnan(float(x)) or math.isnan(float(y)):
return np.nan
else:
z = np.int64(y)- np.int64(x)
return z
df['diff'] = df.apply(lambda x: diff(x['epoch_1'],x['epoch_2']), axis=1)
CodePudding user response:
Interesting that df = pd.read_csv('./file.csv', dtype='Int64')
does not work in this case. It is an experimental feature and seems to break here. There seems to be a lot of work ongoing regarding pd.NA
and np.nan
(e.g. here), so it is likely a bug.
Notice that t = pd.array([1665045912937689151, np.nan], dtype='Int64')
fails as well, as it ends up with [1665045912937689088, <NA>]
. The problem seems to be the difference between np.nan
and pd.NA
as s = pd.array([1665045912937689151, pd.NA], dtype='Int64')
yields the correct [1665045912937689151, <NA>]
. Probably you will have to wait till np.nan
is switched to pd.NA
in pd.read_csv
.
CodePudding user response:
It's a display problem, see docs. Below solution solves both of your problem:
import pandas
df = pandas.read_csv('sample.csv', index_col=0)
print(df) # print with precision 10 by default
print("-----------------------------------------")
with pandas.option_context('display.precision', 20):
print(df)
Output:
epoch_1 epoch_2
0 1.665046e 18 1.665046e 18
1 NaN NaN
-----------------------------------------
epoch_1 epoch_2
0 1665045912937687040.0 1665045912937689088.0
1 NaN NaN