This question follows up on this one: Problem in Pandas : impossible to do sum of int with arbitrary precision
I used this simple answer, df["my_int"].apply(int).sum()
(https://stackoverflow.com/a/74903334/7462275), but it does not work in all cases.
For example, with this file
my_int
9220426963983292163
5657924282683240
The output is -9220659185443576213.
After looking at the apply(int) output, I understand the problem. In this case, apply(int) returns dtype: int64:
0 9220426963983292163
1 5657924282683240
Name: my_int, dtype: int64
But with large numbers, it returns dtype: object:
0 1111111111111111111111111111111111111111111111...
1 2222222222222222222222222222222222222222222222...
Name: my_int, dtype: object
Is it possible to solve this with pandas? Or should I follow Tim Roberts' answer (https://stackoverflow.com/a/74903381/7462275)?
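For reference, a minimal sketch of the problem (the two values are hard-coded here instead of being read from the file):

```python
import pandas as pd

# Both values fit in 64 bits, so apply(int) yields an int64 Series
# and the int64 sum silently wraps around.
df = pd.DataFrame({"my_int": ["9220426963983292163", "5657924282683240"]})

s = df["my_int"].apply(int)
print(s.dtype)                             # int64
print(s.sum())                             # -9220659185443576213 (overflow)
print(sum(int(x) for x in df["my_int"]))   # 9226084888265975403 (exact)
```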
edit1
An awful workaround: a line containing a large integer is appended to the end of the file
my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111
Then the sum is computed over all lines except the last one:
data['my_int'].apply(int).iloc[:-1].sum()
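A runnable sketch of this workaround (the sentinel value is hard-coded; reading the file is omitted):

```python
import pandas as pd

# The appended huge sentinel forces apply(int) to produce an object-dtype
# Series of Python ints; the last row is then dropped before summing.
data = pd.DataFrame({"my_int": ["9220426963983292163",
                                "5657924282683240",
                                "1" * 74]})

total = data["my_int"].apply(int).iloc[:-1].sum()
print(total)   # 9226084888265975403 (exact, no int64 overflow)
```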
CodePudding user response:
Solution using Pandas:
sum(data['my_int'].apply(int).to_list())
Why do I say so?
df1:
my_int
9220426963983292163
5657924282683240
df2:
my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111
Let S1 and S2 denote the sums of the elements in the column my_int in df1 and df2, respectively:
S1 = 9226084888265975403
S2 = 11111111111111111111111111111111111111111111111111111120337195999377086514
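These two values can be checked directly with Python's arbitrary-precision integers:

```python
# Check S1 and S2 with plain Python ints (arbitrary precision):
a = 9220426963983292163
b = 5657924282683240
big = int("1" * 74)   # the 74-digit number added in df2

s1 = a + b
s2 = a + b + big
print(s1)   # 9226084888265975403
print(s2)   # 11111111111111111111111111111111111111111111111111111120337195999377086514
```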
If we check the documentation of NumPy on Overflow errors, we see that NumPy offers limited precision:
>>> np.iinfo(int)
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
The max number representable is smaller than both S1 and S2.
All three calls below run without raising any error, but the first two concatenate strings rather than adding numbers (the column holds str values, since the 74-digit number cannot be parsed as int64, and + on str concatenates); only the last one gives the correct sum S2:
>>> df2['my_int'].sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111111111111
>>>
>>> df2['my_int'].astype(object).sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111111111111
>>>
>>> sum(df2['my_int'].apply(int).to_list())
11111111111111111111111111111111111111111111111111111120337195999377086514
EDIT: Prefer sum over np.sum:
>>> np.sum(df1['my_int'].apply(int).to_list())
-9220659185443576213
>>> sum(df1['my_int'].apply(int).to_list())
9226084888265975403
The sums of the column my_int were verified with WolframAlpha: df1, df2
CodePudding user response:
Solution:
df["my_int"].apply(int).astype(object).sum()
apply(int): parses each value as a Python int, to avoid string concatenation with large numbers.
astype(object): converts int64 to object (Python ints), so the sum is computed with arbitrary precision instead of overflowing.
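A self-contained sketch of this solution end to end (StringIO stands in for the real CSV file):

```python
import pandas as pd
from io import StringIO

# Hypothetical file content matching the question's examples:
csv = StringIO("my_int\n9220426963983292163\n5657924282683240\n" + "1" * 74)
df = pd.read_csv(csv, dtype=str)   # keep the digits as text: nothing overflows

# apply(int) parses Python ints; astype(object) keeps them as Python objects,
# so .sum() uses arbitrary-precision integer addition.
total = df["my_int"].apply(int).astype(object).sum()
print(total)   # exact sum of the three integers
```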