Problem with apply(int) to convert string to int in pandas

Time:12-24

This question follows this one: Problem in Pandas: impossible to do sum of int with arbitrary precision

I used the simple answer df["my_int"].apply(int).sum() (https://stackoverflow.com/a/74903334/7462275)

But it does not work in all cases.

For example, with this file

my_int
9220426963983292163
5657924282683240

The output is -9220659185443576213, which is wrong.

After looking at the apply(int) output, I understand the problem. In this case, apply(int) returns a Series with dtype: int64, so the sum overflows.

0    9220426963983292163
1       5657924282683240
Name: my_int, dtype: int64

But with larger numbers, it returns dtype: object

0    1111111111111111111111111111111111111111111111...
1    2222222222222222222222222222222222222222222222...
Name: my_int, dtype: object
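The dtype switch described above can be reproduced directly. A minimal sketch (the first two values come from the question; the long strings are made-up stand-ins for the second case):

```python
import pandas as pd

# Values from the question: each fits in 64 bits
small = pd.Series(["9220426963983292163", "5657924282683240"], name="my_int")
# Made-up stand-ins: each exceeds the int64 range
big = pd.Series(["1" * 75, "2" * 75], name="my_int")

# Every value fits in 64 bits, so pandas stores the result as int64
print(small.apply(int).dtype)  # int64
# These values exceed the int64 range, so they stay Python ints (object)
print(big.apply(int).dtype)    # object
```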

Is it possible to solve this with pandas? Or should I follow Tim Roberts' answer (https://stackoverflow.com/a/74903381/7462275)?

Edit 1

An awful workaround: a line containing a very large integer is appended to the end of the file

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Then the sum is computed over all lines except the last one:

data['my_int'].apply(int).iloc[:-1].sum()
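The workaround works because the oversized sentinel row forces apply(int) to return object dtype, so every value stays an arbitrary-precision Python int and the partial sum over the real rows is exact. A sketch (io.StringIO stands in for the actual file):

```python
import io
import pandas as pd

# io.StringIO stands in for the real CSV file from the question.
csv = """my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111
"""
data = pd.read_csv(io.StringIO(csv), dtype=str)

# The sentinel row forces object dtype (Python ints), so the sum
# over all rows except the last one does not overflow int64.
total = data["my_int"].apply(int).iloc[:-1].sum()
print(total)  # 9226084888265975403
```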

CodePudding user response:

Solution using Pandas:

sum(data['my_int'].apply(int).to_list())

Why does this work?

df1:

my_int
9220426963983292163
5657924282683240

df2:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Let S1 and S2 denote the sum of elements in the column my_int in df1 and df2, respectively:

S1 = 9226084888265975403
S2 = 11111111111111111111111111111111111111111111111111111120337195999377086514

If we check the documentation of NumPy on Overflow errors, we see that NumPy offers limited precision:

>>> np.iinfo(int)
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

The maximum representable int64 is smaller than both S1 and S2, so an int64 sum overflows.
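This also explains the negative result from the question: summing the two int64 values wraps around modulo 2**64. A quick check:

```python
import numpy as np

# The int64 sum wraps around modulo 2**64, reproducing the
# negative result from the question.
arr = np.array([9220426963983292163, 5657924282683240], dtype=np.int64)
print(arr.sum())  # -9220659185443576213
```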

Note that summing the column directly only concatenates the strings (the values are read as str/object), whereas the proposed solution gives the correct sum:

>>> df2['my_int'].sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>> 
>>> df2['my_int'].astype(object).sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>>
>>> sum(df2['my_int'].apply(int).to_list())
11111111111111111111111111111111111111111111120337195999377086514

EDIT: Prefer sum over np.sum:

>>> np.sum(df1['my_int'].apply(int).to_list())
-9220659185443576213
>>> sum(df1['my_int'].apply(int).to_list())
9226084888265975403
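The difference is that np.sum first converts the list to an int64 array, so the result wraps, while the builtin sum adds Python ints with arbitrary precision. A minimal illustration:

```python
import numpy as np

values = [9220426963983292163, 5657924282683240]

# np.sum converts the list to an int64 array first, so it wraps...
print(np.sum(values))  # -9220659185443576213
# ...while the builtin sum keeps arbitrary-precision Python ints.
print(sum(values))     # 9226084888265975403
```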

Source of the calculation for sum of elements in the column my_int is WolframAlpha: df1, df2

CodePudding user response:

Solution:

df["my_int"].apply(int).astype(object).sum()

apply(int): converts each string to an arbitrary-precision Python int, avoiding string concatenation with large numbers.

astype(object): keeps the values as Python objects, so the sum does not overflow int64.
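A minimal end-to-end sketch of this answer (io.StringIO stands in for the actual file from the question):

```python
import io
import pandas as pd

# io.StringIO stands in for the real CSV file.
csv = """my_int
9220426963983292163
5657924282683240
"""
df = pd.read_csv(io.StringIO(csv), dtype=str)

# An int64 sum would overflow; casting to object keeps Python ints,
# so the addition is exact.
total = df["my_int"].apply(int).astype(object).sum()
print(total)  # 9226084888265975403
```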
