I tried to sum large integers in pandas and the result is not what I expected.
Input file: my_file_lg_int
my_int
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
Python code:
import pandas as pd

file = 'my_file_lg_int'
data = pd.read_csv(file)
data['my_int'].sum()
The output is:
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
Since the integers are too large for a 64-bit integer, pandas reads the column as strings (object dtype), so .sum() concatenates them instead of adding.
So I tried data = pd.read_csv(file, dtype={'my_int': int})
but I get an OverflowError. How can I solve this?
CodePudding user response:
Perhaps convert each value to a Python int (which has arbitrary precision) before summing:
df["my_int"].apply(int).sum()
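A minimal end-to-end sketch of this approach, assuming the same my_file_lg_int file from the question (df is the frame returned by read_csv):

import pandas as pd

df = pd.read_csv('my_file_lg_int')      # column comes back as object dtype (strings)
total = df['my_int'].apply(int).sum()   # each string becomes a Python int before summing
print(total)                            # 333...333 (102 threes)

This works because .apply(int) converts each string to a Python int, and summing Python ints cannot overflow.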
CodePudding user response:
Many tasks are easier without hauling in the enormous pandas and numpy modules.
filename = 'my_file_lg_int'
with open(filename) as f:
    next(f)  # skip the 'my_int' header line
    mysum = sum(int(line) for line in f)  # int() ignores surrounding whitespace
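Printing mysum then shows the exact result, since Python's built-in int has arbitrary precision:

print(mysum)  # 333...333 (102 threes)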
CodePudding user response:
We can use the decimal module to solve this. According to the documentation:
Unlike hardware based binary floating point, the decimal module has a user alterable precision (defaulting to 28 places) which can be as large as needed for a given problem:
Each number in this file has 102 digits, and the sum of two 102-digit numbers has at most 103 digits, so a precision of 103 is enough here. If the inputs or the running sum had more than 103 digits, the result would be rounded, so the precision would have to be raised accordingly.
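As a quick aside to illustrate what the precision limit does, here is a sketch with an artificially small precision (the values are made up for illustration):

from decimal import Decimal, Context, setcontext

setcontext(Context(prec=5))
print(Decimal("11111") + Decimal("22222"))  # 33333 -- exact, fits in 5 digits
print(Decimal("99999") + Decimal("99999"))  # 2.0000E+5 -- 6 digits, rounded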
import pandas as pd
import decimal
from decimal import Decimal

decimal.setcontext(decimal.Context(prec=103))

file = 'my_file_lg_int'
df = pd.read_csv(file, dtype={"my_int": str})  # keep the column as strings

x = Decimal("0")
for i in df['my_int']:
    x = x + Decimal(i)  # convert each string and accumulate exactly
print(x)
print(type(x))
This gives:
333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333
<class 'decimal.Decimal'>
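If you prefer to avoid the explicit loop, the same accumulation can be written with the built-in sum and a Decimal start value (a sketch using the same df and precision as above):

total = sum((Decimal(i) for i in df['my_int']), Decimal("0"))
print(total)  # same 102-digit result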