Numpy multiplication using * (asterisk) returning wrong values when using named variables

Time:08-09

I am running into a problem using the operator * with numpy scalars, and it would be great if someone can explain what is going on.

Basically, I needed to multiply the sums of columns and rows from various dataframes, and the easiest way to do that was to assign each aggregate to a variable, and then multiply those variables together.

The following block of code demonstrates the problem:

import pandas as pd

#define a list of dicts: four columns a-d, five rows with progressively larger values
mydict = [{"a":10,     "b":20,     "c": 30,     "d": 40}, 
          {"a":100,    "b":200,    "c": 300,    "d": 400}, 
          {"a":1000,   "b":2000,   "c": 3000,   "d": 4000}, 
          {"a":10000,  "b":20000,  "c": 30000,  "d": 40000}, 
          {"a":100000, "b":200000, "c": 300000, "d": 400000}] 

#create dataframe
df = pd.DataFrame(mydict)

#assign sum of each column to variable
a_sum = df.iloc[:,0].sum()
b_sum = df.iloc[:,1].sum()
c_sum = df.iloc[:,2].sum()
d_sum = df.iloc[:,3].sum()

print(a_sum, b_sum, c_sum, d_sum)
print(type(a_sum))

# output is: 
#111110 222220 333330 444440
#<class 'numpy.int64'>

Then, I multiply the resulting sums using both hardcoded and variable approaches and receive two different results:

#copy-pasted column sums from output above, multiply together
no_vars = 111110 * 222220 * 333330 * 444440

#multiply variables together (should be identical to line above)
with_vars = a_sum * b_sum * c_sum * d_sum

#compare the outputs, expect the results to be 1 here
print(no_vars/with_vars)

#output is 
#680.233

I'm guessing this has something to do with how numpy treats the * operator, but I have not been able to find a definitive explanation about what is going on and how to avoid this problem.

Note that the following workaround that removes numpy from the question returns 1 as expected:

no_vars = 111110 * 222220 * 333330 * 444440

with_vars = int(a_sum) * int(b_sum) * int(c_sum) * int(d_sum)

print(no_vars/with_vars)

Thanks in advance!

CodePudding user response:

Here is a much simpler example which should trivially illustrate the issue:

>>> import numpy as np
>>> a = np.uint32(0xFFFFFFFF) + np.uint32(1)
>>> b = 0xFFFFFFFF + 1
>>> print(a, b)
<ipython-input-3-1b01bc70582a>:1: RuntimeWarning: overflow encountered in uint_scalars
  a = np.uint32(0xFFFFFFFF) + np.uint32(1)
0 4294967296
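
The RuntimeWarning in the session above can also be turned into an exception, so an overflow stops the calculation instead of passing through quietly. The following is a minimal sketch using np.errstate; the helper name checked_product is made up for illustration, and as far as I know this check only applies to NumPy scalar operations, not to integer arrays.

```python
import numpy as np

def checked_product(values):
    """Multiply NumPy integer scalars, raising on overflow instead of wrapping.

    Hypothetical helper for illustration; it relies on np.errstate(over="raise").
    """
    result = values[0]
    with np.errstate(over="raise"):
        for v in values[1:]:
            result = result * v
    return result

sums = [np.int64(111110), np.int64(222220), np.int64(333330), np.int64(444440)]

try:
    checked_product(sums)
    overflowed = False
except FloatingPointError:
    # NumPy reports the integer overflow through its floating-point error machinery.
    overflowed = True

print(overflowed)
```

A product that fits in int64, such as checked_product([np.int64(2), np.int64(3)]), is returned as usual.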

NumPy scalars can only store as many bits as their type allows; when a result needs more, the high bits are lost and the value wraps around. Python integers, by contrast, have arbitrary precision and grow as needed. Depending on what you want to achieve, you can design different approaches.
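
To connect this back to the numbers in the question, here is a rough sketch that compares the exact product with the int64 limit and reproduces the wrap-around by hand. The helper wrap_to_int64 is a made-up name, and it assumes the overflow behaves like two's-complement wrap-around modulo 2**64, which is consistent with the 680.233 ratio reported in the question.

```python
import numpy as np

# Exact product, computed with arbitrary-precision Python integers.
exact = 111110 * 222220 * 333330 * 444440

# Largest value a signed 64-bit integer can hold.
int64_max = np.iinfo(np.int64).max
print(exact > int64_max)  # True: the product does not fit in int64

def wrap_to_int64(n):
    """Reduce a Python int modulo 2**64 into the signed 64-bit range.

    Hypothetical helper; assumes two's-complement wrap-around.
    """
    return (n + 2**63) % 2**64 - 2**63

wrapped = wrap_to_int64(exact)
print(wrapped)          # 5377323062558620032
print(exact / wrapped)  # about 680.233, the ratio reported in the question
```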

CodePudding user response:

The problem is that you are using fixed-width integers (int64), which can only hold values between a fixed minimum and maximum, and you are trying to represent a number larger than that maximum (integer overflow). You could either use variable-size integers (like the arbitrary-precision int that Python uses) or switch to floats, which trade some precision for a much larger range.
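
The first option, Python integers, might look like the sketch below. The helper name exact_product is hypothetical; the idea is just to convert each NumPy scalar with int() before multiplying, which is what the workaround in the question does by hand. math.prod needs Python 3.8 or later.

```python
import math
import numpy as np

def exact_product(values):
    """Multiply integer-like values exactly by converting each to a Python int first."""
    return math.prod(int(v) for v in values)

# Stand-ins for the column sums from the question.
sums = [np.int64(111110), np.int64(222220), np.int64(333330), np.int64(444440)]

result = exact_product(sums)
print(result)        # 3657832649657049840000
print(type(result))  # <class 'int'>
```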

In practice, you can convert one of the _sum variables to float before multiplying; NumPy then promotes the whole product to float64, so it no longer overflows:

a_sum = a_sum.astype(np.float64)  # np.float_ was removed in NumPy 2.0; np.float64 is the same type

With this change, the following code:

no_vars = 111110 * 222220 * 333330 * 444440

a_sum = a_sum.astype(np.float64)  # np.float_ was removed in NumPy 2.0

with_vars = a_sum * b_sum * c_sum * d_sum

print(no_vars/with_vars)

will print a value of 1.0.

Note that this apparently exact result is a consequence of this specific calculation and of how the numbers get converted.

In general, results obtained with float arithmetic and big int arithmetic will be different, e.g.:

print(no_vars)
# 3657832649657049840000
print(with_vars)
# 3.6578326496570497e+21
print(float(no_vars))
# 3.6578326496570497e+21
print(int(with_vars))
# 3657832649657049677824
print(no_vars == with_vars)
# False
print(float(no_vars) == with_vars)
# True
print(no_vars == int(with_vars))
# False
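
The size of the discrepancy above can be checked directly. The sketch below is only an illustration: it uses plain Python floats (which are 64-bit doubles) and math.ulp, which needs Python 3.9 or later, to show how coarse the float grid is near this product.

```python
import math

exact = 111110 * 222220 * 333330 * 444440
as_float = float(exact)

# float64 has a 53-bit significand, so above 2**53 not every integer is representable.
print(float(2**53 + 1) == float(2**53))  # True

# Gap between adjacent floats near the product.
print(math.ulp(as_float))  # 524288.0

# Rounding error introduced by converting the exact product to float.
print(exact - int(as_float))  # 162176
```

So near 3.66e21 adjacent floats are 524288 apart, and the exact product falls between two of them.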