NumPy matrix multiplication issue with 20 elements

I am using a matrix multiplication method to encode the positions of the True and False values of an array; this is necessary because I cannot use a for loop (I have thousands of records). The procedure is the following:

import numpy as np
# Create a test array
test_array = np.array([[False, True, False, False, False, True]])
# Create a set of unique "tens", each one identifying a position
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
# Multiply the matrix
print(int(np.dot(test_array, uniq_tens)[0]))
100010

The 100010 must be read from right to left (0=False, 1=True, 0=False, 0=False, 0=False, 1=True). Everything works fine except when test_array has 20 elements.

# This works fine - Test with 21 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
21
111000000000000000010

# This works fine - Test with 19 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
19
1000000000000000010

# This does not work - Test with 20 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
20
10000000000000000000

I tested with NumPy versions 1.16.4, 1.19.4, and 1.19.5. Could you please help me understand why? I am worried this can also happen with other lengths, not only 20.

Thanks a lot for your help!

CodePudding user response：

You are hitting the int64 limit:

print(len(str(2 ** (64 - 1))))
# 19

when computing uniq_tens.

More precisely, what happens here is that:

- The contents of uniq_tens are Python ints, which have arbitrary precision.
- When you call np.dot(), the uniq_tens list is converted to a NumPy array with no data type specified, so the dtype is inferred:
  - when the maximum value is at most np.iinfo(np.int64).max, the dtype is inferred as int64
  - when the maximum value lies between np.iinfo(np.int64).max and np.iinfo(np.uint64).max, the dtype is inferred as uint64
  - when the maximum value is above that, the array keeps the Python objects and falls back to arbitrary precision
- There might be an extra cast in np.dot() if the inputs have mixed dtypes. For np.int64 and np.uint64, for example, the common type is np.float64, because no 64-bit integer type covers both ranges.

Now:

max_int64 = np.iinfo(np.int64).max
print(max_int64, len(str(max_int64)))
# 9223372036854775807 19

max_uint64 = np.iinfo(np.uint64).max
print(max_uint64, len(str(max_uint64)))
# 18446744073709551615 20

print(repr(np.array([max_int64])))
# array([9223372036854775807])
print(repr(np.array([max_uint64])))
# array([18446744073709551615], dtype=uint64)
print(repr(np.array([max_uint64 + 1])))
# array([18446744073709551616], dtype=object)

So, up to 19 elements and from 21 elements on, everything works well. With 20 elements, the conversion yields uint64. However, np.dot() then finds that it can use neither int64 nor uint64 to hold the result and casts everything to np.float64, which cannot represent every integer of that size exactly.
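The effect of a float64 intermediate can be reproduced in isolation. The following is a minimal sketch, not part of the original answer; the variable names are illustrative, and it only assumes that the computation ends up in float64 as described above.

```python
import numpy as np

# int64 and uint64 have no common 64-bit integer type,
# so NumPy's promotion rules give float64 for this pair.
common = np.promote_types(np.int64, np.uint64)
print(common)  # float64

# Expected result for the 20-element example: True at positions 1 and 19.
expected = 10**19 + 10

# float64 has a 53-bit significand; near 10**19 neighbouring floats are
# 2048 apart, so the trailing "10" is rounded away.
round_trip = int(float(expected))
print(round_trip)  # 10000000000000000000
```

The round trip returns exactly the value reported in the question for the 20-element case.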

Instead, when you start with something that is already a long int it keeps using that:

print(np.dot([1], [max_int64]))
# 9223372036854775807
print(np.dot([1], [max_uint64]))
# 1.8446744073709552e+19
print(np.dot([1], [max_uint64 + 1]))
# 18446744073709551616
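Building on the object fallback shown above, one possible workaround (an assumption on my part, not something the original answer proposes) is to request dtype=object explicitly, so that every length goes through Python's arbitrary-precision ints. A minimal sketch:

```python
import numpy as np

# 20 elements, True at positions 1 and 19 (the failing case from the question).
test_array = np.array([[False, True] + [False] * 17 + [True]])

# dtype=object keeps the Python ints instead of a fixed-width integer type.
uniq_tens = np.array([10 ** i for i in range(test_array.shape[1])], dtype=object)

# Casting the booleans to object as well keeps the whole product in Python ints.
result = np.dot(test_array.astype(object), uniq_tens)[0]
print(result)  # 10000000000000000010
```

Object arrays are noticeably slower than native dtypes, so this trades speed for exactness.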

CodePudding user response：

I have tested your code, and the error does indeed appear to come from floating-point precision in the result of np.dot. You can convert that result back to int, but because the intermediate value is a float, the conversion loses digits. Also, the fact that it works for lengths 18 and 19 is pure coincidence - I have tested it with other test_arrays and got errors there.

In fact, I believe this is rather fortunate, because your solution won't work for larger numbers. Below you can find a one-liner which solves your problem and should work for arbitrarily large arrays:

int(''.join(reversed(test_array.astype(int).astype(str).flatten())))

What happens to the test_array here:

- convert to int to get zeroes and ones
- convert to str since we will want to concatenate
- flatten the array to make it 1D (or use a 1D input instead)
- reverse the contents using reversed
- concatenate all the individual '0' and '1' strings
- convert the output back to int
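The steps above can be wrapped in a small function. This is a sketch only; bools_to_digits is a hypothetical name, and it assumes one record (one row) per call.

```python
import numpy as np

def bools_to_digits(row):
    """Hypothetical helper: encode booleans as an int made of 0/1 decimal
    digits, with index 0 as the rightmost digit."""
    flags = np.asarray(row, dtype=bool).ravel()   # flatten to 1D
    digits = flags.astype(int).astype(str)        # '0' / '1' strings
    return int("".join(digits[::-1]))             # reverse, join, parse

test_array = np.array([[False, True, False, False, False, True]])
print(bools_to_digits(test_array))  # 100010

# The 20-element case from the question keeps its trailing digits.
twenty = np.array([[False, True] + [False] * 17 + [True]])
print(bools_to_digits(twenty))  # 10000000000000000010
```

Because the final value is parsed from a string into a Python int, no fixed-width or floating-point type is involved at any point.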

CodePudding user response：

For your use case, I think the best approach would be:

int( 
    np.binary_repr( 
                   (2 ** np.where(test_array)[1]).sum()
                  ) 
   )

(multi-line for clarity, since there are a lot of nested parentheses there)

np.binary_repr() returns a string that can then be cast to int directly, skipping over many of the casting problems.
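One caveat worth checking: 2 ** np.where(test_array)[1] is computed in a fixed-width integer dtype (int64, assuming a typical 64-bit platform), so it can overflow once a True sits at position 63 or later. The sketch below keeps the same bit-based idea but builds the value from Python ints; bools_to_digits_via_bits is a hypothetical name, not part of the original answer.

```python
import numpy as np

def bools_to_digits_via_bits(row):
    """Hypothetical helper: set one bit per True position, then reinterpret
    the binary string as a decimal int of 0/1 digits."""
    positions = np.flatnonzero(np.asarray(row, dtype=bool))
    # Python ints do not overflow, unlike a fixed-width NumPy integer.
    value = sum(1 << int(p) for p in positions)
    return int(format(value, "b"))

test_array = np.array([[False, True, False, False, False, True]])
print(bools_to_digits_via_bits(test_array))  # 100010

# A 70-element row, which is beyond what int64 powers of two can hold.
long_row = np.zeros(70, dtype=bool)
long_row[[1, 69]] = True
print(bools_to_digits_via_bits(long_row) == 10**69 + 10)  # True
```

The generator over the True positions is a Python-level loop, so this version is slower on rows with many True values.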