I am using a matrix multiplication method to retrieve the positions of True and False values in an array; this is necessary because I cannot use a for loop (I have thousands of records). The procedure is the following:
import numpy as np
# Create a test array
test_array = np.array([[False, True, False, False, False, True]])
# Create a set of unique "tens", each one identifying a position
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
# Multiply the matrix
print(int(np.dot(test_array, uniq_tens)[0]))
100010
The 100010 must be read from right to left (0=False, 1=True, 0=False, 0=False, 0=False, 1=True). Everything works fine except when test_array has 20 elements.
# This works fine - Test with 21 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
21
111000000000000000010
# This works fine - Test with 19 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
19
1000000000000000010
# This does not work - Test with 20 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
20
10000000000000000000
I tested with NumPy versions 1.16.4, 1.19.4 and 1.19.5. Could you please help me understand why? I am worried it can also happen with other lengths, not only 20.
Thanks a lot for your help!
CodePudding user response:
You are hitting the int64 limit when computing uniq_tens:
print(len(str(2 ** (64 - 1))))
# 19
More precisely, what happens here is that:
- the uniq_tens content is Python's int, which is arbitrary precision
- when you call np.dot(), the uniq_tens list is converted to a NumPy array, with unspecified data type
  - when the maximum value is up to np.iinfo(np.int64).max, the data type is inferred to be int64
  - when the maximum value is between np.iinfo(np.int64).max and np.iinfo(np.uint64).max, the data type is inferred to be uint64
  - when the maximum value is above that, it retains the Python object and falls back to arbitrary precision
- there might be an extra cast if the inputs are of mixed dtype. In the case of np.int64 and np.uint64 (which is what a list mixing small values with 10 ** 19 produces), the inferred common type is np.float64.
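The inference rules above can be checked directly. This is only a minimal sketch: the helper name inferred_dtype is made up for illustration, and the float64 result for 20 elements assumes NumPy promotes a mix of int64 and uint64 values the same way as the versions mentioned in the question.

```python
import numpy as np


def inferred_dtype(n):
    """Return the dtype NumPy infers for [10**0, ..., 10**(n-1)] (hypothetical helper)."""
    return np.array([10 ** i for i in range(n)]).dtype


# Print the inferred dtype for the three lengths discussed in the question
for n in (19, 20, 21):
    print(n, inferred_dtype(n))

# int64 and uint64 share no common integer type, so they promote to float64
print(np.promote_types(np.int64, np.uint64))
```

If the 20-element case prints float64, the precision is already lost when the list is converted, before any multiplication happens.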
Now:
max_int64 = np.iinfo(np.int64).max
print(max_int64, len(str(max_int64)))
# 9223372036854775807 19
max_uint64 = np.iinfo(np.uint64).max
print(max_uint64, len(str(max_uint64)))
# 18446744073709551615 20
print(repr(np.array([max_int64])))
# array([9223372036854775807])
print(repr(np.array([max_uint64])))
# array([18446744073709551615], dtype=uint64)
print(repr(np.array([max_uint64 + 1])))
# array([18446744073709551616], dtype=object)
So, up to 19 elements and from 21 elements onward everything works well.
When you use 20, 10 ** 19 no longer fits in int64, so it does convert to uint64.
However, when you use np.dot(), it realizes it can no longer use int64 or uint64 to hold the result and casts everything to np.float64.
Instead, when you start with something that is already a long int it keeps using that:
print(np.dot([1], [max_int64]))
# 9223372036854775807
print(np.dot([1], [max_uint64]))
# 1.8446744073709552e+19
print(np.dot([1], [max_uint64 + 1]))
# 18446744073709551616
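One possible way to stay out of float64 altogether is to force the object dtype up front, so that np.dot() works on Python ints for every length. This is a sketch under that assumption, not a tuned solution; the helper name dot_exact is invented here, and object arrays are slower than native integer arrays.

```python
import numpy as np


def dot_exact(test_array):
    """Encode each row as decimal 0/1 digits using Python ints (illustrative helper)."""
    n = test_array.shape[1]
    # dtype=object keeps the powers of ten as Python ints, whatever n is
    uniq_tens = np.array([10 ** i for i in range(n)], dtype=object)
    # cast the booleans to object too, so no fixed-width dtype is involved
    return np.dot(test_array.astype(object), uniq_tens)


# The failing 20-element case from the question: True at positions 1 and 19
test_array = np.zeros((1, 20), dtype=bool)
test_array[0, [1, 19]] = True
print(int(dot_exact(test_array)[0]))
```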
CodePudding user response:
I have tested your code and indeed it looks like the error is caused by the floating point precision of the value returned by the np.dot function. You might convert it back to int, but since you have a float as an intermediate step, the conversion loses digits. Also, the fact that it works for lengths 18 and 19 is a pure coincidence - I have tested it for other test_arrays and got errors there.
In fact, I believe this is rather fortunate, because your solution won't work for larger numbers. Below you can find a one-liner which solves your problem and should work for arbitrarily large arrays:
int(''.join(reversed(test_array.astype(int).astype(str).flatten())))
What happens to the test_array here:
- convert to int to get zeroes and ones
- convert to str since we will want to concatenate
- flatten the array to make it 1D (or use a 1D input instead)
- reverse the contents using reversed
- concatenate all the individual '0' and '1' strings
- convert the output back to int
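The steps listed above can be wrapped in a small function and checked against the first example from the question. The function name bools_to_digits is only illustrative.

```python
import numpy as np


def bools_to_digits(test_array):
    """Read a boolean array right to left as a decimal number made of 0s and 1s."""
    # bool -> int -> str gives an array of '0' and '1' strings; flatten makes it 1D
    digits = np.asarray(test_array).astype(int).astype(str).flatten()
    # reverse, concatenate, and parse with Python's arbitrary-precision int
    return int("".join(reversed(digits)))


test_array = np.array([[False, True, False, False, False, True]])
print(bools_to_digits(test_array))
# 100010
```

Because the final value is parsed from a string, no fixed-width integer or float is involved at any point.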
CodePudding user response:
For your use case, I think the best approach would be:
int(
    np.binary_repr(
        (2 ** np.where(test_array)[1]).sum()
    )
)
(multi-line for clarity, since there are a lot of nested parentheses there)
np.binary_repr() returns a string that can then be cast to int directly, skipping over many of the casting problems.
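One caveat worth checking: 2 ** np.where(test_array)[1] is computed in a fixed-width integer dtype, so it can be expected to overflow once a True sits at index 63 or beyond. The sketch below is a variant rather than the snippet above verbatim: it builds the bitmask from Python ints (the helper name bools_via_binary_repr is invented) and still passes it to np.binary_repr().

```python
import numpy as np


def bools_via_binary_repr(test_array):
    """Encode a (1, n) boolean array as 0/1 digits through a Python-int bitmask."""
    positions = np.where(test_array)[1]
    # Python ints do not overflow, unlike 2 ** positions in a 64-bit dtype
    mask = sum(1 << int(p) for p in positions)
    # binary_repr returns the bits as a string, which int() then reads as decimal digits
    return int(np.binary_repr(mask))


# 70 elements, True at positions 1 and 69, well past the 64-bit range
test_array = np.zeros((1, 70), dtype=bool)
test_array[0, [1, 69]] = True
print(bools_via_binary_repr(test_array) == 10 ** 69 + 10)
```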