Precision limits of numpy floats?


Given the sequence

1/1, 1/2, 1/3, ... , 1/n

How can I calculate the point at which I can no longer distinguish, with precision E, between two consecutive elements 1/i and 1/(i + 1) if I use numpy.float16? I.e. what is 'i'?

What about the other numpy float types?

What is the smallest E? And how do I calculate 'i' for it?

For example, if E = 0.01 I can distinguish between 1/9 and 1/10, but not between 1/10 and 1/11, because:

1/9 = 0.111
1/10 = 0.100
1/11 = 0.091

0.111 - 0.100 = 0.011 >= E
0.100 - 0.091 = 0.009 < E

i = 10
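The worked example can be checked with a couple of lines of plain Python (E = 0.01 is the tolerance from the example above):

```python
# Tolerance E from the example above.
E = 0.01

# Differences between consecutive reciprocals around i = 10.
d1 = 1/9 - 1/10    # 1/90  ~ 0.0111, still resolvable
d2 = 1/10 - 1/11   # 1/110 ~ 0.0091, below E

print(d1 >= E, d2 < E)  # True True
```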

More abstractly, given f(i), what is the maximum 'i' for which consecutive values f(i) and f(i + 1) are still distinguishable in np.floatXX?


Interestingly, the precision in practice is worse than calculated. Here is the place where the logic breaks:

for i in range(int(1e3), int(12e6)):
    if not np.floatXX(1/i) > np.floatXX(1/(i + 1)):
        print(i); break

float32: 11864338
float16: 1464
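For reference, the scan above can be made concrete by passing the dtype in explicitly (a sketch; first_indistinguishable is just a name for the loop above, and only the float16 case is run here, since the float32 scan takes much longer):

```python
import numpy as np

def first_indistinguishable(dtype, start=1000, stop=12_000_000):
    # First i where dtype(1/i) is no longer strictly greater
    # than dtype(1/(i + 1)).
    for i in range(start, stop):
        if not dtype(1/i) > dtype(1/(i + 1)):
            return i
    return None

print(first_indistinguishable(np.float16))  # 1464, as reported above
```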

CodePudding user response:

Instead of adding 1, double the denominator. You can safely assume that the breakdown point is a power of two. Here is one simple method:

import numpy as np

one = np.float64(1.0)
two = np.float64(2.0)
n = one
bits = 0
while one + n != one:
    bits += 1
    n /= two

You start with bits = 0 because otherwise you would get the count of bits that took you past resolution.

In the end, you get bits = 53, which is the number of bits in an IEEE-754 encoded 64-bit floating point number.

That means that for any number, which is encoded in what is effectively binary scientific notation, the ULP (unit of least precision) is approximately n * 2**-53, where n is the number rounded down to its nearest power of two (i.e. keeping only its highest bit). You won't be able to resolve smaller relative changes in a float.

Bonus: Running the above code for the other floating point types gives:

float16 (half):   11 bits
float32 (single): 24 bits
float64 (double): 53 bits
float96/float128 (np.longdouble on x86, 80-bit extended): 64 bits
IEEE binary128 (true quad, where numpy supports it): 113 bits
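These counts can be cross-checked against numpy's own metadata: np.finfo(dtype).nmant gives the number of explicitly stored mantissa bits, so nmant + 1 (adding the implicit leading bit) should match the loop's result. A sketch covering only float16/32/64, since longdouble varies by platform:

```python
import numpy as np

def count_bits(dtype):
    # Halve n until adding it to 1 no longer changes anything.
    one, two = dtype(1.0), dtype(2.0)
    n, bits = one, 0
    while one + n != one:
        bits += 1
        n /= two
    return bits

for dt in (np.float16, np.float32, np.float64):
    # nmant excludes the implicit leading bit, hence the + 1.
    assert count_bits(dt) == np.finfo(dt).nmant + 1
    print(dt.__name__, count_bits(dt))  # 11, 24, 53
```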

You can modify the code above to work for any target number:

import numpy as np

target = np.float16(0.0004883)
one = np.float16(1.0)
two = np.float16(2.0)
n = two**(np.floor(np.log2(target)) - one)
bits = 0
while target + n != target:
    bits += 1
    n /= two

The result (ULP) is given by n * two, since the loop only stops after you lose resolution. This is the same reason we start with bits = 0. In this case:

>>> n * two
5e-07

You can entirely short-circuit the computation if you know the number of bits in the mantissa up front. The ULP sits bits - 1 binary places below the leading bit, so for float16, where bits = 11, you can do

>>> two**(np.floor(np.log2(target)) - np.float16(bits - 1))
5e-07
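numpy can also report this value directly: np.spacing(x) returns the gap between x and the next representable number in x's dtype, which should agree with n * two from the loop:

```python
import numpy as np

target = np.float16(0.0004883)

# Gap between target and the next representable float16,
# i.e. the ULP at this magnitude (2**-21 here).
print(np.spacing(target))  # 5e-07
```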


CodePudding user response:

My other answer provides the theory behind what you are actually asking, but needs some non-trivial interpretation. Here is the missing step:

Given some integer i, you can write

1 / i - 1 / (i + 1) =
(i + 1 - i) / (i * (i + 1)) =
1 / (i * (i + 1)) =
1 / (i**2 + i)

To find an i such that 1 / (i**2 + i) falls below the ULP of 1 / i in some binary representation, you can use my other answer pretty directly.

The resolvable difference for 1 / i works out to be about

ulp = 2**(floor(log2(1 / i)) - (bits + 1))

You can try to find an i such that

1 / (i**2 + i) < 2**(floor(log2(1 / i)) - (bits + 1))
1 / (i**2 + i) < 2**floor(log2(1 / i)) / 2**(bits + 1)
2**(bits + 1) < (i**2 + i) * 2**floor(log2(1 / i))

It's not trivial as written because of the floor operation, and Wolfram Alpha runs out of time. Since I am cheap and don't want to buy Mathematica, let's just approximate:

2**(bits + 1) < (i**2 + i) * 2**floor(log2(1 / i))
2**(bits + 1) < (i**2 + i) / i        (approximating 2**floor(log2(1 / i)) as 1 / i)
2**(bits + 1) < i + 1

You may be off by one or so, but you should see that around i = 2**(bits + 1) - 1, the difference stops being resolvable. And indeed, for the 11-bit mantissa of float16, we see:

>>> np.float16(1 / (2**12 - 1)) - np.float16(1 / (2**12))
0.0

The actual number is a tiny bit less here (remember that approximation where we took away the floor):

>>> np.float16(1 / (2**12 - 5)) - np.float16(1 / (2**12 - 4))
0.0
>>> np.float16(1 / (2**12 - 6)) - np.float16(1 / (2**12 - 5))
2.4e-07

As you noted in your comments, i is

>>> 2**12 - 6
4090

You can compute the exact values for all the other floating point types in a similar manner. But that is indeed left as an exercise for the reader.
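As a small start on that exercise, here is a sketch that evaluates the 2**(bits + 1) estimate for each type and repeats the float16 spot check around the predicted scale (the exact i still requires probing by hand, as shown above):

```python
import numpy as np

# Predicted breakdown scale from the derivation: i ~ 2**(bits + 1).
for name, bits in [("float16", 11), ("float32", 24), ("float64", 53)]:
    print(name, 2**(bits + 1))

# Repeat the float16 spot checks near the predicted scale of 4096.
scale = 2**12
same = np.float16(1 / (scale - 5)) == np.float16(1 / (scale - 4))  # True
diff = np.float16(1 / (scale - 6)) != np.float16(1 / (scale - 5))  # True
print(same, diff)
```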
