With NumPy, I'm trying to understand the maximum value that can be downcast from float64 to float32 with an accuracy loss of at most 0.001.
Since I could not find a simple explanation online, I quickly came up with this piece of code to test:
import numpy as np

result = {}
for j in range(1, 1000):
    for i in range(1, 1_000_000):
        num = i + j / 1000  # integer part i plus a fractional part j/1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        if abs(x[0] - y[0]) > 0.001:
            result[j] = i  # first integer part where float32 loses too much accuracy
            break
Based on the results, it seems any positive value below 32768 can be safely downcast from float64 to float32 with an acceptable accuracy loss (given the criterion of <= 0.001).
Is this correct? Could someone explain the math behind it?
Thanks a lot
CodePudding user response:
Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for “denormal” numbers).
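You can read those precisions directly from NumPy (a quick check; finfo's nmant is the number of explicitly stored mantissa bits, so add 1 for the implicit leading bit):

import numpy as np

print(np.finfo(np.float32).nmant + 1)  # 24 significant bits
print(np.finfo(np.float64).nmant + 1)  # 53 significant bits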
To represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means numbers are rounded to the nearest multiple of 1/512. The maximum rounding error is then half that spacing, 1/1024 = 0.0009765625 < 0.001.
With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integer parts up to 2^15 = 32768, as you have experimentally determined.
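A quick way to verify this with NumPy is np.spacing, which returns the ULP (the gap between a value and the next representable one); the worst-case rounding error is half the ULP. A minimal sketch:

import numpy as np

# Half-ULP (worst-case rounding error) just below and just above 2^15:
print(np.spacing(np.float32(32767.0)) / 2)  # 0.0009765625 -> within 0.001
print(np.spacing(np.float32(32769.0)) / 2)  # 0.001953125  -> exceeds 0.001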
CodePudding user response:
A float32 has 23 stored mantissa bits, i.e. 24 significant bits counting the implicit leading bit, so its relative rounding error is at most 2^-24. The absolute error is therefore smaller than 0.001 for all values up to 0.001 x 2^24 ~ 16777. Note this is a conservative worst-case bound; since the spacing only doubles at each power of two, the exact cutoff is 2^15 = 32768, as in the answer above.
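To see both the conservative bound and the exact cutoff in action, here is a small sketch that measures the float64 -> float32 round-trip error directly (the test values are arbitrary picks near the two thresholds):

import numpy as np

def roundtrip_error(value):
    # Absolute error introduced by storing a float64 value as float32.
    return abs(float(np.float32(value)) - value)

print(roundtrip_error(16777.123))  # below 0.001 * 2^24: safely under 0.001
print(roundtrip_error(32767.999))  # still under 0.001, up to 2^15
print(roundtrip_error(32769.002))  # above 2^15: error can exceed 0.001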