With NumPy, I'm trying to understand the maximum value that can be downcast from float64 to float32 with an accuracy loss of at most 0.001.
Since I could not find a simple explanation online, I quickly came up with this piece of code to test:
import numpy as np

result = {}
for j in range(1, 1000):
    for i in range(1, 1_000_000):
        num = i + j / 1000  # integer part i plus a fractional part j/1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        if abs(x[0] - y[0]) > 0.001:
            result[j] = i  # first integer part where float32 loses too much accuracy
            break
Based on the results, it seems any positive value below 32768 can be safely downcast from float64 to float32 with an acceptable accuracy loss (given the criterion of <= 0.001).
Is this correct? Could someone explain the math behind it?
Thanks a lot
CodePudding user response:
Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for “denormal” numbers).
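You can read those precisions directly from NumPy (a quick check; finfo's nmant is the number of explicitly stored mantissa bits, so add 1 for the implicit leading bit):

import numpy as np

print(np.finfo(np.float32).nmant + 1)  # 24 significant bits
print(np.finfo(np.float64).nmant + 1)  # 53 significant bits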
To represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means numbers are rounded to the nearest multiple of 1/512. The maximum rounding error is then half that spacing, 1/1024 = 0.0009765625 < 0.001.
With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integer parts up to 2^15 = 32768, as you have experimentally determined.
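A quick way to verify this with NumPy is np.spacing, which returns the ULP (the gap between a value and the next representable one); the worst-case rounding error is half the ULP. A minimal sketch:

import numpy as np

# Half-ULP (worst-case rounding error) just below and just above 2^15:
print(np.spacing(np.float32(32767.0)) / 2)  # 0.0009765625 -> within 0.001
print(np.spacing(np.float32(32769.0)) / 2)  # 0.001953125  -> exceeds 0.001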
CodePudding user response:
A float32 has 23 stored mantissa bits, i.e. 24 significant bits counting the implicit leading bit, so its relative rounding error is at most 2^-24. The absolute error is therefore smaller than 0.001 for all values up to 0.001 x 2^24 ~ 16777. Note this is a conservative worst-case bound; since the spacing only doubles at each power of two, the exact cutoff is 2^15 = 32768, as in the answer above.
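To see both the conservative bound and the exact cutoff in action, here is a small sketch that measures the float64 -> float32 round-trip error directly (the test values are arbitrary picks near the two thresholds):

import numpy as np

def roundtrip_error(value):
    # Absolute error introduced by storing a float64 value as float32.
    return abs(float(np.float32(value)) - value)

print(roundtrip_error(16777.123))  # below 0.001 * 2^24: safely under 0.001
print(roundtrip_error(32767.999))  # still under 0.001, up to 2^15
print(roundtrip_error(32769.002))  # above 2^15: error can exceed 0.001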