Understanding the Float conversion behaviour of pyspark


When I convert the Python float 77422223.0 to a Spark FloatType, I get 77422224. If I do so with DoubleType, I get 77422223. How is this conversion working and is there a way to compute when such an error will happen?

from pyspark.sql.types import FloatType

df = spark.createDataFrame([77422223.0], FloatType())
display(df)

output

[image: DataFrame showing the single value 77422224]

and as expected running

from pyspark.sql.types import DoubleType

df = spark.createDataFrame([77422223.0], DoubleType())
display(df)

yields

[image: DataFrame showing the single value 77422223]

CodePudding user response:

How is this conversion working…

I assume the Spark FloatType is IEEE-754 binary32. This format uses a 24-bit significand and an exponent range from −126 to 127. Each number is represented as a sign and a 24-bit binary numeral with a “.” after its first digit, multiplied by two to the power of an exponent, such as 1.01001100001111100000000₂ • 2¹³.
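As an aside, here is a small sketch in plain Python (only the standard struct module, nothing Spark-specific) that packs a value into binary32 and splits the 32 bits into sign, exponent, and stored significand fields, for normal finite values:

import struct

def binary32_fields(x):
    # Pack x as big-endian IEEE-754 binary32, then reinterpret the four bytes as an integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the bias of 127 (normal values only)
    fraction = bits & 0x7FFFFF               # the 23 stored bits; the leading 1 is implicit
    return sign, exponent, fraction

print(binary32_fields(77422223.0))   # exponent 26; the fraction holds the rounded significand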

In binary, 77,422,223 is 100100111010101111010001111₂. That is 27 bits, so it cannot be represented in the binary32 format. When it is converted to the binary32 format, the conversion operation rounds it to the nearest representable value. That is 100100111010101111010010000₂ (77,422,224), which has 23 significant digits.
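You can reproduce the same rounding without Spark; a minimal sketch using the standard struct module (packing with the "f" format forces the conversion to binary32, and unpacking converts back to a Python float):

import struct

x = 77422223.0
rounded = struct.unpack("f", struct.pack("f", x))[0]   # round-trip through binary32
print(rounded)   # 77422224.0, the nearest value representable in binary32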

… and is there a way to compute when such an error will happen?

When the number is written in binary, if the number of bits from the first 1 to its last 1, including both of those, is more than 24, then it cannot be represented in the binary32 format.
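For integer values, that rule can be checked directly by stripping trailing zero bits and counting what remains; a small sketch (the helper name is made up for illustration):

def needs_more_than_24_bits(n):
    # Count the bits from the first 1 to the last 1 of |n|; trailing zeros do not count.
    n = abs(int(n))
    if n == 0:
        return False
    while n % 2 == 0:
        n //= 2
    return n.bit_length() > 24

print(needs_more_than_24_bits(77422223))   # True  -> rounds when converted to binary32
print(needs_more_than_24_bits(77422224))   # False -> exactly representable in binary32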

Also, if the magnitude of the number is less than 2⁻¹²⁶, it cannot be represented in binary32 unless it is a multiple of 2⁻¹⁴⁹, including zero. Numbers in this range are subnormal and have a fixed exponent of −126, and the lowest bit of the significand has a position value of 2⁻¹⁴⁹. And, if the magnitude of the number is 2¹²⁸ or greater, it cannot be represented, unless it is ∞ or −∞.
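If you would rather not reason about the bits at all, a general test is simply to round-trip the value through binary32 and see whether it survives unchanged. Another sketch with the standard library (the try/except guards against the OverflowError that CPython's struct raises for finite values too large to pack as binary32):

import math
import struct

def representable_in_binary32(x):
    # True if the binary64 value x converts to binary32 and back without change.
    try:
        return struct.unpack("f", struct.pack("f", x))[0] == x
    except OverflowError:   # finite value whose magnitude is too large for binary32
        return False

print(representable_in_binary32(77422223.0))             # False, rounds to 77422224.0
print(representable_in_binary32(math.ldexp(1.0, -149)))  # True, the smallest binary32 subnormal
print(representable_in_binary32(math.ldexp(1.0, 128)))   # False, 2^128 is too large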
