Cppreference documents that the <stdfloat> header includes 5 new types: float16_t, float32_t, float64_t, float128_t and bfloat16_t. While the first 4 types are self-explanatory (a float with 16, 32, 64, and 128 bits respectively), the last type, bfloat16_t, is not at all clear to me. What does this type represent? What does the b in its name mean?
CodePudding user response:
"bfloat16" refers to a fairly recent 16-bit floating-point format that is not one of the formats defined by IEEE-754/IEC 60559, though it is closely related to them. The "b" stands for "brain": the format originated at Google Brain.
BINARY16 is just BINARY32 with smaller numbers for its components. The size reduction is spread across both fields: it has a smaller mantissa (10 explicit bits instead of 23) and a smaller exponent (5 bits instead of 8).
Bfloat16 opts for a different way to spend its 16 bits. It keeps the exponent the same size as BINARY32's (8 bits) while shrinking the mantissa down to 7 explicit bits. Since the sign and exponent fields line up with the top 16 bits of a BINARY32, converting between bfloat16 and BINARY32 is a faster operation.
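To make the "same exponent, shorter mantissa" layout concrete, here is a minimal sketch of the conversion done by hand on raw bits. The helper names float_to_bf16_bits and bf16_bits_to_float are hypothetical (they are not part of <stdfloat>), the code assumes float is a 32-bit IEEE-754 BINARY32, and it narrows by plain truncation. Real implementations typically round to nearest and treat NaN specially, so take this as an illustration of the layout rather than what std::bfloat16_t does.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: narrow a float to bfloat16 bits by keeping the top
// 16 bits (sign, 8 exponent bits, 7 mantissa bits). Truncation only:
// no rounding and no NaN special-casing.
inline std::uint16_t float_to_bf16_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined type punning
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen bfloat16 bits back to float by appending
// 16 zero bits to the mantissa. This direction is always exact.
inline float bf16_bits_to_float(std::uint16_t bf16) {
    const std::uint32_t bits = static_cast<std::uint32_t>(bf16) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Widening loses nothing, while narrowing discards the low 16 mantissa bits, so a round trip of 3.14159f comes back as 3.140625f.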
It does have some special-case weirdness, but overall, it's a truncated BINARY32. And while a surprising amount of GPU hardware supports it, it's not truly a standard.
The utility of bfloat16 comes down to two things: the speed of conversion, and the compromises of BINARY16.
BINARY16 is a great format... for colors. Having a maximum range of only 5 decimal orders of magnitude above zero is adequate for many cases of high-dynamic range rendering.
But this limited range is a problem for machine learning operations. If precision matters, you're going to have to spend 32 bits to get it. But if precision isn't all that important, being able to save 16 bits without giving up BINARY32's range can be useful in these applications.