What is a bfloat16_t in the C++23 standard?

Cppreference documents that <stdfloat> includes 5 new types: float16_t, float32_t, float64_t, float128_t and bfloat16_t. While the first 4 types are self-explanatory (a float with 16, 32, 64, and 128 bits respectively), the last type, bfloat16_t, is not at all clear to me. What does this type represent? What does the b in its name mean?

CodePudding user response：

"bfloat16" refers to a fairly recent 16-bit floating-point format that is not a valid IEEE-754/IEC 60559 defined format. But it is related to them.

BINARY16 is just BINARY32 with smaller numbers for its components. But the size reduction is spread across both fields: it has a smaller mantissa (10 explicit bits instead of 23) and a smaller exponent (5 bits instead of 8).

Bfloat16 opts for a different way to spend its 16 bits. It elects to keep the exponent the same size as BINARY32 (8 bits) while shrinking the mantissa down to 7 bits (explicit). This makes converting between bfloat16 and BINARY32 cheap: widening just appends 16 zero bits, and narrowing only has to drop (or round away) the low 16 bits of the mantissa.
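As a rough illustration of why that conversion is cheap, here is a minimal sketch that stores a bfloat16 value in a plain std::uint16_t. The helper names (float_bits, float_to_bf16_truncate, bf16_to_float, bf16_self_check) are made up for this example and are not part of <stdfloat>, and the narrowing step simply truncates, whereas a real implementation would usually round to nearest and handle NaN specially.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: float is IEEE-754 binary32, which holds on mainstream platforms.
static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");

// Hypothetical helper: view a float's bit pattern as a 32-bit integer.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: narrow binary32 to bfloat16 by keeping only the top
// 16 bits (sign, 8 exponent bits, 7 mantissa bits). This rounds toward zero.
// Caveat: a NaN whose payload sits entirely in the low 16 bits would turn
// into an infinity here, so production code needs an extra check.
inline std::uint16_t float_to_bf16_truncate(float f) {
    return static_cast<std::uint16_t>(float_bits(f) >> 16);
}

// Hypothetical helper: widen bfloat16 to binary32. Appending 16 zero bits is
// exact, because every bfloat16 value is also a binary32 value.
inline float bf16_to_float(std::uint16_t b) {
    std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Small sanity checks that double as a usage example.
inline void bf16_self_check() {
    assert(float_to_bf16_truncate(1.0f) == 0x3F80);  // 1.0f is 0x3F800000
    assert(bf16_to_float(0x4049) == 3.140625f);      // pi cut down to 8 significant bits
}
```

Note that no exponent re-biasing or range check is needed in either direction, which is exactly what a BINARY16 conversion cannot avoid.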

It does have some special-case weirdness, but overall, it's a truncated BINARY32. And while it's supported by a surprising amount of GPU hardware, it's not truly a standard.

The utility of bfloat16 comes down to two things: the speed of conversion, and the compromises of BINARY16.

BINARY16 is a great format... for colors. Its largest finite value is 65504, so it tops out just under 5 decimal orders of magnitude above zero, which is adequate for many cases of high-dynamic range rendering.
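The range difference can be checked with a little arithmetic. The sketch below assumes an IEEE-style layout (all-ones exponent reserved for infinity/NaN, bias of 2^(e-1) - 1), which matches BINARY16 and BINARY32 and is how bfloat16 is normally described; max_finite and range_self_check are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: largest finite value of an IEEE-style binary format
// with the given exponent and explicit-mantissa field widths.
inline double max_finite(int exponent_bits, int mantissa_bits) {
    // Largest unbiased exponent: 15 for 5 exponent bits, 127 for 8.
    int max_exponent = (1 << (exponent_bits - 1)) - 1;
    // Largest significand is 1.111...1 in binary, i.e. 2 - 2^-mantissa_bits.
    double max_significand = 2.0 - std::ldexp(1.0, -mantissa_bits);
    return std::ldexp(max_significand, max_exponent);
}

// Small sanity checks that double as a usage example.
inline void range_self_check() {
    assert(max_finite(5, 10) == 65504.0);                             // BINARY16
    assert(max_finite(8, 23) == std::numeric_limits<float>::max());  // BINARY32
}
```

Under those assumptions, max_finite(8, 7) for bfloat16 comes out at roughly 3.39e38, nearly the same ceiling as BINARY32, while BINARY16 stops at 65504.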

But that limited range is a problem for machine learning operations. If precision matters, you're going to have to spend 32 bits to get it. But if precision isn't all that important, being able to save 16 bits without giving up BINARY32's range can be useful in these applications.