I understand that floating point precision has only so many bits. It comes as no surprise that the following code thinks that (float)(UINT64_MAX) and (float)(UINT64_MAX - 1) are equal. I am trying to write a function that would detect this type of, for lack of a proper term, "conversion overflow". I thought I could somehow use FLT_MAX, but that's not correct. What's the right way to do this?
#include <iostream>
#include <cstdint>

int main()
{
    uint64_t x1(UINT64_MAX);
    uint64_t x2(UINT64_MAX - 1);
    float f1(static_cast<float>(x1));
    float f2(static_cast<float>(x2));
    std::cout << f1 << " == " << f2 << " = " << (f1 == f2) << std::endl;
    return 0;
}
CodePudding user response:
Largest uint64_t which can be accurately represented in a float
What's the right way to do this?
When FLT_RADIX == 2, we are looking for a uint64_t of the form below, where n is the maximum number of mantissa bits encodable in a float value. This is usually 24; see FLT_MANT_DIG from <float.h>.

111...(total of n binary digits)...111 000...(64-n bits, all zero)...000
// For FLT_MANT_DIG == 24: the top 24 bits set, the low 40 bits clear
0xFFFFFF0000000000
// or, computed in general:
~((1ull << (64 - FLT_MANT_DIG)) - 1)
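
As a sketch of how this constant might be used to detect a lossy uint64_t-to-float conversion (the helper name fits_in_float and the round-trip check are my own additions, assuming FLT_RADIX == 2 and an IEEE-754 float):

#include <cstdint>
#include <cfloat>
#include <iostream>

// Sketch: true if converting x to float preserves the exact value.
// Assumes FLT_RADIX == 2 and that uint64_t is exactly 64 bits.
bool fits_in_float(uint64_t x)
{
    // Largest uint64_t exactly representable in a float (from above).
    const uint64_t max_exact = ~((1ull << (64 - FLT_MANT_DIG)) - 1);
    if (x > max_exact)
        return false;  // not exactly representable; also avoids UB from
                       // converting a float rounded up to 2^64 back to uint64_t
    // Smaller values can still fall between adjacent floats,
    // so round-trip the conversion and compare.
    return static_cast<uint64_t>(static_cast<float>(x)) == x;
}

int main()
{
    std::cout << fits_in_float(UINT64_MAX) << '\n';               // 0
    std::cout << fits_in_float(uint64_t{1} << 24) << '\n';        // 1: 2^24 is exact
    std::cout << fits_in_float((uint64_t{1} << 24) + 1) << '\n';  // 0: rounds to 2^24
}

The round-trip comparison matters because the threshold alone only catches values that are too large: values such as 2^24 + 1 sit below it yet still round to a neighboring float.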
CodePudding user response:
The following function gives you the highest integer exactly representable in a floating point type such that all smaller positive integers are also exactly representable.
#include <cmath>
#include <limits>

template<typename T>
T max_representable_integer()
{
    return std::pow(T(std::numeric_limits<T>::radix), std::numeric_limits<T>::digits);
}
It does the computation in the floating-point type because, for some types, the result may not be representable in a uint64_t. For example, with an x86 80-bit long double, std::numeric_limits<long double>::digits is 64, so the result is 2^64, one past UINT64_MAX.
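
A minimal usage sketch, assuming IEEE-754 float and double and an x86-style 80-bit long double (the commented outputs follow from those formats):

#include <cmath>
#include <limits>
#include <iostream>
#include <iomanip>

template<typename T>
T max_representable_integer()
{
    return std::pow(T(std::numeric_limits<T>::radix), std::numeric_limits<T>::digits);
}

int main()
{
    std::cout << std::fixed << std::setprecision(0);
    std::cout << max_representable_integer<float>() << '\n';        // 16777216 (2^24)
    std::cout << max_representable_integer<double>() << '\n';       // 9007199254740992 (2^53)
    std::cout << max_representable_integer<long double>() << '\n';  // 18446744073709551616 (2^64 on x86)
}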