C 32 bit float to 32 bit integer conversion

Time:08-06

Recently I tried to make a "clever" macro for float (32bit) division with rounding up to unsigned int (32bit):

#include <float.h>
#include <stdint.h>

#define DIV_ROUND_UP(x, y)          (uint32_t)(((float)(x) / (float)(y)) + (float)(1.0f - FLT_EPSILON))

FLT_EPSILON is subtracted so that the value added is slightly less than 1, so that DIV_ROUND_UP(10u, 1u) results in 10u, not 11u.

I was very surprised when DIV_ROUND_UP(10u, 1u) actually returned 11u. I checked experimentally that I have to use 5 * FLT_EPSILON to get 10u, but I still don't understand why. I assumed that if I add something less than 1, it will be truncated during conversion to uint32_t. Could anyone explain why it is not truncated? And why 5 * FLT_EPSILON works?

EDIT: I ended up with the following solution, which works for positive numbers:

#define DIV_ROUND_UP(x, y) (((float)(x) / (float)(y)) > (uint32_t)((float)(x) / (float)(y)) ? \
(uint32_t)((float)(x) / (float)(y)) + 1u : \
(uint32_t)((float)(x) / (float)(y)))

CodePudding user response:

FLT_EPSILON is the distance between 1 and the next representable number. In a binary-based floating-point format, the representable numbers between 1 and 2 are spaced a distance FLT_EPSILON apart, and the numbers between 2 and 4 are twice that distance apart, and the numbers between 4 and 8 are four times FLT_EPSILON apart, and the numbers between 8 and 16 are eight times FLT_EPSILON apart, and so on. This is because a floating-point number is represented as a significand multiplied by a scale. FLT_EPSILON is the distance between numbers at the scale used for the number 1, so the distances between numbers at other scales are proportional to their scales.
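
The doubling of the spacing at each power of two can be checked with a small helper. This is only a sketch: the name spacing_at is hypothetical, and it assumes a binary float format and an argument of at least 1.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: distance between adjacent representable floats
 * in the power-of-two interval that contains x.
 * Assumes a binary float format and 1 <= x <= FLT_MAX. */
static float spacing_at(float x)
{
    assert(x >= 1.0f && x <= FLT_MAX);

    float gap = FLT_EPSILON;   /* spacing in [1, 2) */
    float upper = 2.0f;        /* exclusive upper bound of the current interval */

    /* Each time x lies beyond the current interval, the spacing doubles. */
    while (x >= upper) {
        gap *= 2.0f;
        upper *= 2.0f;
    }
    return gap;
}
```

Under these assumptions, spacing_at(1.0f) equals FLT_EPSILON and spacing_at(10.0f) equals 8 * FLT_EPSILON, since 10 lies between 8 and 16.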

At 10, the distance between representable numbers is 8•FLT_EPSILON. When you add 1.0f - FLT_EPSILON to 10, the real-number-arithmetic result would be 11 − FLT_EPSILON. But that number is not representable because the representable numbers are 8•FLT_EPSILON apart; a single FLT_EPSILON is too small a difference. The nearest representable numbers to 11 − FLT_EPSILON are 11 and 11 − 8•FLT_EPSILON. The default rule for producing a result is to produce the nearest representable number in either direction. Since 11 is closer to 11 − FLT_EPSILON than 11 − 8•FLT_EPSILON is, 11 is produced as the result.

When you add 1.0f - 5 * FLT_EPSILON, the real-number result would be 11 − 5•FLT_EPSILON. This is closer to 11 − 8•FLT_EPSILON than to 11, so 11 − 8•FLT_EPSILON is produced as the result of the addition. Then converting that to uint32_t truncates, producing 10.
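
The two cases above can be reproduced by rewriting the question's macro as a function whose margin is a parameter. This is a sketch rather than a recommended implementation: the name div_round_up_margin and the parameter k are hypothetical, and it assumes IEEE 754 single precision, the default round-to-nearest mode, and float arithmetic carried out in float precision.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Hypothetical function form of the question's macro. The addend is
 * 1 - k * FLT_EPSILON, so k = 1 is the original macro and k = 5 is the
 * variant the question found to work for 10 / 1.
 * Assumes y != 0 and a result that fits in uint32_t. */
static uint32_t div_round_up_margin(uint32_t x, uint32_t y, float k)
{
    assert(y != 0u);

    float quotient = (float)x / (float)y;
    float addend = 1.0f - k * FLT_EPSILON;
    float sum = quotient + addend;   /* rounded to the nearest float here */

    return (uint32_t)sum;            /* conversion truncates toward zero */
}
```

Under those assumptions, div_round_up_margin(10u, 1u, 1.0f) gives 11 and div_round_up_margin(10u, 1u, 5.0f) gives 10. With k = 4 the exact sum lies exactly halfway between the two neighbours, and round-to-nearest-even should pick 11, which would be consistent with the question's finding that 5 was needed.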

To get floating-point division to round up in many cases, simply use ceil (declared in <math.h>): #define DIV_ROUND_UP(x, y) ((uint32_t) ceil((float) (x) / (float) (y))). This may not work in certain cases where x/y is so slightly above an integer that the floating-point division rounds to that integer before the ceil function operates. The same can happen earlier, when converting x or y to float rounds a large value.
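
The caveat can be seen by comparing a float-based round-up with one done purely in integer arithmetic. Both functions below are hypothetical sketches, not part of the answer: float_div_round_up mirrors the comparison from the question's edit instead of calling ceil, and int_div_round_up is a different technique that avoids floating point altogether.

```c
#include <assert.h>
#include <stdint.h>

/* Float-based round-up, written like the macro in the question's edit.
 * Assumes y != 0 and a quotient that fits in uint32_t. */
static uint32_t float_div_round_up(uint32_t x, uint32_t y)
{
    assert(y != 0u);

    float quotient = (float)x / (float)y;
    uint32_t truncated = (uint32_t)quotient;

    return quotient > (float)truncated ? truncated + 1u : truncated;
}

/* Integer-only round-up: no conversion to float, so no rounding error. */
static uint32_t int_div_round_up(uint32_t x, uint32_t y)
{
    assert(y != 0u);

    return x / y + (x % y != 0u ? 1u : 0u);
}
```

For x = 16777217 and y = 16777216 (2^24 + 1 and 2^24), converting x to a 24-bit-significand float rounds it to 16777216, so float_div_round_up returns 1, while int_div_round_up returns the exact answer 2. For small operands such as 10 / 1 or 7 / 2 the two agree.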
