Can greatest integer function (floor) be implemented with bitwise operators?-CodePudding

The greatest integer function ⌊x⌋ is a function that gives the greatest integer less than or equal to a given real number x. It is also called floor.

The <math.h> header provides floor, which computes the largest integer value not greater than its floating-point argument.

I was wondering if one can implement this with the help of bitwise operators like <<, >>, or ~.

CodePudding user response：

This is sometimes possible, provided the representation of the fractional number is known and based on the binary numeration (which rules out DCB or fixed-point decimal numbers).

In the case of IEEE floats, you can extract the mantissa and exponent and based on these obtain the integer part by bitfield extraction and shifts (plus restoring the initial bit).

Needless to say, this will be complex, inefficient and little portable.

CodePudding user response：

floor() can certainly be implemented using only bit operations for the commonly used IEEE-754 binary floating-point formats, and likely for all binary floating-point formats. Because this approach results in a slow implementation, it likely has little or no practical relevance.

floor() rounds a floating-point operand to an integer using the rounding mode round-towards-negative-infinity. For positive operands, floor() behaves identically to trunc(), which also converts to an integer but uses the rounding mode round-towards-zero. Truncation basically removes the fractional bits from a floating-point operand.

The binary exponent in a IEEE-754 binary floating-point format tells us where in the significand (mantissa) pf the floating-point operand the fractional bits start, and we can simply mask them away. Very large numbers do not have any fractional bits and numbers less than 1 in magnitude have only fractional bits and thus result in zero. An exemplary implementation of truncf() for IEEE-754 binary32 mapped to float looks as follows:

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float my_truncf (float a) 
{ 
    const uint32_t FP32_EXPO_MASK = 0x7f800000u;
    const uint32_t FP32_ALL_ONES  = 0xffffffffu;
    const uint32_t FP32_MAX_EXPO  = 128;
    const uint32_t FP32_EXPO_BIAS = 127;
    const uint32_t FP32_NBRBITS_M1= 31;
    const uint32_t FP32_STORED_MANT_BITS = 23;
    uint32_t s, t, ur, ua =  float_as_uint32 (a); 

    s = ((ua & FP32_EXPO_MASK) >> FP32_STORED_MANT_BITS) - FP32_EXPO_BIAS; 
    t = (s > FP32_STORED_MANT_BITS) ? 
         ((s > FP32_MAX_EXPO) ? FP32_NBRBITS_M1 : 0) : 
         (FP32_STORED_MANT_BITS - s); 
    ur = ua & (FP32_ALL_ONES << t); 

    return uint32_as_float (ur);
}

For negative operands, trunc() differs from floor() when the operand has a non-zero fractional part. So we can construct floor() by first computing trunc(), and then conditionally subtracting 1 from the result. Obviously we want to accomplish this by bit-level manipulation rather than by a floating-point subtraction. We observe that if a floating-point operand is sufficiently large, subtracting 1 doesn't change the value. We further observe that if the magnitude of a negative floating-point operand is less than 1 but not zero, so in (-1, 0) the result of floor() is -1.

For the cases where we do need to subtract 1 the exponent of the floating point operand tells us where the least significant integer bit is located within the significand (mantissa). Subtracting 1 from the value means adding 1 to the magnitude and IEEE-754 binary floating-point formats use a sign-magnitude representation. This leads to the following exemplary implementation of floorf() for IEEE-754 binary32 mapped to float:

float my_floorf (float a) 
{
    const uint32_t FP32_SIGN_MASK = 0x80000000u;
    const uint32_t FP32_EXPO_MASK = 0x7f800000u;
    const uint32_t FP32_MANT_MASK = 0x007fffffu;
    const uint32_t FP32_INT_BIT   = 0x00800000u;
    const uint32_t FP32_NEG_ONE   = 0xbf800000u;
    const uint32_t FP32_NEG_ZERO  = 0x80000000u;
    const uint32_t FP32_ALL_ONES  = 0xffffffffu;
    const uint32_t FP32_MAX_EXPO  = 128;
    const uint32_t FP32_EXPO_NBIAS= (uint32_t)(-127);
    const uint32_t FP32_NBRBITS_M1= 31;
    const uint32_t FP32_STORED_MANT_BITS = 23;
    uint32_t is_negative, gt_m1_ne_m0, has_fraction;
    uint32_t s, t, ur, ua =  float_as_uint32 (a); 

    s = ((ua & FP32_EXPO_MASK) >> FP32_STORED_MANT_BITS)   FP32_EXPO_NBIAS; 
    t = (s > FP32_STORED_MANT_BITS) ? 
         ((s > FP32_MAX_EXPO) ? FP32_NBRBITS_M1 : 0) : 
         (FP32_STORED_MANT_BITS - s); 
    ur = ua & (FP32_ALL_ONES << t); 

    is_negative = (ua & FP32_SIGN_MASK) >> FP32_NBRBITS_M1;
    gt_m1_ne_m0 = is_negative & (s > FP32_MAX_EXPO) & ~(ua == FP32_NEG_ZERO);
    has_fraction = ((FP32_STORED_MANT_BITS > s) & 
                    ~((ua & (FP32_MANT_MASK >> s)) == 0));

    ur = ((is_negative & has_fraction & ~gt_m1_ne_m0) ? (ur   (FP32_INT_BIT >> s)) : 
          ((gt_m1_ne_m0) ? FP32_NEG_ONE : 
           ur));

    return uint32_as_float (ur);
}

A few of the operations here are of arithmetic nature rather than bit-manipulation operations. We recall that, in twos complement arithmetic, a-b = a ~b 1. We also recall how a binary addition is composed from a sum-bit vector and a carry-bit vector, where the bits in the carry-bit vector have twice the weight of the bits in the sum-bit vector: r = sum 2 * carry, where sum = a ^ b and carry = a & b. Lastly, we recall that unsigned comparisons are based on the carry-out of addition.

These fundamental facts allow us to construct helper functions add32(), gtu32(), and eq32() for addition, unsigned greater-than comparison, and equality comparison from bit-level operations. Note that comparison for (in-)equality with zero is a bit-level operation, as it simply requires ORing all the bits of the operand together. With the emulation functions in place, the implementation of floorf() above turns into the following:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t add32 (uint32_t a, uint32_t b)
{
    uint32_t sum, carry;
    carry = b;
    do {
        sum = a ^ carry;
        carry = (a & carry) << 1;
        a = sum;
    } while (carry != 0);
    return sum;
}

uint32_t gtu32 (uint32_t a, uint32_t b)
{
    uint32_t r, s;
    b = ~b;
    s = a ^ b;
    r = a & b;
    s = s >> 1;
    r = add32 (r, s);
    return r >> 31;
}

uint32_t eq32 (uint32_t a, uint32_t b)
{
    return (a ^ b) == 0;
}

float my_floorf_bit_ops_only (float a) 
{
    const uint32_t FP32_SIGN_MASK = 0x80000000u;
    const uint32_t FP32_EXPO_MASK = 0x7f800000u;
    const uint32_t FP32_MANT_MASK = 0x007fffffu;
    const uint32_t FP32_INT_BIT   = 0x00800000u;
    const uint32_t FP32_NEG_ONE   = 0xbf800000u;
    const uint32_t FP32_NEG_ZERO  = 0x80000000u;
    const uint32_t FP32_ALL_ONES  = 0xffffffffu;
    const uint32_t FP32_MAX_EXPO  = 128;
    const uint32_t FP32_EXPO_NBIAS= (uint32_t)(-127);
    const uint32_t FP32_NBRBITS_M1= 31;
    const uint32_t FP32_MANT_BITS = 24;
    const uint32_t FP32_STORED_MANT_BITS = 23;
    uint32_t is_negative, gt_m1_ne_m0, has_fraction;
    uint32_t s, t, ur, ua =  float_as_uint32 (a); 

    s = add32 ((ua & FP32_EXPO_MASK) >> FP32_STORED_MANT_BITS, FP32_EXPO_NBIAS);
    t = gtu32 (s, FP32_STORED_MANT_BITS) ? 
        (gtu32 (s, FP32_MAX_EXPO) ? FP32_NBRBITS_M1 : 0) : 
        add32 (FP32_MANT_BITS, ~s); 
    ur = ua & (FP32_ALL_ONES << t); 

    is_negative = (ua & FP32_SIGN_MASK) >> FP32_NBRBITS_M1;
    gt_m1_ne_m0 = is_negative & gtu32 (s, FP32_MAX_EXPO) & ~eq32 (ua, FP32_NEG_ZERO);
    has_fraction = (gtu32 (FP32_STORED_MANT_BITS, s) & 
                    ~((ua & (FP32_MANT_MASK >> s)) == 0));

    ur = ((is_negative & has_fraction & ~gt_m1_ne_m0) ? (add32 (ur, (FP32_INT_BIT >> s))) : 
          ((gt_m1_ne_m0) ? FP32_NEG_ONE : 
           ur));

    return uint32_as_float (ur);
}

int main (void)
{
    uint32_t uarg, ures, uref;
    float res, ref, arg;

    uarg = 0x00000000;
    do {
        arg = uint32_as_float (uarg);
        res = my_floorf_bit_ops_only (arg);
        ref = floorf (arg);
        ures = float_as_uint32 (res);
        uref = float_as_uint32 (ref);
        if (!((ures == uref) || (isnan (arg) && (ures == uarg)))) {
            printf ("ERR: arg=x res=x ref=x\n", uarg, ures, uref);
            return EXIT_FAILURE;
        }
        uarg  ;
    } while (uarg);
    return EXIT_SUCCESS;
}