What's the meaning of shifting in fixed-point arithmetic when implementing it in C ?-CodePudding

I have a problem with understanding fixed-point arithmetic and its implementation in C . I was trying to understand this code:

#define scale 16

int DoubleToFixed(double num){
    return num * ((double)(1 << scale));
}

double FixedToDoble(int num){
    return (double) num / (double)(1 << scale);
}

double IntToFixed(int num){
    return x << scale
}

I am trying to understand exactly why we shift. I know that shifting to the right is basically multiplying that number by 2^x, where x is by how many positions we want to shift or scale, and shifting to the left is basically division by 2^x.

But why do we need to shift when we convert from int to fixed point?

CodePudding user response：

A fixed-point format represents a number as an integer multiplied by a fixed scale. Commonly the scale is some base b raised to some power e, so the integer f would represent the number f•b^e.

In the code shown, the scale is 2⁻¹⁶ or 1/65,536. (Calling the the shift amount scale is a misnomer; 16, or rather −16, is the exponent.) So if the integer representing the number is 81,920, the value represented is 81,920•2⁻¹⁶ = 1.25.

The routine DoubleToFixed converts a floating-point number to this fixed-point format by multiplying by the reciprocal of the scale; it multiplies by 65,536.

The routine FixedToDouble converts a number from this fixed-format to floating-point by multiplying by the scale or, equivalently, by dividing by its reciprocal; it divides by 65,536.

IntToFixed does the same thing as DoubleToFixed except for an int input.

CodePudding user response：

Fixed point arithmatic works on the concept of representing numbers as an integer multiple of a very small "base". Your case uses a base of 1/(1<<scale), aka 1/65536, which is approximately 0.00001525878.

So the number 3.141592653589793, could be represented as 205887.416146 units of 1/65536, and so would be stored in memory as the integer value 205887 (which is really 3.14158630371, due to the rounding during conversion).

The way to calculate this conversion of fractional-value-to-fixed-point is simply to divide the value by the base: 3.141592653589793 / (1/65536) = 205887.416146. (Notably, this reduces to 3.141592653589793 * 65536 = 205887.416146). However, since this involves a power-of-two. Multiplication by a power-of-two is the same as simply left shifting by that many bits. So multiplication of 2^16, aka 65536, can be calculated faster by simply shifting left 16 bits. This is really fast, which is why most fixed-point calculations use an inverse-power-of-two as their base.

Due to the inability to shift float values, your methods convert the base to a float and does floating point multiplication, but other methods, such as the fixed-point multiplication and division themselves would be able to take advantage of this shortcut.

Theoretically, one can use shifting bits with floats to do the conversion functions faster than simply floating point multiplication, but most likely, the compiler is actually already doing that under the covers.

It is also common for some code to use an inverse-power-of-ten as their base, primarily for money, which usually uses a base of 0.01, but these cannot use a single shift as a shortcut, and have to do slower math. One shortcut for multiplying by 100 is value<<6 value<<5 value<<2 (this is effectively value*64 value*32 value*4, which is value*(64 32 4), which is value*100), but three shifts and three adds is sometimes faster than one multiplication. Compilers already do this shortcut under the covers if 100 is a compile time constant, so in general, nobody writes code like this anymore.