I have a simple question about C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not noticeably faster than the single- or double-precision versions. I tested half, single, and double precision with very simple code that just adds numbers in a loop. Half precision is actually slower than single or double, and single is about the same as double.
#include <stdio.h>
#include <time.h>

typedef double FP;
// double   - double precision
// float    - single precision
// _Float16 - half precision

int main(int argc, const char * argv[]) {
    float time;
    clock_t start1, end1;

    start1 = clock();
    int i;
    FP temp = 0;
    for (i = 0; i < 100; i++) {
        temp = temp + i;
    }
    end1 = clock();

    time = (double)(end1 - start1) / CLOCKS_PER_SEC;
    printf("[] %.16f\n", time);
    return 0;
}
I expected half precision to be much faster than single or double precision. How can I check that half precision is faster than float, and that float is faster than double?
Please help me.
CodePudding user response:
Here is an eminently surprising fact about floating point:
Single-precision (float) arithmetic is not necessarily faster than double precision.
How can this be? Floating-point arithmetic is hard, so doing it with twice the precision is at least twice as hard and must take longer, right?
Well, no. Yes, it's more work to compute with higher precision, but as long as the work is being done by dedicated hardware (by some kind of floating point unit, or FPU), everything is probably happening in parallel. Double precision may be twice as hard, and there may therefore be twice as many transistors devoted to it, but it doesn't take any longer.
In fact, if you're on a system with an FPU that supports both single- and double-precision floating point, a good rule is: always use double. The reason for this rule is that type float is often inadequately accurate. So if you always use double, you'll quite often avoid numerical inaccuracies (that would kill you if you used float), but it won't be any slower.
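As a quick illustration of that accuracy gap (my own example, not part of the argument above): a float silently drops an update that a double keeps, because float runs out of significand bits at 2^24.

#include <stdio.h>

int main(void) {
    // 2^24 is the point beyond which float can no longer represent every integer exactly.
    float  f = 16777216.0f;   // 2^24
    double d = 16777216.0;

    // Adding 1 is lost in float (there is no representable value between 2^24 and 2^24 + 2),
    // but double keeps it with room to spare.
    printf("float : %.1f\n", f + 1.0f);   // prints 16777216.0
    printf("double: %.1f\n", d + 1.0);    // prints 16777217.0
    return 0;
}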
Now, everything I've said so far assumes that your FPU does support the types you care about, in hardware. If there's a floating-point type that's not supported in hardware, if it has to be emulated in software, it's obviously going to be slower, often much slower. There are at least three areas where this effect manifests:
- If you're using a microcontroller with no FPU at all, it's common for all floating point to be implemented in software, and to be painfully slow. (I think it's also common for double precision to be even slower there, meaning that float may be advantageous in that case.)
- If you're using a nonstandard or less-than-standard type that is, for that reason, implemented in software, it's obviously going to be slower. In particular: the FPUs I'm familiar with don't support a half-precision (16-bit) floating point type, so yes, it wouldn't be surprising if it was significantly slower than regular float or double. (One compile-time way to check what your compiler thinks is sketched after this list.)
- Some GPUs have good support for single or half precision, but poor or no support for double.
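Here is a minimal sketch of that compile-time check, assuming an ACLE-conforming AArch64 compiler (clang or GCC); __ARM_FEATURE_FP16_SCALAR_ARITHMETIC is the macro ACLE documents for scalar FP16 arithmetic support, but verify the exact name against your toolchain:

#include <stdio.h>

int main(void) {
    // Defined by ACLE-conforming compilers when scalar FP16 arithmetic instructions
    // are enabled for the target; if it's absent, _Float16 math is typically done by
    // converting to float, operating, and converting back.
#if defined(__ARM_FEATURE_FP16_SCALAR_ARITHMETIC)
    printf("hardware half-precision arithmetic enabled\n");
#else
    printf("half-precision likely emulated via float conversions\n");
#endif
    return 0;
}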
CodePudding user response:
I've extracted the relevant part of your code into a C++ function template so it can be easily instantiated for each type:
template<typename T>
T calc() {
    T sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += i;
    }
    return sum;
}
Compiling this in Clang with optimisations (-O3) and looking at the assembly listing on Godbolt suggests that:
- the double version has the fewest instructions (4) in the inner loop
- the float version has 5 instructions in the inner loop, and looks basically comparable to the double version
- the _Float16 version has 9 instructions in the inner loop, so it is likely the slowest; the extra instructions are fcvt, which convert between the float16 and float32 formats
Note that counting instructions is only a rough guide to performance! For example, some instructions take multiple cycles to execute, and pipelined execution means that multiple instructions can be executed in parallel.
Clang's language extension docs suggest that _Float16 is supported on ARMv8.2a, and the M1 appears to be ARMv8.4, so presumably it also supports this. I'm not sure how to enable this in Godbolt though, sorry!
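As a rough sketch of how one might ask for native FP16 arithmetic (the flags here are assumptions to verify against your toolchain: -march=armv8.2-a+fp16 is what GCC/Clang document for generic AArch64, and a native build on an M1 may already target a new enough architecture), compiling something like the following and inspecting the assembly should show whether the fcvt round trips disappear:

// half_add.c
// Possible compile commands (assumptions; verify for your setup):
//   clang -O3 -march=armv8.2-a+fp16 half_add.c   // generic AArch64 with FP16 arithmetic enabled
//   clang -O3 half_add.c                         // a native build on an M1 Mac may already enable it
#include <stdio.h>

_Float16 calc_half(void) {
    _Float16 sum = 0;
    for (int i = 0; i < 100; i++) {
        sum += (_Float16)i;   // with FP16 arithmetic enabled this can stay in half precision
    }
    return sum;
}

int main(void) {
    printf("%f\n", (double)calc_half());
    return 0;
}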
I'd use clock_gettime(CLOCK_MONOTONIC) for high-precision (i.e. nanosecond) timing under Linux. OSX doesn't appear to make this available, but there seem to be alternatives; see "Monotonic clock on OSX".
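For example, here is a minimal timing harness along those lines, assuming a POSIX system where CLOCK_MONOTONIC is available (Linux certainly; check your platform), and using far more than 100 iterations so the loop actually takes a measurable amount of time:

#include <stdio.h>
#include <time.h>

typedef double FP;   // change to float or _Float16 to compare precisions

// Monotonic timestamp in seconds.
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(void) {
    volatile FP sum = 0;              // volatile keeps the loop from being optimised away
    double t0 = now_sec();
    for (int i = 0; i < 100000000; i++) {
        sum = sum + (FP)i;
    }
    double t1 = now_sec();
    printf("sum = %f, elapsed = %.9f s\n", (double)sum, t1 - t0);
    return 0;
}

Changing the FP typedef lets you time each precision with the same harness; with _Float16 the sum will overflow to infinity, but the elapsed time is still meaningful.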