I am trying to implement round function using ARM Neon intrinsics.
This function looks like this:
float roundf(float x) {
return signbit(x) ? ceil(x - 0.5) : floor(x 0.5);
}
Is there a way to do this using Neon intrinsics? If not, how to use Neon intrinsics to implement this function?
edited
After calculating the multiplication of two floats, call roundf(on armv7 and armv8).
My compiler is clang.
this can be done with vrndaq_f32
: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32 for armv8.
How to do this on armv7?
edited
My implementation
// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);
uint32x4_t mask = vcgeq_f32(arg, vector_zero);
uint32x4_t mask_neg = vandq_u32(mask, neg_half);
uint32x4_t mask_pos = vandq_u32(mask, pos_half);
arg = vaddq_f32(arg, (float32x4_t)mask_pos);
arg = vaddq_f32(arg, (float32x4_t)mask_neg);
int32x4_t arg_int32 = vcvtq_s32_f32(arg);
arg = vcvtq_f32_s32(arg_int32);
Is there a better way to implement this?
CodePudding user response:
It's important that you define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.
From your code-snippet, you are asking for commercial or symmetric rounding which is round-away from zero for ties. For ARMv8 / ARM64, vrndaq_f32
should do that.
The SSE4
_mm_round_ps
and ARMv8 ARM-NEONvrndnq_f32
do bankers rounding i.e. round-to-nearest (even).
CodePudding user response:
Your solution is VERY expensive, both in cycle counts and register utilization.
Provided -(2^30) <= arg < (2^30)
, you can do following:
int32x4_t argi = vcvtq_n_s32_f32(arg, 1);
argi = vsraq_n_s32(argi, argi, 31);
argi = vrshrq_n_s32(argi, 1);
arg = vcvtq_f32_s32(argi);
It doesn't require any other register than arg
itself, and it will be done with 4 inexpensive instructions. And it works both for aarch32
and aarch64