Are there ARM Neon instructions for a round function?

Time:10-30

I am trying to implement a round function using ARM Neon intrinsics.

This function looks like this:

float roundf(float x) {
    return signbit(x) ? ceil(x - 0.5) : floor(x + 0.5);
}

Is there a way to do this using Neon intrinsics? If not, how can I implement this function with Neon intrinsics?

edited

The use case: after multiplying two floats, I call roundf on the product (on both armv7 and armv8).

My compiler is clang.

On armv8 this can be done with vrndaq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32

How to do this on armv7?

edited

My implementation

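// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);

// mask lanes are all-ones where arg >= 0, all-zeros otherwise
uint32x4_t mask = vcgeq_f32(arg, vector_zero);
// keep -0.5 only in negative lanes (mask clear), +0.5 only in non-negative lanes
uint32x4_t mask_neg = vbicq_u32(vreinterpretq_u32_f32(neg_half), mask);
uint32x4_t mask_pos = vandq_u32(mask, vreinterpretq_u32_f32(pos_half));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_pos));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_neg));
// vcvtq_s32_f32 truncates toward zero, which completes the rounding
int32x4_t arg_int32 = vcvtq_s32_f32(arg);
arg = vcvtq_f32_s32(arg_int32);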

Is there a better way to implement this?

CodePudding user response:

It's important to define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.

From your code snippet, you are asking for commercial (symmetric) rounding, which rounds to nearest and breaks ties away from zero. For ARMv8 / ARM64, vrndaq_f32 should do that.

The SSE4.1 _mm_round_ps (with _MM_FROUND_TO_NEAREST_INT) and the ARMv8 NEON vrndnq_f32 do banker's rounding, i.e. round-to-nearest with ties going to the even integer.
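To make the difference between the two tie-breaking rules concrete, here is a small scalar sketch in plain C. The function names round_ties_away and round_ties_even are made up for illustration, and the code is only meant for small inputs; it is not a replacement for roundf or for the NEON instructions.

```c
#include <assert.h>

/* Hypothetical helper: round to nearest, ties away from zero.
   Sketch for small inputs only; the int cast requires the value to fit in int. */
static float round_ties_away(float x) {
    assert(x > -1.0e6f && x < 1.0e6f);
    /* Add +/-0.5 in the direction of the sign, then truncate toward zero. */
    return (float)(int)(x + (x < 0.0f ? -0.5f : 0.5f));
}

/* Hypothetical helper: round to nearest, ties to even (banker's rounding).
   Starts from the ties-away result and, when x was an exact tie that landed
   on an odd integer, steps one unit back toward zero. */
static float round_ties_even(float x) {
    float r = round_ties_away(x);
    float d = r - x;
    if (d < 0.0f) d = -d;               /* distance from x to the rounded value */
    if (d == 0.5f && ((int)r % 2 != 0)) /* exact tie, odd result */
        r += (x < 0.0f) ? 1.0f : -1.0f;
    return r;
}
```

With these, round_ties_away(2.5f) gives 3 while round_ties_even(2.5f) gives 2; the two only disagree on exact ties such as 0.5, 2.5 or -2.5.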

CodePudding user response:

Your solution is VERY expensive, both in cycle counts and register utilization.

Provided -(2^30) < arg < (2^30), you can do the following:

int32x4_t argi = vcvtq_n_s32_f32(arg, 1);
argi = vsraq_n_s32(argi, argi, 31);
argi = vrshrq_n_s32(argi, 1);
arg = vcvtq_f32_s32(argi);

It doesn't require any register other than arg itself, and it takes only 4 inexpensive instructions. It also works on both aarch32 and aarch64.
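As a rough sketch of how the four instructions fit together, the snippet below wraps them in a helper and pairs them with a scalar model of each step, so the arithmetic can be checked on a non-ARM host as well. The names round_half_away_model and round_half_away_x4 are hypothetical, and the scalar model assumes that >> on a negative int32_t is an arithmetic shift (as it is with GCC and Clang).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical scalar model of the four NEON steps.
   Input must lie strictly inside (-2^30, 2^30); at exactly -2^30
   the second step would overflow. */
static float round_half_away_model(float x) {
    assert(x > -1073741824.0f && x < 1073741824.0f);
    int32_t t = (int32_t)(x * 2.0f); /* vcvtq_n_s32_f32(arg, 1): 1 fraction bit, truncated */
    t += t >> 31;                    /* vsraq_n_s32(argi, argi, 31): -1 for negative lanes */
    t = (t + 1) >> 1;                /* vrshrq_n_s32(argi, 1): rounding shift right by 1 */
    return (float)t;                 /* vcvtq_f32_s32(argi) */
}

/* Hypothetical helper: rounds four floats to nearest, ties away from zero.
   Uses the NEON sequence when available, the scalar model otherwise. */
static void round_half_away_x4(const float in[4], float out[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t arg = vld1q_f32(in);
    int32x4_t argi = vcvtq_n_s32_f32(arg, 1);
    argi = vsraq_n_s32(argi, argi, 31);
    argi = vrshrq_n_s32(argi, 1);
    vst1q_f32(out, vcvtq_f32_s32(argi));
#else
    for (int i = 0; i < 4; ++i)
        out[i] = round_half_away_model(in[i]);
#endif
}
```

Walking through 2.5 and -2.5: the conversion gives 5 and -5, the shift-accumulate turns -5 into -6 and leaves 5 alone, and the rounding shift yields 3 and -3. Because the input is doubled exactly rather than offset by 0.5, a value just below a tie, such as the largest float below 0.5, still rounds to 0.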

godbolt link
