I have four uint16_t variables named a, b, c, and d, and I'd like to cyclically swap them like this:
void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    uint16_t temp = a;
    a = b;
    b = c;
    c = d;
    d = temp;
}
Is there anything I can do to speed up this procedure?
CodePudding user response:
In C++ this is
#include <tuple>

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    std::tie(a, b, c, d) = std::make_tuple(b, c, d, a);
}
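For completeness, a minimal self-contained usage sketch (nothing here beyond the standard library; the printed order shows the cyclic rotation):

#include <cstdint>
#include <cstdio>
#include <tuple>

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    std::tie(a, b, c, d) = std::make_tuple(b, c, d, a);
}

int main() {
    uint16_t a = 1, b = 2, c = 3, d = 4;
    swap4(a, b, c, d);
    std::printf("%d %d %d %d\n", a, b, c, d);  // prints "2 3 4 1"
}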
CodePudding user response:
As noted, first make sure this is really a bottleneck: most compilers should generate efficient code for this (unless there's a possibility of aliasing between the arguments).
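If aliasing is the worry, one option is to promise the compiler the arguments never overlap. A sketch using the __restrict extension (supported by GCC, Clang, and MSVC on references; it is an assumption on my part that this fits your situation, and violating the no-overlap promise is undefined behaviour):

#include <cstdint>

// __restrict tells the compiler no two of these references name the
// same object, so it can reorder the loads and stores freely.
void swap4(uint16_t &__restrict a, uint16_t &__restrict b,
           uint16_t &__restrict c, uint16_t &__restrict d) {
    uint16_t temp = a;
    a = b;
    b = c;
    c = d;
    d = temp;
}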
If it happens that these 16-bit values are stored contiguously in memory (e.g. this is a four-element array), then (a) make sure they're aligned on the right boundary, and (b) you can use your CPU's shuffle instruction, an optimization your compiler might or might not find on its own. Before you go any further, check your compiler's assembly output: modern GCC with -O2 does in fact recognize this pattern automatically (https://godbolt.org/z/qo1jxnbds).
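For the alignment part, a sketch using standard C++11 alignas (the variable name is just illustrative):

#include <cstdint>

// Aligning the group to an 8-byte boundary lets the compiler move
// all four values with a single 64-bit load/store.
alignas(8) uint16_t values[4] = {1, 2, 3, 4};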
If you really want to hand-roll it, GCC provides a portable __builtin_shuffle built-in for this; for your use case you could write
typedef uint16_t quadword __attribute__ ((vector_size (8)));
quadword input = {a, b, c, d};
const quadword rotate_mask = {1, 2, 3, 0};
quadword output = __builtin_shuffle (input, rotate_mask);
(You probably don't want to write exactly that, but rather to recast your data as an array of those quadword types; see the Compiler Explorer link above for an example.)
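Putting the pieces together, a sketch of what that could look like as a self-contained function (GCC-specific; the function name rotate4 and the pointer interface are my own choices, and the memcpy calls are the portable way to move data in and out of the vector type, compiling down to plain loads/stores):

#include <cstdint>
#include <cstring>

typedef uint16_t quadword __attribute__ ((vector_size (8)));

// Rotate four contiguous uint16_t values left by one element.
void rotate4(uint16_t *p) {
    quadword v;
    std::memcpy(&v, p, sizeof v);             // load the four values
    const quadword rotate_mask = {1, 2, 3, 0};
    v = __builtin_shuffle (v, rotate_mask);   // one shuffle instruction
    std::memcpy(p, &v, sizeof v);             // store them back
}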
For x86 the underlying instruction generated by this macro is pshufb/pshufw, which (if you're not on GCC, or don't want to be portable) you could access with the _mm_shuffle_pi16 intrinsic (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=shuffle&techs=MMX,SSE&ig_expand=6426). Every modern RISC architecture offers something similar.
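If you'd rather write the intrinsic directly, here is a sketch using the SSE2 form pshuflw via _mm_shufflelo_epi16 instead, which avoids the legacy MMX register state that the _mm_shuffle_pi16/__m64 route drags in (again, the function name and pointer interface are my own):

#include <cstdint>
#include <emmintrin.h>   // SSE2

// Rotate four contiguous uint16_t values left by one element.
void rotate4_sse2(uint16_t *p) {
    __m128i v = _mm_loadl_epi64((const __m128i *)p);      // 64-bit load
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 3, 2, 1));  // a,b,c,d -> b,c,d,a
    _mm_storel_epi64((__m128i *)p, v);                    // 64-bit store
}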
CodePudding user response:
Well, the code as-is should compile to an optimal sequence of assembly instructions, assuming at least a minimal optimization level. That is, if the arguments are allowed to alias; otherwise see 273K's answer, which loads all the values before writing any of them back.
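That aliasing-safe variant looks like this (a sketch of the load-everything-first approach described there):

#include <cstdint>

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    // Read all four values before writing any of them back, so the
    // result stays correct even if some references name the same object.
    uint16_t ta = a, tb = b, tc = c, td = d;
    a = tb;
    b = tc;
    c = td;
    d = ta;
}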
But the function is so short, and involves so much indirection, that this isn't where the real optimization potential lies anyway.
Inlining it and optimizing the resulting larger chunk of code has a far greater impact, and it also lets the compiler see whether aliasing actually happens.
The good news is that, if you let it, the compiler will get that done, be it thanks to LTO, whole-program optimization, or the like (so it depends on invoking the compiler correctly), or because the definition is in the same TU as the call (which might call for putting it in a header as an inline function, or marking it static).
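Concretely, a sketch of the header option (the file name is just illustrative):

// swap4.h
#include <cstdint>

// Defining the function in a header, marked inline, makes the body
// visible in every TU that calls it, so the compiler can inline it
// without needing LTO.
inline void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    uint16_t temp = a;
    a = b;
    b = c;
    c = d;
    d = temp;
}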