When I process N bytes of data with SIMD instructions (reading at least 16 bytes at once), I normally just add padding to the end of the buffer, so I can safely round the number of 16-byte blocks to read up. However, this time I need to process data prepared by external code, so in theory the last 16-byte vector of data may partially fall outside the allocated memory range.
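For reference, a minimal sketch of that padding approach, assuming I control the allocation myself (alloc_padded is just an illustrative name):

#include <cstddef>
#include <cstdlib>
#include <cstring>

/* Illustrative sketch: round the allocation up to a whole number of 16-byte
 * blocks and zero the padding, so the SIMD loop can read past the n valid
 * bytes without ever leaving the allocation. */
unsigned char* alloc_padded(std::size_t n)
{
    std::size_t padded = (n + 15) & ~static_cast<std::size_t>(15);
    unsigned char* buf = static_cast<unsigned char*>(std::malloc(padded));
    if(buf)
        std::memset(buf + n, 0, padded - n);
    return buf;
}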
For example, let's imagine I have stored 22 bytes of data, starting from 1FFF FFE4:
1FFF FFE0: 00 00 00 00 01 02 03 04 05 06 07 08 09 0A 0B 0C
1FFF FFF0: 0D 0E 0F 10 11 12 13 14 15 16 00 00 00 00 00 00
Then I want to process the data above in 16-byte blocks, starting from 1FFFFFE4, like this:
MOV RDX, 1FFFFFE4
MOV RCX, 2
@MAIN:
VMOVDQU XMM0, [RDX]
... data processing
ADD RDX, 16
LOOP @MAIN
The last iteration will read 16 bytes from 1FFFFFF4, while I have only 6 valid bytes of data there; the remaining 10 bytes are potentially out of the allocated memory range (in particular the last 4 bytes, starting at 20000000).
Can the above code fail with an access violation in the unlikely but possible situation that the last read partially exceeds the allocated memory range, or is it guaranteed not to fail as long as the first byte of the VMOVDQU argument is valid? Could anyone point me to the exact rule for this in the Intel 64 SDM (Software Developer's Manual)?
If it can fail, is there any solution other than processing the end of the data in a slower but safer way (byte by byte rather than 16 bytes at a time)? This is what I did before in such cases, but it basically means writing the code twice (a SIMD and a scalar version of the same task), which is extra work and a source of potential bugs.
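For illustration, this is roughly what that safer variant looks like; process_block16 and process_byte are hypothetical placeholders for the actual SIMD and scalar processing:

#include <cstddef>

void process_block16(const unsigned char* p); /* SIMD: handles 16 bytes */
void process_byte(unsigned char b);           /* scalar: handles 1 byte */

/* Sketch: run the SIMD path only over complete 16-byte blocks, then finish
 * the remaining 0..15 bytes one by one, so no read crosses the end of the
 * allocated range. */
void process(const unsigned char* data, std::size_t n)
{
    std::size_t i = 0;
    for(; i + 16 <= n; i += 16)
        process_block16(data + i);
    for(; i < n; ++i)
        process_byte(data[i]);
}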
As the access violation is very unlikely to happen, I'm also thinking about catching the exception, loading the data in a safe way, and jumping back. This could keep the code simple: the algorithm itself would remain unchanged, and only a small piece of code for loading the data in a safer way would need to be added, executed only in very rare situations. The code is below, but I don't know how to catch the exception in assembly, and I don't know whether the time penalty would be small enough to make sense:
VMOVDQU XMM0, [RDX]
@DATALOADED:
... data processing
ADD RDX, 16
... the rest of the algorithm
@EXCEPTION: // jumps here if the VMOVDQU fails with an access violation; happens rarely anyway
...load data in XMM0 in a safer way
JMP @DATALOADED
I'm open to any other suggestions that could keep the code simple.
CodePudding user response:
Here is my take on dealing with this. I'm using a partially overlapped final iteration (plus an optional one for the initial vector loop alignment).
The advantage of this approach is that the last few elements can be dealt with in a single extra loop iteration.
The downsides are:
- Needs a fallback if the entire array is less than 16 bytes
- May lead to costly load-store forwarding stalls in read-modify-write loops. Use it for a[i] = b[i] + c[i] but not for a[i] += b[i]. If aliasing may occur, it is easy enough to modify the code to catch the case a == b || a == c and use the fallback (see the sketch after this list)
- May need some hardware-specific tuning when ported to AVX2 or AVX512. Specifically: should the final iteration use the full 32 or 64 byte vectors, or should it only be used for the final 16 byte vector?
- Not applicable if the elements are not position-invariant within a vector register, e.g. if you do shuffling, variable shifting, etc.
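Here is a minimal sketch of that aliasing check, assuming the vector_add signature from the code further below; the scalar fallback loop is just illustrative:

#include <cstddef>

void vector_add(float* out, std::ptrdiff_t n, const float* left, const float* right); /* defined below */

/* Sketch only: if the output aliases one of the inputs (read-modify-write),
 * the overlapped final iteration would reload elements it has just stored,
 * so take a plain scalar loop instead. */
void vector_add_checked(float* out, std::ptrdiff_t n, const float* left, const float* right)
{
    if(out == left || out == right) {
        for(std::ptrdiff_t i = 0; i < n; ++i)
            out[i] = left[i] + right[i];
        return;
    }
    vector_add(out, n, left, right);
}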
I'm also tossing in an optional alignment of one of the memory locations; here I chose the output. I don't think that is particularly necessary for AVX but it uses the same technique and might come in handy if you adapt to SSE2 or AVX512.
I'm writing this in C++ with Intel intrinsics, but the assembler output is very readable if you want to adapt it into ASM.
#include <immintrin.h>
#include <cstddef>
void vector_add(float* out, std::ptrdiff_t n, const float* left, const float* right)
{
    __m128 left_i, right_i, out_i;
    std::ptrdiff_t i = 0;
    if(n >= 4) {
# ifdef ALIGN_OUTPUT
        /*
         * Optional: Do one unaligned iteration, then move the counter
         * up to the first 16-byte aligned output element
         */
        left_i = _mm_loadu_ps(left);
        right_i = _mm_loadu_ps(right);
        out_i = _mm_add_ps(left_i, right_i);
        _mm_storeu_ps(out, out_i);
        i = ((reinterpret_cast<std::ptrdiff_t>(out + 4) & ~15)
             - reinterpret_cast<std::ptrdiff_t>(out)) / sizeof(float);
# endif
        for(; n - i >= 4; i += 4) {
            left_i = _mm_loadu_ps(left + i);
            right_i = _mm_loadu_ps(right + i);
            out_i = _mm_add_ps(left_i, right_i);
# ifdef ALIGN_OUTPUT
            _mm_store_ps(out + i, out_i);
# else
            _mm_storeu_ps(out + i, out_i);
# endif
        }
        if(n - i > 0) {
            /*
             * Since we know we had at least 4 elements, we can just
             * repeat the operation for the last full vector.
             * If we use ALIGN_OUTPUT, have misaligned pointers, and n == 4,
             * then we compute the same 4 elements twice.
             * Probably not worth fixing
             */
            i = n - 4;
            left_i = _mm_loadu_ps(left + i);
            right_i = _mm_loadu_ps(right + i);
            out_i = _mm_add_ps(left_i, right_i);
            _mm_storeu_ps(out + i, out_i);
        }
        return;
    }
    /* Fallback if n <= 3 */
    if(n >= 2) {
        left_i = _mm_loadl_pi(_mm_undefined_ps(), (const __m64*) left);
        right_i = _mm_loadl_pi(_mm_undefined_ps(), (const __m64*) right);
        out_i = _mm_add_ps(left_i, right_i);
        _mm_storel_pi((__m64*) out, out_i);
        i = 2;
    }
    if(n - i >= 1)
        out[i] = left[i] + right[i];
}
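For completeness, a small usage sketch (values made up), compiled together with vector_add above. With 22 elements, the main loop processes five full vectors covering elements 0..19, and the final vector is loaded from index 18, overlapping the previous one by two elements:

#include <cstdio>

int main()
{
    float left[22], right[22], out[22];
    for(int i = 0; i < 22; ++i) {
        left[i]  = static_cast<float>(i);
        right[i] = 100.0f;
    }
    vector_add(out, 22, left, right);
    /* Elements 18 and 19 are computed twice by the overlapped final vector,
     * but the result is identical, so out[21] is still 21 + 100 = 121. */
    std::printf("%f %f\n", out[0], out[21]);
    return 0;
}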