Home > Blockchain >  How to load into __m256 from a float* but reading backwards in memory as opposed to forwards?
How to load into __m256 from a float* but reading backwards in memory as opposed to forwards?

Time:12-08

I've got an array of floats that I'd like to access in reverse order. In my non-vectorized code this is easy.

Here is a simplifed version of the data that I have.

float A[8] = {a, b, c, d, e, f, g, h};
float B[8] = {s, t, u, v, w, x, y, z};

Here is the operation I would like to do.

float C[8] = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

I'd like to be able to do some kind of load_ps operation that will give me something like this:

__m256 A_Loaded         = _mm256_load_ps(&A[0]);
                        = {a, b, c, d, e, f, g, h};

__m256 B_LoadedReversed = _mm256_loadr_ps(&B[7]);
                        = {z, y, x, w, v, u, t, s};

__m256 Output = _mm256_mul_ps(A_Loaded, B_LoadedReversed);
              = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

One of the data sources I have is a lookup table, so could be reversed if push comes to shove, but would much prefer to avoid that as that would compilcate other areas of the program.

I've currently got a botch workaround using _mm256_set_ps() and manually pointing to the data I need, but that is not as performative as I would like.

I know there is a 'reversed' _mm256_set_ps() (_mm256_setr_ps()), but there doesn't seem to be the _mm256_loadr_ps() that I need.

Any ideas and thoughts about this problem would be greatly appreciated! Thanks in advance.

CodePudding user response:

You can reverse the order inside a __m256 in two steps, using _mm256_permute_ps and _mm_256_permute2f128_ps.

  • _mm256_permute_ps allows you to permute within each "lane", the high and low 128-bit chunks.

  • _mm_256_permute2f128_ps allows you to permute 128-bit chunks across lanes.

It's something like this:

__m256 b = _mm256_loadr_ps(&B[0]);
b = _mm256_permute_ps(b, _MM_SHUFFLE(3, 2, 1, 0));
b = _mm256_permute2f128_ps(b, b, 1);

These instructions are documented in the Intel intrinsics guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

How does setr_ps work?

How does setr_ps() reverse things? It just reverses the arguments. Here's the version I pulled from my GCC installation:

extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_setr_ps (float __A, float __B, float __C, float __D,
                float __E, float __F, float __G, float __H)
{
  return _mm256_set_ps (__H, __G, __F, __E, __D, __C, __B, __A);
}

You can see, setr_ps() does not correspond to any underlying processor capability, it just reorders the arguments.

CodePudding user response:

You can't express this as a load -- all loads are "forward".

You'll have to use a shuffling operation. Something that has "permute" or "shuf" in its name. Probably _mm256_permutevar8x32_ps is a good bet for your case.

Something like this (if I didn't get indexes reversed):

__m256 B_LoadedReversed = _mm256_permutevar8x32_ps(
                              _mm256_load_ps(&B[7]),
                              _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7));

One of the parameters of such functions is vector of indices, or it is 8-bit immediate value (imm8) for some of them.

Each element of the parameter is the position of the source vector element in the destination vector. For imm8 there are 2 bit positions.

Some shuffling functions perform shuffle multiple times for subvectors of the given vectors, but not this one.

Many of the AVX shuffles don't shuffle across lanes (groups of 128 bits), but this one does.

  • Related