I'm trying to implement the following operation using AVX:
for (i=0; i<N; i++) {
    for (j=0; j<N; j++) {
        for (k=0; k<K; k++) {
            d[i][j] += 2 * a[i][k] * ( b[k][j] - c[k] );
        }
    }
}
for (int i=0; i<N; i++){
    f += d[ind[i]][ind[i]] / 2;
}
Where d is an NxN matrix, a is NxK, b is KxN and c is a vector of length K. All of them are doubles. Of course, all the data is aligned and I am using #pragma vector aligned
to help the compiler (gcc).
I know how to use AVX extensions with one-dimensional arrays, but it's being a bit tricky for me to do it with matrices. Currently, I have the following, but I'm not getting correct results:
for (int i=0; i< floor (N/4); i++){
    for (int j=0; j< floor (N/4); j++){
        __m256d D, A, B, C;
        D = _mm256_setzero_pd();
        #pragma vector aligned
        for (int k=0; k<K_MAX; k++){
            A = _mm256_load_pd(a[i] + k*4);
            B = _mm256_load_pd(b[k] + j*4);
            C = _mm256_load_pd(c + 4*k);
            B = _mm256_sub_pd(B, C);
            A = _mm256_mul_pd(A, B);
            D = _mm256_add_pd(_mm256_set1_pd(2.0), A);
            _mm256_store_pd(d[i] + j*4, D);
        }
    }
}
for (int i=0; i<N; i++){
    f += d[ind[i]][ind[i]] / 2;
}
I hope someone can tell me where the mistake is.
Thanks in advance.
NOTE: I'm not willing to introduce OpenMP, just using Intel SIMD instructions.
CodePudding user response:
Assuming both N and K are relatively large (much larger than 4, the number of doubles per AVX vector), here's one way to vectorize your main loop. Untested.
The main idea is to vectorize the middle loop (over j) instead of the inner one (over k). This is done for two reasons.

1. It avoids horizontal operations. When vectorizing just the inner loop, we would have to compute the horizontal sum of a vector at the end.

2. The b[k][j] load has an unfortunate RAM access pattern when loading 4 consecutive k values: it takes either 4 separate load instructions or a gather load, and both methods are relatively slow. Loading elements for 4 consecutive j values is a single full-vector load instruction, which is very efficient, especially since you align your inputs.
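For comparison, here is a rough sketch of what vectorizing the inner k loop would require for a single d[i][j]. The function name and flat-pointer layout are purely illustrative, and it assumes AVX2 for the gather, FMA, and K being a multiple of 4. Note both the gather and the final horizontal sum, which the version below avoids entirely:

#include <immintrin.h>

// Sum over k of 2*a[i][k]*(b[k][j]-c[k]) for one (i,j), vectorizing over k.
// a_row = &a[i][0] (K doubles), b is K x N row-major, c has K doubles.
double dot_inner_k( const double* a_row, const double* b, size_t N,
                    const double* c, size_t K, size_t j )
{
    // Indices of b[k][j], b[k+1][j], b[k+2][j], b[k+3][j] relative to &b[k][j]
    const __m256i idx = _mm256_setr_epi64x( 0, (long long)N, 2 * (long long)N, 3 * (long long)N );
    __m256d acc = _mm256_setzero_pd();
    for( size_t k = 0; k < K; k += 4 )
    {
        __m256d av = _mm256_loadu_pd( a_row + k );                 // a[i][k..k+3], contiguous
        __m256d bv = _mm256_i64gather_pd( b + k * N + j, idx, 8 ); // b[k..k+3][j], strided: needs a gather
        __m256d cv = _mm256_loadu_pd( c + k );                     // c[k..k+3], contiguous
        acc = _mm256_fmadd_pd( _mm256_add_pd( av, av ), _mm256_sub_pd( bv, cv ), acc );
    }
    // Horizontal sum of the 4 lanes - the extra step this approach forces
    __m128d lo = _mm256_castpd256_pd128( acc );
    __m128d hi = _mm256_extractf128_pd( acc, 1 );
    lo = _mm_add_pd( lo, hi );
    lo = _mm_add_sd( lo, _mm_unpackhi_pd( lo, lo ) );
    return _mm_cvtsd_f64( lo );
}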
const int N_aligned = ( N / 4 ) * 4;
for( int i = 0; i < N; i++ )
{
    int j = 0;
    for( ; j < N_aligned; j += 4 )
    {
        // Load 4 scalars from d
        __m256d dv = _mm256_loadu_pd( &d[ i ][ j ] );
        // Run the inner loop which only loads from RAM but never stores any data
        for( int k = 0; k < K; k++ )
        {
            __m256d av = _mm256_broadcast_sd( &a[ i ][ k ] );
            __m256d bv = _mm256_loadu_pd( &b[ k ][ j ] );
            __m256d cv = _mm256_broadcast_sd( &c[ k ] );
            // dv += 2*av*( bv - cv )
            __m256d t1 = _mm256_add_pd( av, av );  // 2*av
            __m256d t2 = _mm256_sub_pd( bv, cv );  // bv - cv
            dv = _mm256_fmadd_pd( t1, t2, dv );
        }
        // Store the updated 4 values
        _mm256_storeu_pd( &d[ i ][ j ], dv );
    }
    // Handle the remainder with scalar code
    for( ; j < N; j++ )
    {
        double ds = d[ i ][ j ];
        for( int k = 0; k < K; k++ )
            ds += 2 * a[ i ][ k ] * ( b[ k ][ j ] - c[ k ] );
        d[ i ][ j ] = ds;
    }
}
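One caveat, assuming d is supposed to accumulate from zero (the += in your scalar version suggests so) and is a contiguous N*N buffer: the code above adds to whatever is already in d, so the matrix would need to be cleared first, e.g. something like

memset( &d[ 0 ][ 0 ], 0, sizeof( double ) * N * N );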
If you want to optimize further, try unrolling the inner loop by a small factor like 2, using 2 independent accumulators initialized with _mm256_setzero_pd(), and add them after the loop. It could be that on some processors, this version stalls on the latency of the FMA instruction, instead of saturating load ports or ALUs. Multiple independent accumulators sometimes help.
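A minimal sketch of that unrolling, replacing just the body of the j loop above and only for illustration (it assumes K is even; otherwise an extra scalar step for the last k would be needed):

__m256d dv = _mm256_loadu_pd( &d[ i ][ j ] );
__m256d acc0 = _mm256_setzero_pd();
__m256d acc1 = _mm256_setzero_pd();
for( int k = 0; k < K; k += 2 )
{
    __m256d a0 = _mm256_broadcast_sd( &a[ i ][ k ] );
    __m256d a1 = _mm256_broadcast_sd( &a[ i ][ k + 1 ] );
    __m256d t0 = _mm256_sub_pd( _mm256_loadu_pd( &b[ k ][ j ] ), _mm256_broadcast_sd( &c[ k ] ) );
    __m256d t1 = _mm256_sub_pd( _mm256_loadu_pd( &b[ k + 1 ][ j ] ), _mm256_broadcast_sd( &c[ k + 1 ] ) );
    acc0 = _mm256_fmadd_pd( _mm256_add_pd( a0, a0 ), t0, acc0 ); // accumulator 0 handles even k
    acc1 = _mm256_fmadd_pd( _mm256_add_pd( a1, a1 ), t1, acc1 ); // accumulator 1 handles odd k
}
// Combine the two accumulators and the original d values only after the loop
dv = _mm256_add_pd( dv, _mm256_add_pd( acc0, acc1 ) );
_mm256_storeu_pd( &d[ i ][ j ], dv );

The two FMA chains are independent, so the CPU can overlap their latencies instead of waiting on a single serial dependency through dv.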