Program in assembly x86

I recently wrote a program in C and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every ASM instruction and I am probably using far too many. BTW, the program sums two integer vectors of any size. The code I have is below:

C:

#include <stdio.h>   // printf
#include <stdlib.h>  // exit

extern "C" {
    int add_vtr_asm(int*, int*, int*, int);
}

void add_vtr() {

    // 16-byte alignment is required by the movdqa loads/stores in the ASM routine.
    __declspec(align(16))
        int vetor1[1024];
    __declspec(align(16))
        int vetor2[1024];
    __declspec(align(16))
        int soma[1024];
    for (int i = 0; i <= 1023; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    add_vtr_asm(vetor1, vetor2, soma, 1024);
    for (int i = 0; i <= 1023; i++) {
        printf("%d + %d = %d\n", vetor1[i], vetor2[i], soma[i]);
    }
    exit(0);
}

int main()
{
    // "Program to add integer vectors"
    printf("Programa para somar vetores de inteiros: \n");
    // "Sum of vectors with 1024 elements"
    printf("Soma de vetores com %d elementos \n", 1024);
    add_vtr();
}

ASM:
 

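.686                        ; assumed: a processor directive is needed before .MODEL FLAT
.XMM                        ; assumed: lets MASM assemble the SSE2 instructions below
.MODEL FLAT, C

.CODE
add_vtr_asm PROC
    push ebp
    mov ebp,esp
    push esi
    push edi
    push ebx                ; ebx is callee-saved in cdecl, so preserve it
    mov esi,[ebp+8]         ; first source vector
    mov ebx,[ebp+12]        ; second source vector
    mov edi,[ebp+16]        ; destination vector
    mov ecx,[ebp+20]        ; element count
    shr ecx,2               ; four 32-bit ints per 16-byte iteration
next:
    movdqa xmm0,[esi]       ; aligned load of 4 ints
    add esi,16
    paddd xmm0,[ebx]        ; add 4 ints from the second vector
    add ebx,16
    movdqa [edi],xmm0       ; aligned store of the 4 sums
    add edi,16
    dec ecx
    jnz next
    pop ebx
    pop edi
    pop esi
    pop ebp
    ret
add_vtr_asm ENDP
END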
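
A note on the "any size" part: the loop count is the element count divided by 4, so as far as I can tell the routine only covers sizes that are a non-zero multiple of 4 (leftover elements are skipped, and a count below 4 would enter the loop with ecx = 0). A minimal sketch of a plain C++ reference that handles the tail is shown here. The name add_vtr_ref is hypothetical and not part of the original program; it is only meant as something to compare the ASM output against.

```cpp
#include <cstddef>

// Hypothetical reference implementation (add_vtr_ref is not from the post).
// It walks the data in blocks of four ints, mirroring the 16-byte steps of
// the ASM loop, then finishes the 0-3 leftover elements one at a time.
void add_vtr_ref(const int* a, const int* b, int* sum, std::size_t n) {
    const std::size_t blocks = n / 4;  // same role as `shr ecx,2`
    std::size_t i = 0;
    for (std::size_t blk = 0; blk < blocks; ++blk, i += 4) {
        sum[i]     = a[i]     + b[i];
        sum[i + 1] = a[i + 1] + b[i + 1];
        sum[i + 2] = a[i + 2] + b[i + 2];
        sum[i + 3] = a[i + 3] + b[i + 3];
    }
    // Scalar tail: covers sizes that are not a multiple of 4 (and n < 4).
    for (; i < n; ++i) {
        sum[i] = a[i] + b[i];
    }
}
```

Calling add_vtr_ref(vetor1, vetor2, soma, 1024) should give the same sums as the ASM routine for the 1024-element case, which makes it usable as a correctness check before timing anything.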

CodePudding user response：

I just coded it up in plain C++ and got this: https://godbolt.org/z/P1zPWv65b

struct Vec {
    alignas(16) int data[1024];
};

Vec add(const Vec &v1, const Vec &v2) {
    Vec v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}

Compiling with: g++ -std=c++20 -O2 -W -Wall

add(Vec const&, Vec const&):
        xor     eax, eax
.L2:
        movdqa  xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdx+rax]
        add     rax, 16
        movaps  XMMWORD PTR [rax-16+rdi], xmm0
        cmp     rax, 4096
        jne     .L2
        mov     rax, rdi
        ret

I don't see how I can improve that. And that is without asking for any extra vectorization options: the plain -O2 optimization already picks SIMD instructions just fine.

Note: without alignas(16) or std::array<int, 1024>, the compiler uses movdqu, which I assume might be slower, but I didn't test that. The whole code is likely limited by memory bandwidth anyway.
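
To make the alignment point concrete: movdqa requires its memory operand to sit on a 16-byte boundary and faults otherwise, while movdqu accepts any address. A minimal sketch of how one could check a pointer before handing it to an aligned-load routine is shown here; is_aligned16 is a hypothetical helper name, not something from the code above.

```cpp
#include <cstdint>

// Same layout as the Vec used above: alignas(16) puts data on a 16-byte boundary.
struct Vec {
    alignas(16) int data[1024];
};

// Hypothetical helper: true when p is on a 16-byte boundary,
// which is the precondition for movdqa loads and stores.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

For a Vec v, is_aligned16(v.data) holds by construction, whereas a pointer such as v.data + 1 is only 4-byte aligned and would need the unaligned instruction.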

Adding -funroll-loops makes the loop process 128 bytes per iteration using xmm0-xmm7, but you have to measure for yourself whether that is actually faster.

The C/C++ code is much more readable and will work on any CPU. A vector addition is so simple for the compiler to optimize that writing it in ASM by hand can only make it worse.

CodePudding user response：

Let us start with a benchmark. Simplest possible C++ solution:

#include <iostream>
#include <chrono>
#include <array>
#include <numeric>
#include <iterator>

constexpr size_t VectorSize = 100'000'000u;

std::array<int, VectorSize> vector1;
std::array<int, VectorSize> vector2;
std::array<int, VectorSize> vector3;

int main() {

    std::iota(std::begin(vector1), std::end(vector1), 0);
    std::iota(std::begin(vector2), std::end(vector2), 0);

    auto tp1 = std::chrono::high_resolution_clock::now();

    for (size_t k{}; k < VectorSize; ++k)
        vector3[k] = vector1[k] + vector2[k];

    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration for adding: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n";

    return std::accumulate(std::begin(vector3), std::end(vector3), 0);
}

On my 10-year-old Windows 7 machine this shows (maybe not that reliably) 162 ms.

That works out to roughly 1.62 ns per addition (162 ms / 100,000,000 elements). Newer machines will be much faster.

I rather doubt that any handwritten ASM code would outperform that.

Please publish the result of your measurement.