Program in assembly x86

Time:05-14

I recently wrote a program in C and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every asm instruction and I am probably using far too many. BTW, the program sums two integer vectors of any size. The code I have is below:

C :


#include <stdio.h>   // printf
#include <stdlib.h>  // exit

extern "C" {
    int add_vtr_asm(int*, int*, int*, int);
}

void add_vtr() {
    // 16-byte alignment is required by the movdqa loads/stores in the asm routine.
    __declspec(align(16))
        int vetor1[1024];
    __declspec(align(16))
        int vetor2[1024];
    __declspec(align(16))
        int soma[1024];
    for (int i = 0; i <= 1023; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    add_vtr_asm(vetor1, vetor2, soma, 1024);
    for (int i = 0; i <= 1023; i++) {
        printf("% d + % d = % d \n", vetor1[i], vetor2[i], soma[i]);
    }
    exit(0);
}

int main()
{

    printf("Programa para somar vetores de inteiros: \n");
    printf("Soma de vetores com % d elementos \n", 1024);
    add_vtr();
}

ASM:
 

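.686                        ; MASM needs a CPU directive before .MODEL FLAT
.XMM                        ; enable SSE/SSE2 mnemonics (movdqa, paddd)
.MODEL FLAT, C

.CODE
add_vtr_asm PROC
    push ebp
    mov ebp, esp
    push ebx                ; ebx is callee-saved in cdecl, so preserve it
    push esi
    push edi
    mov esi, [ebp+8]        ; first input vector
    mov ebx, [ebp+12]       ; second input vector
    mov edi, [ebp+16]       ; output vector
    mov ecx, [ebp+20]       ; element count
    shr ecx, 2              ; four 32-bit ints per 16-byte XMM register
next:
    movdqa xmm0, [esi]      ; aligned load: pointers must be 16-byte aligned
    add esi, 16
    paddd xmm0, [ebx]       ; four packed 32-bit additions
    add ebx, 16
    movdqa [edi], xmm0
    add edi, 16
    dec ecx
    jnz next
    pop edi
    pop esi
    pop ebx
    pop ebp
    ret
add_vtr_asm ENDP
END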

CodePudding user response:

I just coded it up in plain C++ and got this: https://godbolt.org/z/P1zPWv65b

struct Vec {
    alignas(16) int data[1024];
};

Vec add(const Vec &v1, const Vec &v2) {
    Vec v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}

Compiling with: g++ -std=c++20 -O2 -W -Wall

add(Vec const&, Vec const&):
        xor     eax, eax
.L2:
        movdqa  xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdx+rax]
        add     rax, 16
        movaps  XMMWORD PTR [rax-16+rdi], xmm0
        cmp     rax, 4096
        jne     .L2
        mov     rax, rdi
        ret

I don't see how I can improve on that. I didn't even ask for vectorization explicitly; plain -O2 already picks SIMD instructions just fine.

Note: without alignas(16) or std::array<int, 1024> the compiler uses movdqu which I assume might be slower. But I didn't test that. It's likely the whole code is limited by the memory bandwidth.
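For what it's worth, movdqa faults on an address that is not a multiple of 16, while movdqu accepts any address, which is why the alignment guarantee matters to the compiler. A minimal sketch for checking alignment at run time; `is_aligned16` is a hypothetical helper name used only for illustration:

```cpp
#include <cstdint>

// Same layout as the Vec used in this answer: the array starts on a
// 16-byte boundary because of alignas(16).
struct Vec {
    alignas(16) int data[1024];
};

// Hypothetical helper: true when p sits on a 16-byte boundary,
// which is what an aligned SSE load/store (movdqa) requires.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Calling `is_aligned16(v.data)` on any `Vec v` should return true, while a pointer offset by 4 bytes from it should not.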

Adding -funroll-loops makes the loop do 128 bytes at a time using xmm0-xmm7 but you have to measure yourself if that is better.

The C/C++ code is much more readable and will work on any CPU. A simple vector addition is so easy for the compiler to optimize that writing it in asm can only make it worse.
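If you still want explicit SIMD, intrinsics are a middle ground that stays in C++. The following is only a sketch, not a tuned implementation: `add_vectors` is a made-up name, it uses unaligned loads so it accepts any pointer, and it falls back to a scalar loop for the last few elements when the size is not a multiple of 4 (and for everything when SSE2 is not available at compile time).

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_vectors(const int* a, const int* b, int* out, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE2
    // Main loop: four 32-bit ints per iteration, same paddd as the asm version.
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi32(va, vb));
    }
#endif
    // Tail loop: the remaining 0..3 elements (or all of them without SSE2).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Whether this beats what the compiler generates from the plain loop is something you would have to measure.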

CodePudding user response:

Let us start with a benchmark here. Simplest possible C++ solution:

#include <iostream>
#include <chrono>
#include <array>
#include <numeric>
#include <iterator>

constexpr size_t VectorSize = 100'000'000u;

std::array<int, VectorSize> vector1;
std::array<int, VectorSize> vector2;
std::array<int, VectorSize> vector3;

int main() {

    std::iota(std::begin(vector1), std::end(vector1), 0);
    std::iota(std::begin(vector2), std::end(vector2), 0);

    auto tp1 = std::chrono::high_resolution_clock::now();

    for (size_t k{}; k < VectorSize; ++k)
        vector3[k] = vector1[k] + vector2[k];

    auto tp2 = std::chrono::high_resolution_clock::now();
    std::cout << "\n\nDuration for adding: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n";

    return std::accumulate(std::begin(vector3), std::end(vector3), 0);
}

On my 10-year-old Windows 7 machine this shows 162 ms (maybe not that reliable).

That works out to roughly 1.62 ns per addition. Newer machines will be much faster.
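Spelling out the arithmetic: 162 ms spread over 100,000,000 additions is 1.62 ns each. The same numbers give a rough memory-traffic estimate, assuming 4-byte ints and counting only two loads and one store per element (cache effects and write-allocate traffic are ignored). The helper names below are made up for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: nanoseconds per element for a run that took total_ms.
constexpr double ns_per_element(double total_ms, std::size_t elements) {
    return total_ms * 1e6 / static_cast<double>(elements);
}

// Hypothetical helper: approximate memory traffic in GB/s, counting
// two int loads and one int store per element and nothing else.
constexpr double traffic_gb_per_s(double total_ms, std::size_t elements) {
    const double bytes = 3.0 * sizeof(int) * static_cast<double>(elements);
    return bytes / (total_ms * 1e-3) / 1e9;
}
```

With 162 ms and 100'000'000 elements this gives 1.62 ns per element and, with 4-byte ints, about 7.4 GB/s.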

I rather doubt that any handwritten asm code would outperform that...

Please publish the result of your measurement.
