I recently made a program with C and ASM. Can anyone help me make this code more efficient, in the ASM part or in both? I would really appreciate it, because I don't know every ASM instruction and I am probably using way too many. BTW, the program sums two integer vectors of any size. The code I have is below:
C:
#include <stdio.h>
#include <stdlib.h>

extern "C" {
int add_vtr_asm(int*, int*, int*, int);
}
void add_vtr() {
__declspec(align(16))
int vetor1[1024];
__declspec(align(16))
int vetor2[1024];
__declspec(align(16))
int soma[1024];
for (int i = 0; i <= 1023; i++) {
vetor1[i] = i;
vetor2[i] = i;
}
add_vtr_asm(vetor1, vetor2, soma, 1024);
for (int i = 0; i <= 1023; i++) {
printf("% d % d = % d \n",vetor1[i] ,vetor2[i], soma[i]);
}
exit(0);
}
int main()
{
printf("Programa para somar vetores de inteiros: \n");
printf("Soma de vetores com % d elementos \n", 1024);
add_vtr();
}
ASM:
.MODEL FLAT, C
.CODE
add_vtr_asm PROC
push ebp
mov ebp,esp
push esi
push edi
push ebx                ; ebx is callee-saved and must be preserved
mov esi, [ebp+8]        ; first source vector
mov ebx, [ebp+12]       ; second source vector
mov edi, [ebp+16]       ; destination vector
mov ecx, [ebp+20]       ; element count
shr ecx, 2              ; process 4 ints per 128-bit iteration
next:
movdqa xmm0, [esi]      ; load 4 ints from the first vector
add esi, 16
paddd xmm0, [ebx]       ; add 4 ints from the second vector
add ebx, 16
movdqa [edi], xmm0      ; store 4 sums
add edi, 16
dec ecx
jnz next
pop ebx
pop edi
pop esi
pop ebp
ret
add_vtr_asm ENDP
END
CodePudding user response:
I just coded it up in simple C++ and get this: https://godbolt.org/z/P1zPWv65b
struct Vec {
alignas(16) int data[1024];
};
Vec add(const Vec &v1, const Vec &v2) {
Vec v;
for (int i = 0; i < 1024; ++i) {
v.data[i] = v1.data[i] + v2.data[i];
}
return v;
}
Compiling with: g++ -std=c++20 -O2 -W -Wall
add(Vec const&, Vec const&):
xor eax, eax
.L2:
movdqa xmm0, XMMWORD PTR [rsi+rax]
paddd xmm0, XMMWORD PTR [rdx+rax]
add rax, 16
movaps XMMWORD PTR [rax-16+rdi], xmm0
cmp rax, 4096
jne .L2
mov rax, rdi
ret
I don't see how I can improve that. It isn't even hand-vectorized: the plain optimization already picks SIMD just fine.
Note: without alignas(16) or std::array<int, 1024> the compiler uses movdqu, which I assume might be slower. But I didn't test that. It's likely the whole code is limited by memory bandwidth anyway.
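To make that note concrete, here is a minimal sketch (my addition, not part of the original answer) of the same function without the alignment guarantee; with the same g++ -O2 flags the compiler can no longer assume 16-byte alignment for data and will typically fall back to movdqu or peel the loop, depending on the compiler version:
struct VecUnaligned {
    int data[1024];   // no alignas(16): only 4-byte alignment is guaranteed
};

VecUnaligned add_unaligned(const VecUnaligned &v1, const VecUnaligned &v2) {
    VecUnaligned v;
    for (int i = 0; i < 1024; ++i) {
        v.data[i] = v1.data[i] + v2.data[i];
    }
    return v;
}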
Adding -funroll-loops makes the loop do 128 bytes at a time using xmm0-xmm7, but you have to measure yourself whether that is better.
The C/C++ code is much more readable and will work on any CPU. A simple vector addition is so easy for the compiler to optimize that writing it in asm can only make it worse.
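If you do want explicit SIMD without dropping to raw assembly, a middle ground is SSE2 intrinsics. The sketch below is my own illustration (not from the answer above); _mm_load_si128, _mm_add_epi32 and _mm_store_si128 come from <emmintrin.h> and compile to essentially the movdqa/paddd/movdqa sequence shown earlier, while the compiler still handles register allocation and unrolling:
#include <emmintrin.h>  // SSE2 intrinsics

// Assumes a, b and out are 16-byte aligned and n is a multiple of 4,
// like the asm routine in the question.
void add_vtr_sse2(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i *>(b + i));
        _mm_store_si128(reinterpret_cast<__m128i *>(out + i), _mm_add_epi32(va, vb));
    }
}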
CodePudding user response:
Let us start with some benchmark here. The simplest possible C++ solution:
#include <iostream>
#include <chrono>
#include <array>
#include <numeric>
#include <iterator>
constexpr size_t VectorSize = 100'000'000u;
std::array<int, VectorSize> vector1;
std::array<int, VectorSize> vector2;
std::array<int, VectorSize> vector3;
int main() {
std::iota(std::begin(vector1), std::end(vector1), 0);
std::iota(std::begin(vector2), std::end(vector2), 0);
auto tp1 = std::chrono::high_resolution_clock::now();
for (size_t k{}; k < VectorSize; ++k)
vector3[k] = vector1[k] + vector2[k];
auto tp2 = std::chrono::high_resolution_clock::now();
std::cout << "\n\nDuration for adding: " << std::chrono::duration_cast<std::chrono::milliseconds>(tp2 - tp1).count() << "ms\n\n";
return std::accumulate(std::begin(vector3), std::end(vector3), 0);
}
This shows (maybe not all that reliable) 162 ms on my 10-year-old Windows 7 machine, which works out to roughly 1.62 ns per addition. Newer machines will be much faster.
I slightly doubt that any handwritten asm code can outperform that . . .
Please publish the result of your measurement.
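For reference, a comparable measurement of the asm routine from the question could look like the sketch below (my own illustration, reusing the add_vtr_asm prototype and the 1024-element arrays from the original post; the call is repeated many times because a single pass over 1024 ints finishes far below the clock's resolution):
#include <chrono>
#include <cstdio>

extern "C" int add_vtr_asm(int*, int*, int*, int);

__declspec(align(16)) int vetor1[1024];
__declspec(align(16)) int vetor2[1024];
__declspec(align(16)) int soma[1024];

int main() {
    for (int i = 0; i < 1024; i++) {
        vetor1[i] = i;
        vetor2[i] = i;
    }
    constexpr int reps = 1000000;   // repeat so the total time is measurable
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int r = 0; r < reps; r++)
        add_vtr_asm(vetor1, vetor2, soma, 1024);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    std::printf("%.3f ns per element addition\n", (double)ns / ((double)reps * 1024.0));
    return soma[1023];   // use the result so the work stays observable
}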