Should I pass __int128_t by reference or value?

Question> What is the recommended way to pass __int128_t as a function parameter?

Thank you

#include <iostream>
bool CheckInt(const __int128_t& large_number)
{
    return large_number > 10000; // Just for Demo
}

bool CheckInt2(__int128_t large_number)
{
    return large_number > 10000;
}

int main()
{
    __int128_t abc = 20000;    
    std::cout<< CheckInt(abc) << std::endl;
    std::cout<< CheckInt2(abc) << std::endl;

    return 0;
}

CodePudding user response：

Let's look at four scenarios.

These were compiled with GCC for a 64-bit x86 architecture; other compilers should produce similar results.

How the functions are compiled:

bool by_value(__int128 large_number) {
    return large_number > 10000;
}

bool by_reference(const __int128& large_number) {
    return large_number > 10000;
}

And we can see the x86 assembler output here https://godbolt.org/z/v9cM8xj35

by_value(__int128):
        mov     eax, 10000
        cmp     rax, rdi  # Use first 8 bytes
        mov     eax, 0
        sbb     rax, rsi  # Use second 8 bytes
        setl    al
        ret
by_reference(__int128 const&):
        mov     eax, 10000
        cmp     rax, QWORD PTR [rdi]    # Use first 8 bytes
        mov     eax, 0
        sbb     rax, QWORD PTR [rdi+8]  # Use second 8 bytes
        setl    al
        ret

The commented lines are the only lines that differ.

This shows the platform's calling convention (System V AMD64 here): the first 8 bytes of the argument are passed in rdi and the second 8 bytes in rsi.

When you pass by value, large_number will be stored in these two registers, and can be used quickly and efficiently.

When you pass by reference, only one register (rdi) is used, and it holds a pointer to the value. The first 8 bytes are read by dereferencing it with QWORD PTR [rdi], and the second 8 bytes with QWORD PTR [rdi+8] (some pointer arithmetic).

Passing by value will win out in most situations here. If a function has many arguments or local variables, the registers holding large_number may "spill" onto the stack, so in theory passing by value could mean more work. But a spill would probably happen whether the registers held a one-register pointer or a two-register 16-byte value, so there shouldn't be much difference in practice.
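To make the two-register picture concrete, here is a small sketch that splits an __int128 into the two 64-bit halves a by-value call would place in rdi and rsi. It is not part of the original answer: the helper names low_half and high_half are made up for illustration, and it assumes GCC or Clang, since __int128 is a compiler extension.

```cpp
#include <cstdint>

// Hypothetical helpers, named here for illustration only.

// Low 64 bits: the half a by-value call passes in rdi.
constexpr std::uint64_t low_half(__int128 v) {
    return static_cast<std::uint64_t>(v);
}

// High 64 bits: the half a by-value call passes in rsi.
constexpr std::uint64_t high_half(__int128 v) {
    return static_cast<std::uint64_t>(static_cast<unsigned __int128>(v) >> 64);
}

// 20000 fits entirely in the low half, so the high half is zero.
static_assert(low_half(20000) == 20000u, "low half holds the value");
static_assert(high_half(20000) == 0u, "high half is zero");

// A negative value sign-extends into the high half.
static_assert(high_half(-1) == UINT64_MAX, "high half is all ones");
```

The comparison against 10000 in by_value works on exactly these two halves, which is why it needs a cmp followed by an sbb.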

Calling the function with an existing __int128 variable:

bool by_value(__int128);
bool by_reference(const __int128&);

extern __int128 x;

extern bool call_by_value() {
    return by_value(x);
}

extern bool call_by_reference() {
    return by_reference(x);
}

https://godbolt.org/z/7sT8b33Ez

call_by_value():
        mov     rdi, QWORD PTR x[rip]
        mov     rsi, QWORD PTR x[rip+8]
        jmp     by_value(__int128)
call_by_reference():
        mov     edi, OFFSET FLAT:x
        jmp     by_reference(__int128 const&)

It may look like more work needs to be done in the by-value case: to call by reference, you only need to load the address of x (OFFSET FLAT:x) into edi and call the function, whereas in the by-value case the value of x has to be read into the two registers before the function can be called.

However, recall that by_reference has to indirect through the pointer to use the value. The by-reference version is just hiding the loads of x[rip] and x[rip+8] inside the function, so there isn't much difference.

Calling the function with some constant value (or something that optimizes to it):

bool call_by_value() {
    __int128 abc = 20000;
    return by_value(abc);
}

bool call_by_reference() {
    __int128 abc = 20000;
    return by_reference(abc);
}

https://godbolt.org/z/6jhEWfh6a

call_by_value():
        mov     edi, 20000  # Stores 20000 into the first register
        xor     esi, esi    # Stores 0 into the second register
        jmp     by_value(__int128)
call_by_reference():
        sub     rsp, 24
        mov     rdi, rsp  # Store current stack pointer (which will point to abc)
        mov     QWORD PTR [rsp], 20000  # Store first 8 bytes on stack
        mov     QWORD PTR [rsp+8], 0    # Store second 8 bytes on the stack
        call    by_reference(__int128 const&)
        add     rsp, 24
        ret

Calling by reference needs to do a lot: The value has to be allocated onto the stack and then a pointer to it is passed to the function.

Calling by value can just store the value into the two registers and call the function.

Calling the function with a runtime-calculated prvalue (here the "calculation" is just a copy):

bool call_by_value() {
    return by_value(+x);  // Unary plus turns the lvalue x into a prvalue copy
}

bool call_by_reference() {
    return by_reference(+x);
}

https://godbolt.org/z/vqdGEeGY9

call_by_value():
        mov     rdi, QWORD PTR x[rip]
        mov     rsi, QWORD PTR x[rip+8]
        jmp     by_value(__int128)
call_by_reference():
        sub     rsp, 24
        movdqa  xmm0, XMMWORD PTR x[rip]  # Store the value of x into a 16 byte register
        mov     rdi, rsp                  # Store current stack pointer
        movaps  XMMWORD PTR [rsp], xmm0   # Write 16 bytes to the stack pointer
        call    by_reference(__int128 const&)
        add     rsp, 24
        ret

So to pass the result of a calculation, in the by-value case the calculation can be done directly in registers. In the by-reference case, the value needs to be calculated, stored on the stack, and then a pointer to it passed.
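The extra stack store can be observed from C++ as well. The following sketch (the helper bound_address and the two wrapper functions are hypothetical names, not from the answer) reports which object a const reference ends up bound to. Passing the lvalue x binds to x itself, while passing the prvalue +x forces a temporary copy to be materialized so that the reference has something to point at.

```cpp
#include <cassert>

// Hypothetical helper: returns the address of the object the reference refers to.
const __int128* bound_address(const __int128& n) { return &n; }

__int128 x = 20000;

// Passing the lvalue x: the reference refers to x itself.
bool lvalue_binds_to_x() { return bound_address(x) == &x; }

// Passing the prvalue +x: a temporary copy is created for the duration of
// the full expression, and the reference refers to that copy, not to x.
bool prvalue_binds_to_x() { return bound_address(+x) == &x; }

void check_binding() {
    assert(lvalue_binds_to_x());
    assert(!prvalue_binds_to_x());
}
```

That temporary is the 16 bytes written to [rsp] in the call_by_reference listing.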

There is one more issue: when you only have the declaration extern bool by_reference(const __int128&); and no whole-program or link-time optimization, the compiler can't know that by_reference doesn't modify the value it is passed. After all, as long as the referenced object was not itself declared const, the function could legally look like:

bool by_reference(const __int128& large_number) {
    bool result = large_number > 10000;
    const_cast<__int128&>(large_number) = 0;  // Overwrites the caller's variable
    return result;
}

This can disable some further optimizations.
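A caller-side sketch of that effect, with hypothetical function names chosen for illustration: after a by-reference call the caller's variable may have changed, whereas a by-value callee can only touch its own copy. This assumes the caller's variable is not declared const, since modifying a const object through const_cast would be undefined behavior.

```cpp
#include <cassert>

// Hypothetical callee that casts away constness and overwrites its argument.
bool sneaky_by_reference(const __int128& large_number) {
    bool result = large_number > 10000;
    const_cast<__int128&>(large_number) = 0;  // changes the caller's object
    return result;
}

// By-value counterpart: the assignment only affects the callee's private copy.
bool plain_by_value(__int128 large_number) {
    bool result = large_number > 10000;
    large_number = 0;
    return result;
}

// Returns the caller's variable as seen after the by-reference call.
__int128 after_reference_call() {
    __int128 abc = 20000;
    sneaky_by_reference(abc);
    return abc;
}

// Returns the caller's variable as seen after the by-value call.
__int128 after_value_call() {
    __int128 abc = 20000;
    plain_by_value(abc);
    return abc;
}

void check_caller_view() {
    assert(after_reference_call() == 0);      // the callee's write is visible
    assert(after_value_call() == 20000);      // the caller's value is untouched
}
```

Because the write is visible to the caller, a compiler that cannot see the callee's body has to reload abc after the by-reference call instead of assuming it still holds 20000.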

All in all, it is better in most cases to pass by value. On other architectures, the default calling convention may pass 16-byte arguments on the stack, which would make the two cases not too different.

Some people will say that you should only pass something the size of a pointer or smaller by value, and everything else should be passed by reference. However, this fails to account for how much faster registers are than the stack.
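As a rough illustration of a looser rule of thumb, the sketch below checks whether a type is trivially copyable and no larger than two pointers. The trait by_value_candidate and the struct ThirtyTwoBytes are hypothetical, and the two-pointer threshold is an assumption based on the two-register behavior seen above on 64-bit x86. It is not a rule from the C++ standard or from any ABI document, and it does not capture every case (some 16-byte types are still passed in memory).

```cpp
#include <type_traits>

// Hypothetical heuristic: a type is a by-value candidate if copying it is a
// plain bit copy and it is no larger than two pointers (the size of two
// 8-byte general-purpose registers on x86-64).
template <typename T>
constexpr bool by_value_candidate() {
    return std::is_trivially_copyable<T>::value &&
           sizeof(T) <= 2 * sizeof(void*);
}

// Illustrative larger type that exceeds the two-register budget.
struct ThirtyTwoBytes { char bytes[32]; };

static_assert(sizeof(__int128) == 16, "__int128 occupies 16 bytes");
static_assert(by_value_candidate<__int128>(), "fits in two registers");
static_assert(!by_value_candidate<ThirtyTwoBytes>(), "too large for two registers");
```

Under this looser check __int128 still qualifies for pass-by-value, whereas the pointer-size rule would send it to a reference.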

This was based on analysis of the assembly, not on actual timings. You would probably have to call a function many, many times for this to make a difference.