Missed optimization with string_view::find_first

Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798

I tested the following code:

#include <string_view>

size_t findFirstE_slow(std::string_view sv) {
  return sv.find_first_of("eE");
}

size_t findFirstE_fast(std::string_view sv) {
  auto it{sv.begin()};
  for (; it != sv.end() && *it != 'e' && *it != 'E';   it)
    ;
  return it == sv.end() ? std::string_view::npos : size_t(it - sv.begin());
}

quick-bench test: https://quick-bench.com/q/dSU3EBzI8MtGOFn_WLpK3ErT3ok

Compiler Explorer output: https://godbolt.org/z/eW3sx61vz

Both findFirstE_slow() and firstFirstE_fast() functions are meant to do the same thing, but findFirstE_slow() runs significantly slower (at least 5x on the quick-bench test).

Here's the assembly output for x86-64 gcc (trunk) -std=c 20 -O3.

findFirstE_slow():

.LC0:
        .string "eE"
findFirstE_slow(std::basic_string_view<char, std::char_traits<char> >):
        push    r12
        push    rbp
        push    rbx
        test    rdi, rdi
        je      .L4
        mov     rbx, rdi
        mov     rbp, rsi
        xor     r12d, r12d
        jmp     .L3
.L8:
        add     r12, 1
        cmp     rbx, r12
        je      .L4
.L3:
        movsx   esi, BYTE PTR [rbp 0 r12]
        mov     edx, 2
        mov     edi, OFFSET FLAT:.LC0
        call    memchr
        test    rax, rax
        je      .L8
        mov     rax, r12
        pop     rbx
        pop     rbp
        pop     r12
        ret
.L4:
        mov     r12, -1
        pop     rbx
        pop     rbp
        mov     rax, r12
        pop     r12
        ret

findFirstE_fast():

findFirstE_fast(std::basic_string_view<char, std::char_traits<char> >):
        add     rdi, rsi
        cmp     rdi, rsi
        je      .L13
        mov     rax, rsi
        jmp     .L12
.L15:
        add     rax, 1
        cmp     rdi, rax
        je      .L13
.L12:
        movzx   edx, BYTE PTR [rax]
        and     edx, -33
        cmp     dl, 69
        jne     .L15
        sub     rax, rsi
        ret
.L13:
        mov     rax, -1
        ret

Interestingly, findFirstE_slow() calls memchr("eE", *current_char, 2) for every character in sv. On the other hand, findFirstE_fast() does what we would reasonably expect, by comparing each character in sv with 'e' and 'E'.

Clang generates similar output.

Question: Is there a missed optimization here for short strings like the one in my test? Am I missing something to get GCC to generate faster code?

CodePudding user response：

libstdc 's std::string_view::find_first_of looks something like:

size_type find_first_of(std::string_view v, std::size_t pos = 0) {
    if (v.empty()) return npos;
    for (; pos < size();   pos) {
        const char_type* p = traits_type::find(v.data(), v.size(), this->data()[pos]);
        if (p) return pos;
    }
    return npos;
}

You can see how traits_type::find gets transformed into memchr.

The crux of the issue is that memchr("eE", this->data()[pos], 2) != nullptr isn't compiled the same way as this->data()[pos] == 'e' || this->data()[pos] == 'E', even though the latter is much more efficient.

You can check this by trying to compile this:

constexpr unsigned char characters[] = "eE";

bool a(unsigned char* p) {
    return __builtin_memchr(characters, *p, 2);
}

bool b(unsigned char* p) {
    return *p == characters[0] || *p == characters[1];
}

This is a missed optimisation, but you can hint to the compiler to not use memchr with a custom traits type:

struct char_traits : std::char_traits<char> {
    static constexpr const char_type* find(const char_type* p, std::size_t count, const char_type& ch) {
        if (__builtin_constant_p(count) && count < 5) {
            switch (count) {
                case 0: return nullptr;
                case 1: return ch == *p ? p : nullptr;
                case 2: return ch == *p ? p : ch == *  p ? p : nullptr;
                case 3: return ch == *p ? p : ch == *  p ? p : ch == *  p ? p : nullptr;
                case 4: return ch == *p ? p : ch == *  p ? p : ch == *  p ? p : ch == *  p ? p : nullptr;
            }
        }
        return std::char_traits<char>::find(p, count, ch);
    }
};

using string_view = std::basic_string_view<char, char_traits>;

size_t findFirstE_slow(string_view sv) {
  return sv.find_first_of(characters);
}

// Also your "fast" version needs to return
//    return it == sv.end() ? string_view::npos : size_t(it - sv.begin());
// to be equivalent

(https://godbolt.org/z/bhPPxjboE)

And https://quick-bench.com/q/QVxVTxGEagUUCPuhFi9T8wjI1qQ says the slow version is now only 1.3x slower. Using a larger string (https://quick-bench.com/q/el0ukDywBNMoGsEb33PM_g4WUaY; 8000 chars before an 'e'), the difference is mostly unnoticeable.

The main difference now is that one iterates over indices and the other over pointers (returning the difference at the end). The two differing instructions in the assembly are movzx edx, BYTE PTR [rsi rax] and movzx edx, BYTE PTR [rax] sub rax, rsi, where you should find the second version is ever so slightly faster (especially asymptotically, since the subtraction happens outside the loop)