Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798
I tested the following code:
#include <string_view>
size_t findFirstE_slow(std::string_view sv) {
return sv.find_first_of("eE");
}
size_t findFirstE_fast(std::string_view sv) {
auto it{sv.begin()};
for (; it != sv.end() && *it != 'e' && *it != 'E'; it)
;
return it == sv.end() ? std::string_view::npos : size_t(it - sv.begin());
}
quick-bench test: https://quick-bench.com/q/dSU3EBzI8MtGOFn_WLpK3ErT3ok
Compiler Explorer output: https://godbolt.org/z/eW3sx61vz
Both findFirstE_slow()
and firstFirstE_fast()
functions are meant to do the same thing, but findFirstE_slow()
runs significantly slower (at least 5x on the quick-bench test).
Here's the assembly output for x86-64 gcc (trunk) -std=c 20 -O3
.
findFirstE_slow():
.LC0:
.string "eE"
findFirstE_slow(std::basic_string_view<char, std::char_traits<char> >):
push r12
push rbp
push rbx
test rdi, rdi
je .L4
mov rbx, rdi
mov rbp, rsi
xor r12d, r12d
jmp .L3
.L8:
add r12, 1
cmp rbx, r12
je .L4
.L3:
movsx esi, BYTE PTR [rbp 0 r12]
mov edx, 2
mov edi, OFFSET FLAT:.LC0
call memchr
test rax, rax
je .L8
mov rax, r12
pop rbx
pop rbp
pop r12
ret
.L4:
mov r12, -1
pop rbx
pop rbp
mov rax, r12
pop r12
ret
findFirstE_fast():
findFirstE_fast(std::basic_string_view<char, std::char_traits<char> >):
add rdi, rsi
cmp rdi, rsi
je .L13
mov rax, rsi
jmp .L12
.L15:
add rax, 1
cmp rdi, rax
je .L13
.L12:
movzx edx, BYTE PTR [rax]
and edx, -33
cmp dl, 69
jne .L15
sub rax, rsi
ret
.L13:
mov rax, -1
ret
Interestingly, findFirstE_slow()
calls memchr("eE", *current_char, 2)
for every character in sv
.
On the other hand, findFirstE_fast()
does what we would reasonably expect, by comparing each character in sv
with 'e' and 'E'.
Clang generates similar output.
Question: Is there a missed optimization here for short strings like the one in my test? Am I missing something to get GCC to generate faster code?
CodePudding user response:
libstdc 's std::string_view::find_first_of
looks something like:
size_type find_first_of(std::string_view v, std::size_t pos = 0) {
if (v.empty()) return npos;
for (; pos < size(); pos) {
const char_type* p = traits_type::find(v.data(), v.size(), this->data()[pos]);
if (p) return pos;
}
return npos;
}
You can see how traits_type::find
gets transformed into memchr
.
The crux of the issue is that memchr("eE", this->data()[pos], 2) != nullptr
isn't compiled the same way as this->data()[pos] == 'e' || this->data()[pos] == 'E'
, even though the latter is much more efficient.
You can check this by trying to compile this:
constexpr unsigned char characters[] = "eE";
bool a(unsigned char* p) {
return __builtin_memchr(characters, *p, 2);
}
bool b(unsigned char* p) {
return *p == characters[0] || *p == characters[1];
}
This is a missed optimisation, but you can hint to the compiler to not use memchr with a custom traits type:
struct char_traits : std::char_traits<char> {
static constexpr const char_type* find(const char_type* p, std::size_t count, const char_type& ch) {
if (__builtin_constant_p(count) && count < 5) {
switch (count) {
case 0: return nullptr;
case 1: return ch == *p ? p : nullptr;
case 2: return ch == *p ? p : ch == * p ? p : nullptr;
case 3: return ch == *p ? p : ch == * p ? p : ch == * p ? p : nullptr;
case 4: return ch == *p ? p : ch == * p ? p : ch == * p ? p : ch == * p ? p : nullptr;
}
}
return std::char_traits<char>::find(p, count, ch);
}
};
using string_view = std::basic_string_view<char, char_traits>;
size_t findFirstE_slow(string_view sv) {
return sv.find_first_of(characters);
}
// Also your "fast" version needs to return
// return it == sv.end() ? string_view::npos : size_t(it - sv.begin());
// to be equivalent
(https://godbolt.org/z/bhPPxjboE)
And https://quick-bench.com/q/QVxVTxGEagUUCPuhFi9T8wjI1qQ says the slow version is now only 1.3x slower. Using a larger string (https://quick-bench.com/q/el0ukDywBNMoGsEb33PM_g4WUaY; 8000 chars before an 'e'
), the difference is mostly unnoticeable.
The main difference now is that one iterates over indices and the other over pointers (returning the difference at the end). The two differing instructions in the assembly are movzx edx, BYTE PTR [rsi rax]
and movzx edx, BYTE PTR [rax]
sub rax, rsi
, where you should find the second version is ever so slightly faster (especially asymptotically, since the subtraction happens outside the loop)