Can gcc emit code as efficient as clang for the binary tree "LowerBound" algorithm?-CodePudding

I have been implementing various node based binary search trees using C-ish C code. When benchmarking these I have noticed surprisingly large performance variations both across compilers and in response to small code changes.

When I focused on insertion and removal in a tree that allowed duplicates (as a C std::multiset<int> would), I found that almost all the time is spent zig-zagging down the tree's left and right pointers in operations like "find" and "lower_bound" rather than the conceptually "expensive" rebalancing steps that occur after inserts and deletes.

So I began to focus on one case in particular: lower bound.

// Node is a binary tree node.  It has the
// usual left and right links and an
// integral key.
struct Node {
    int key;
    Node* links[2];
};

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such key.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
  }
  return lower;
}

A few points and observations:

I am on an AMD Ryzen 9 5900X 12-Core. My understanding is that the conditional move (cmov) instructions are faster on AMD than on Intel, but I find that when I spot check results on my 8 year old Intel laptop the code that is faster on AMD is faster on Intel too.
I am running Linux. I've turned off hyperthreading, boost mode, and set the cpu scaling governor to "performance" using this script I wrote. The performance numbers are stable with little variation.
The code above is the end of several optimization iterations. I have a benchmark (code here) that exercises various tree sizes, allocating nodes in an array according to either a random or ascending by key order, then writes a key access pattern to another array, and runs through them repeatedly. The key access patterns are either ascending or random. In larger trees, code that uses branches, rather than cmov or similar, is often much slower.
One key optimization seems to be using an array of links (Node links[2]) in the node instead of explicit left and right pointers. With explicit fields gcc is very quick to switch to branchy code, which is slower. With the links array gcc will index it as I have written.
In fact, when I use gcc's profile guided optimization it still switches to branch based code, for a 1.5x to 2x performance loss.
In all cases, except for very tiny trees where branchy code can win, clang generates faster code for this function.

With key < key); lower = x_gte ? x : lower; x = x->links[!!x_gte]; } return lower; }'),l:'5',n:'0',o:'C++ source #1',t:'0')),k:42.61119081779054,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:g122,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'1',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:c++,libs:!((name:benchmark,ver:'161')),options:'-O3 -std=c++20',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'x86-64 gcc 12.2 (C++, Editor #1, Compiler #1)',t:'0')),k:48.77145619245085,l:'4',m:51.37594799566631,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:clang1400,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'1',libraryCode:'0',trim:'1'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:2,lang:c++,libs:!(),options:'-O3 -std=c++20',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:'1'),l:'5',n:'0',o:'x86-64 clang 14.0.0 (C++, Editor #1, Compiler #2)',t:'0')),header:(),l:'4',m:48.62405200433369,n:'0',o:'',s:0,t:'0')),k:57.38880918220949,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4" rel="nofollow noreferrer">the code above on godbolt we can see clang generating the following:

LowerBound(Node*, int):
        xorl