Why does the implementation of std::any use a function pointer function op codes, instead of a poi-CodePudding

Both the GCC and LLVM implementations of std::any store a function pointer in the any object and call that function with an Op/Action argument to perform different operations. Here is an example of that function from LLVM:

static void* __handle(_Action __act, any const * __this,
                          any * __other, type_info const * __info,
                          void const* __fallback_info)
    {
        switch (__act)
        {
        case _Action::_Destroy:
          __destroy(const_cast<any &>(*__this));
          return nullptr;
        case _Action::_Copy:
          __copy(*__this, *__other);
          return nullptr;
        case _Action::_Move:
          __move(const_cast<any &>(*__this), *__other);
          return nullptr;
        case _Action::_Get:
            return __get(const_cast<any &>(*__this), __info, __fallback_info);
        case _Action::_TypeInfo:
          return __type_info();
        }
        __libcpp_unreachable();
    }

Note: This is just one __handle function but there there are two such functions in each any implementation: one for small objects (Small buffer optimization) allocated within any and one for big objects allocated on the heap. Which one is used depends on the value of the function pointer stored in the any object.

The ability to choose one of two implementations at run-time and call a specific method from a pre-defined list of methods is essentially a manual implementation of a virtual table. I'm wondering why it was implemented this way. Wouldn't it have been easier to simply store a pointer to a virtual type?

I couldn't find any information about the reasons for this implementation. Thinking about it, I guess using a virtual class is sub-optimal in two ways:

It needs an object instance and managing a singleton, whereas in reality a vtable (without an instance) is enough.
Calling a function on any would involve two indirections: first through the pointer stored in any to get the vtable, then through the pointer stored in the vtable. I'm not sure if the performance of this is any different to the switch-based approach above.

Are these the reasons for using an implementation based on switch-ing op-codes? Is there any other major advantage of the current implementation? Do you know of a link to general information about this technique?

CodePudding user response：

Consider a typical use case of a std::any: You pass it around in your code, move it dozens of times, store it in a data structure and fetch it again later. In particular, you'll likely return it from functions a lot.

As it is now, the pointer to the single "do everything" function is stored right next to the data in the any. Given that it's a fairly small type (16 bytes on GCC x86-64), any fits into a pair of registers. Now, if you return an any from a function, the pointer to the "do everything" function of the any is already in a register or on the stack! You can just jump directly to it without having to fetch anything from memory. Most likely, you didn't even have to touch memory at all: You know what type is in the any at the point you construct it, so the function pointer value is just a constant that's loaded into the appropriate register. Later, you use the value of that register as your jump target. This means there's no chance for misprediction of the jump because there is nothing to predict, the value is right there for the CPU to consume.

In other words: The reason that you get the jump target for free with this implementation is that the CPU must have already touched the any in some way to obtain it in the first place, meaning that it already knows the jump target and can jump to it with no additional delay.

That means there really is no indirection to speak of with the current implementation if the any is already "hot", which it will be most of the time, especially if it's used as a return value.

On the other hand, if you use a table of function pointers somewhere in a read-only section (and let the any instance point to that instead), you'll have to go to memory (or cache) every single time you want to move or access it. The size of an any is still 16 bytes in this case but fetching values from memory is much, much slower than accessing a value in a register, especially if it's not in a cache. In a lot of cases, moving an any is as simple as copying its 16 bytes from one location to another, followed by zeroing out the original instance. This is pretty much free on any modern CPU. However, if you go the pointer table route, you'll have to fetch from memory every time, wait for the reads to complete, and then do the indirect call. Now consider that you'll often have to do a sequence of calls on the any (i.e. move, then destruct) and this will quickly add up. The problem is that you don't just get the address of the function you want to jump to for free every time you touch the any, the CPU has to fetch it explicitly. Indirect jumps to a value read from memory are quite expensive since the CPU can only retire the jump operation once the entire memory operation has finished. That doesn't just include fetching a value (which is potentially quite fast because of caches) but also address generation, store forwarding buffer lookup, TLB lookup, access validation, and potentially even page table walks. So even if the jump address is computed quickly, the jump won't retire for quite a long while. In general, "indirect-jump-to-address-from-memory" operations are among the worst things that can happen to a CPU's pipeline.

TL;DR: As it is now, returning an any doesn't stall the CPU's pipeline (the jump target is already available in a register so the jump can retire pretty much immediately). With a table-based solution, returning an any will stall the pipeline twice: Once to fetch the address of the move function, then another time to fetch the destructor. This delays retirement of the jump quite a bit since it'll have to wait not only for the memory value but also for the TLB and access permission checks.

Code memory accesses, on the other hand, aren't affected by this since the code is kept in microcode form anyway (in the µOp cache). Fetching and executing a few conditional branches in that switch statement is therefore quite fast (and even more so when the branch predictor gets things right, which it almost always does).