Optimize lookup tables to simple ALU-CodePudding

Question

Say you have a simple function that returns a value based on a look table for example:

See edit about assumptions.

uint32_t
lookup0(uint32_t r) {
    static const uint32_t tbl[] = { 0, 1, 2, 3 };
    if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }

    /* Can replace with: `return r`.  */
    return tbl[r];
}


uint32_t
lookup1(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }

    /* Can replace with: `return r / 2`.  */
    return tbl[r];
}

Is there any super-optimization infrastructure or algorithm that can take go from the lookup table to the optimized ALU implementation.

Motivation

The motivation is I'm building some locks for NUMA machines and want to be able to configure my code generically. Its pretty common that in NUMA locks you will need to do cpu_id -> numa_node. I can obviously setup the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can, I am hoping to generically reach a solution that will be able to cover most layouts.

Looking at how modern compilers do:

Neither clang or gcc are able to do this at the moment.

Clang is able to get lookup0 if you rewrite it as a switch/case statement.

lookup0(unsigned int):                            # @lookup0(unsigned int)
        movl