What does the at (@) symbol do in numpy's vectorized C kernels?


Numpy has an amazing CPU dispatch mechanism that allows it to work out which instruction set a CPU uses at runtime. It uses that info to then "slot"/choose an optimized kernel for the requested call, i.e., machines with AVX support run an AVX optimized loop, machines with just SSE2 use SSE2, etc.

I'm trying to work out how this works under the hood, so I am reading numpy's source code (which is Python, C, and C++), and I have encountered something I can't make sense of:

// copied from numpy/core/src/umath/loops_minmax.dispatch.c.src

// contiguous input.
static inline void
simd_reduce_c_@intrin@_@sfx@(const npyv_lanetype_@sfx@ *ip, npyv_lanetype_@sfx@ *op1, npy_intp len)
{
    if (len < 1) {
        return;
    }
    const int vstep = npyv_nlanes_@sfx@;
    const int wstep = vstep*8;
    npyv_@sfx@ acc = npyv_setall_@sfx@(op1[0]);
    for (; len >= wstep; len -= wstep, ip += wstep) {
    #ifdef NPY_HAVE_SSE2
        NPY_PREFETCH(ip + wstep, 0, 3);
    #endif
        npyv_@sfx@ v0 = npyv_load_@sfx@(ip + vstep * 0);
        npyv_@sfx@ v1 = npyv_load_@sfx@(ip + vstep * 1);
        npyv_@sfx@ v2 = npyv_load_@sfx@(ip + vstep * 2);
        npyv_@sfx@ v3 = npyv_load_@sfx@(ip + vstep * 3);

        npyv_@sfx@ v4 = npyv_load_@sfx@(ip + vstep * 4);
        npyv_@sfx@ v5 = npyv_load_@sfx@(ip + vstep * 5);
        npyv_@sfx@ v6 = npyv_load_@sfx@(ip + vstep * 6);
        npyv_@sfx@ v7 = npyv_load_@sfx@(ip + vstep * 7);

        npyv_@sfx@ r01 = V_INTRIN(v0, v1);
        npyv_@sfx@ r23 = V_INTRIN(v2, v3);
        npyv_@sfx@ r45 = V_INTRIN(v4, v5);
        npyv_@sfx@ r67 = V_INTRIN(v6, v7);
        acc = V_INTRIN(acc, V_INTRIN(V_INTRIN(r01, r23), V_INTRIN(r45, r67)));
    }
    for (; len >= vstep; len -= vstep, ip += vstep) {
        acc = V_INTRIN(acc, npyv_load_@sfx@(ip));
    }
    npyv_lanetype_@sfx@ r = V_REDUCE_INTRIN(acc);
    // Scalar - finish up any remaining iterations
    for (; len > 0; --len, ++ip) {
        const npyv_lanetype_@sfx@ in2 = *ip;
        r = SCALAR_OP(r, in2);
    }
    op1[0] = r;
}

What does the @ symbol do here? As far as I know, this is not a valid character in C, so I am a bit lost. I assume it is templating magic that replaces, e.g., @sfx@ with a suffix for a vector extension, but how does this work?

CodePudding user response:

As the comments already suspected, this is indeed a substitution/templating mechanism. In this case, however, it is a homebrew one that lives inside numpy's distutils module (the module responsible for building numpy for Python < 3.12).

For the most part, it allows looping and variable substitution. A repeated block is declared using /**begin repeat ... */ and /**end repeat**/. A variable is declared inside the block comment that opens the loop using * #var = value1, value2, ...#. Nested loops are also supported and are identified by a number after repeat, e.g. /**begin repeat1 ... for the first nested level. Inside a loop, every occurrence of @var@ is then substituted with the corresponding value.

An example source file like the following

/**begin repeat
* #a = 1,2,3#
* #b = 1,2,3#
*/
/**begin repeat1
* #c = ted, jim#
*/
@a@, @b@, @c@
/**end repeat1**/
/**end repeat**/

processed with the following command

from numpy.distutils.conv_template import process_file
from pathlib import Path

generated = process_file("test.c.src")

# write somewhere
Path("test.c").write_text(generated)

produces:

#line 1 "test.c.src"

/*
 *****************************************************************************
 **       This file was autogenerated from a template  DO NOT EDIT!!!!      **
 **       Changes should be made to the original source (.src) file         **
 *****************************************************************************
 */

#line 1
#line 5
#line 8
1, 1, ted

#line 8
1, 1, jim


#line 5
#line 8
2, 2, ted

#line 8
2, 2, jim


#line 5
#line 8
3, 3, ted

#line 8
3, 3, jim





Going back to the snippet I shared in my question, the relevant bit to disentangle the code is:

/**begin repeat
 * #sfx = s8, u8, s16, u16, s32, u32, s64, u64, f32, f64#
 * #simd_chk = NPY_SIMD*8, NPY_SIMD_F32, NPY_SIMD_F64#
 * #is_fp = 0*8, 1, 1#
 * #scalar_sfx = i*8, f, d#
 */
/**begin repeat1
 * # intrin = max, min, maxp, minp#
 * # fp_only = 0, 0, 1, 1#
 */

a bit higher up in the same file.

Essentially, this means that the template will generate a whopping 40 reduction loops for the various combinations of dtypes (@sfx@) and reduction operators (@intrin@). (Note: the result is further transformed during the preprocessor stage, where macros like npyv_load_f64 are replaced by snippets that perform the requested operation depending on the instruction set being compiled for.)
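The 40 figure is simply the cross product of the two repeat loops. As a quick sanity check (the function-name pattern follows the `simd_reduce_c_@intrin@_@sfx@` signature from the snippet above; whether every pair actually survives compilation depends on the `#fp_only#`/`#simd_chk#` guards):

```python
# The outer repeat supplies 10 dtype suffixes, the inner one 4 intrinsics.
sfx = ["s8", "u8", "s16", "u16", "s32", "u32", "s64", "u64", "f32", "f64"]
intrin = ["max", "min", "maxp", "minp"]

# Each (sfx, intrin) pair yields one expanded function, e.g. simd_reduce_c_max_f32.
names = [f"simd_reduce_c_{i}_{s}" for s in sfx for i in intrin]
print(len(names))  # 40
```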
