Fast CRC with PCLMULQDQ *NOT* reflected-CodePudding

I'm trying to write a PCLMULQDQ-optimized CRC-32 implementation. The specific CRC-32 variant is for one that I don't own, but am trying to support in library form. In crcany model form, it has the following parameters:

width=32 poly=0xaf init=0xffffffff refin=false refout=false xorout=0x00000000 check=0xa5fd3138 (Omitted residue which I believe is 0x00000000 but honestly don't know what it is)

A basic non-table-based/bitwise implementation of the algorithm (as generated by crcany) is:

uint32_t crc32byond_bit(uint32_t crc, void const *mem, size_t len) {
    unsigned char const *data = mem;
    if (data == NULL)
        return 0xffffffff;
    for (size_t i = 0; i < len; i  ) {
        crc ^= (uint32_t)data[i] << 24;
        for (unsigned k = 0; k < 8; k  ) {
            crc = crc & 0x80000000 ? (crc << 1) ^ 0xaf : crc << 1;
        }
    }
    return crc;
}

First off, I don't understand the paper that describes the algorithm, at a fundamental level. I don't know what something like K1 = [x^(512 64) mod P(x)] means. (What is x? Where did it come from? I don't know.) I'm a professional software engineer; not an academic. Frankly, I don't understand the notation at all, and the Wikipedia article didn't do much for me. Maybe it's my weakness in linear algebra coming back to haunt me.

I know of a few public implementations I hoped to leech off:

But:

They use the bit-reflected algorithm, and I'm not sure how to create the non-reflected variant.
They don't seem to follow the paper? For example, wuffs and crc32fast especially call out that they don't use K6, but the paper makes K6 seem necessary.

I found an Intel implementation in soft-crc but it doesn't appear to use the same constants (K4-K6? μ?)

/**
 * PCLMULQDQ CRC computation context structure
 */
struct crc_pclmulqdq_ctx {
        /**
         * K1 = reminder X^128 / P(X) : 0-63
         * K2 = reminder X^192 / P(X) : 64-127
         */
        uint64_t k1;
        uint64_t k2;

        /**
         * K3 = reminder X^64 / P(X) : 0-63
         * q  = quotient X^64 / P(X) : 64-127
         */
        uint64_t k3;
        uint64_t q;

        /**
         * p   = polynomial / P(X) : 0-63
         * res = reserved : 64-127
         */
        uint64_t p;
        uint64_t res;
};

I believe the constants I need for poly 0xAF are:

Px  = 0x1_0000_00AF
k1  = 0x0_5B5A_E0C7
k2  = 0x0_1BD8_1099
k3  = 0x0_1157_936A
k4  = 0x0_1010_1111
k5  = 0x0_0029_5F23
k6  = 0x0_0000_4455
μ   = 0x1_0000_00AF

(source: print-crc32-x86-sse42-magic-numbers.go with const px = "100000000000000000000000010101111", or 0xaf | (1 << 32))

I'm hoping for help in either

Understanding the notation and formulas used in the article (and why implementations seem to diverge from the article?),
Converting an existing implementation for the non-reflected variant, or maybe
Some pseudo-code?

CodePudding user response：

I have 6 sets of code for 16, 32, 64 bit crc, non-reflected and reflected here. The code is setup for Visual Studio. Comments have been added to the constants which were missing from Intel's github site.

https://github.com/jeffareid/crc

32 bit non-relfected is here:

https://github.com/jeffareid/crc/tree/master/crc32f

You'll need to change the polynomial in crc32fg.cpp, which generates the constants. The polynomial you want is actually:

0x00000001000000afull

what is x?

A CRC uses polynomials with 1 bit coefficients. For example 0x0B is really x^3 x 1.

The XMM registers can read|write 16 bytes | 128 bits at a time, but PCLMULQDQ can only carryless multiply two 64 bit operands to produce a 128 bit product. So the 128 bits are split up into two 64 bit operands, then each operand is multiplied by a constant to "fold" it forwards. Since the XMM registers can operate somewhat in parallel, 8 XMM registers are used to read | write 128 bytes | 1024 bits at a time. Each folding step "advances" 16 bytes | 128 bits of data forwards 128 bytes | 1024 bits, by multiplying it by constants. The lower 64 bits are multiplied by x^(1024) mod poly to create a 128 bit product that is "advanced" by 1024 bits. The upper 64 bits are multiplied by x^(1024 64) mod poly to create a 128 bit product that is advanced byte 1024 64 bits (the 64 is needed because the product is based on the upper 64 bits of 128 bits of data). The two 128 bit products are xor'ed together, and then later xor'red with data 128 bytes | 1024 bits later in the buffer.

Note that the example in the Intel document uses 4 XMM registers to advance data 64 bytes | 512 bits at a time, but the github examples I've seen and the examples I use in my github repository use 8 XMM registers and advance 128 bytes | 1024 bits at a time. For the relatively small number of processors that support AVX512 | ZMM registers, there are examples that advance 256 bytes | 2048 bits at a time. I don't own a computer with AVX512, so I don't have any code for that.

Since XMM read|writes are little endian, PSHUFB is used to reverse the bytes after each read.

The code is mostly based on 64 bit CRC code, and for a 32 bit CRC, this is handled by shifting most of the constants left 32 bits, essentially a 64 bit polynomial where the lower 32 bits are all zero. This requires adjusting the constants so instead of x^(a) mod poly, it's (x^(a-32) mod poly)<<32.

The folding process doesn't produce a CRC, it just transforms data to advance it. Eventually the code will reach the point with label _128_done: . At this point there are 128 bits of data, and since the code is based on 64 bit CRC logic, 64 bit's of zeroes are logically appended, so in effect it's a 192 bit value where the imaginary lower 64 bits are all zero. The upper 64 bits are folded forward by 128 bits, resulting in a 128 bit value in XMM7 ready to calculate a 64 bit CRC, but since this is a 32 bit CRC, XMM7 has 96 bits of data shifted left 32 bits. The upper 32 bits are folded forwards by 64 bits (in this case the 96 bit value and rk06 are both shifted left by 32 bits, so rk06 in this case folds by 64 bits (x^64 mod poly) and shifts left by 32 bits. The result is a 64 bit value shifted left 32 bits in XMM7.

The quotient of a 64 bit number divided by a 33 bit polynomial only needs the upper 32 bits of the 64 bit number, so the left shifted 64 bit value has the upper 32 bits where it will be convenient to calculate a quotient. The divide is really a multiply by x^64 / polynomial, PCLMULQDQ will just specify to use the upper 64 bit part of XMM7 to use the upper 32 bits of the left shifted 64 bit number. The actual CRC calculation is based on this logic:

quotient = upper 32 bits of 64 bit value / polynomial
product  = quotient * polynomial
CRC      = 64 bit value XOR product

The division is done by multiplying by the inverse: x^64 / poly. Since the poly and it's inverse are 33 bits, they can't be shifted left 32 bits, so the code shifts the product left 4 bytes after each multiply. The CRC ends up in bits 32 to 64 of XMM7 and pextrd eax,xmm7,1 is used to get those 32 bits.

I modified crc32fg.cpp to use CRC polynomial 0x1000000af, and this is what I get. Note there's no apparent reasoning behind the ordering of the constants. I only added comments to explain what they represent.

rk01    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk02    dq      0fafa517900000000h      ; x^(32* 5) mod P(x) << 32
rk03    dq      05cd86bb500000000h      ; x^(32*31) mod P(x) << 32
rk04    dq      0af6f37a300000000h      ; x^(32*33) mod P(x) << 32
rk05    dq      000295f2300000000h      ; x^(32* 3) mod P(x) << 32
rk06    dq      00000445500000000h      ; x^(32* 2) mod P(x) << 32
rk07    dq      000000001000000afh      ; floor(2^64/P(x))
rk08    dq      000000001000000afh      ; P(x)
rk09    dq      09bd57b5d00000000h      ; x^(32*27) mod P(x) << 32
rk10    dq      0b7a4d76400000000h      ; x^(32*29) mod P(x) << 32
rk11    dq      01ae0004200000000h      ; x^(32*23) mod P(x) << 32
rk12    dq      0e7720be600000000h      ; x^(32*25) mod P(x) << 32
rk13    dq      09c7fc8fe00000000h      ; x^(32*19) mod P(x) << 32
rk14    dq      03885faf800000000h      ; x^(32*21) mod P(x) << 32
rk15    dq      0b477ad7100000000h      ; x^(32*15) mod P(x) << 32
rk16    dq      00ac2ae3d00000000h      ; x^(32*17) mod P(x) << 32
rk17    dq      05eae9dbe00000000h      ; x^(32*11) mod P(x) << 32
rk18    dq      0784a483800000000h      ; x^(32*13) mod P(x) << 32
rk19    dq      07d21bf2000000000h      ; x^(32* 7) mod P(x) << 32
rk20    dq      0faebd3d300000000h      ; x^(32* 9) mod P(x) << 32