Optimization of image resizing (Nearest method) using SIMD


I know that the 'Nearest' method of image resizing is the fastest one. Nevertheless, I am looking for ways to speed it up further. An obvious first step is to pre-calculate the indices:

template<int colors> inline void CopyPixel(const uint8_t* src, uint8_t* dst)
{
    for (int i = 0; i < colors; ++i)
        dst[i] = src[i];
}

template<int colors> void Resize(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices.
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * colors;
        for (int dx = 0, offset = 0; dx < dstW; dx++, offset += colors)
            CopyPixel<colors>(srcY + idxX[dx], dst + offset);
        dst += dstW * colors;
    }
}
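
The index tables are filled once per resize, for example like this (a minimal sketch; idxX is assumed to hold byte offsets into a source row, matching its use above, and floor rounding is just one possible choice):

// Nearest-neighbour index tables:
for (int dy = 0; dy < dstH; dy++)
    idxY[dy] = dy * srcH / dstH;            // source row index
for (int dx = 0; dx < dstW; dx++)
    idxX[dx] = (dx * srcW / dstW) * colors; // byte offset within a source row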

Do further optimization steps exist, for example using SIMD or some other optimization technique?

P.S. I am especially interested in optimization for RGB (colors = 3). With the current code I see that an ARGB image (colors = 4) is processed about 50% faster than RGB, even though it is about 30% bigger.

CodePudding user response:

The speed problem in (SIMD-based) resize algorithms comes from the mismatch between indexing of input and output elements. When e.g. the resize factor is 6/5, one needs to consume 6 pixels and write 5. On the other hand, a SIMD register width of 16 bytes maps to either 16 grayscale elements, 4 RGBA elements or 5.33 RGB elements.

My experience is that sufficiently good performance (maybe not optimal, but often beating OpenCV and other freely available implementations) comes from trying to write 2-4 SIMD registers' worth of data at a time, reading the required number of linear bytes from the input, and using pshufb in x86 SSSE3 or vtbl in Neon to gather from the registers -- never from memory. One of course needs a fast mechanism to either calculate the LUT indices inline, or to precalculate the indices, which are shared between different output rows.

One should prepare to have several inner kernels depending on the input/output ratio of the (horizontal) resolution.

RGBrgbRGBrgbRGBr|gbRGBrgb ....  <- input
                         ^ where to load next 32 bytes of input
RGBRGBrgbRGBrgbr|gbRGBrgbRGBRGBrg| <- 32 output bytes, taken from the indices below

0000000000000000|0000001111111111| <- high bit of index
0120123456789ab9|abcdef0123423456| <- low 4 bits of index

Notice that with the LUT method one can handle all channel counts.

// inner kernel for downsampling between 1x and almost 2x
// - we need to read max 32 elements and write 16
void process_row_ds(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto b = load16_bytes(input + 16);
       auto c = load16_bytes(indices);
       a = lut32(a, b, c);      // get 16 bytes out of 32
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);  // out_width counted in multiples of 16...
}

// inner kernel for upsampling between 1x and inf
void process_row_us(uint8_t const *input, uint8_t const *indices,
                 int const *advances, uint8_t *output, int out_width) {
    do {
       auto a = load16_bytes(input);
       auto c = load16_bytes(indices);
       a = lut16(a, c);      // get 16 bytes out of 16
       store16_bytes(output, a);
       output += 16;
       input += *advances++;
    } while (out_width--);
}
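
The load16_bytes / lut32 / lut16 helpers above are pseudocode. As a sketch (assuming the index layout from the diagram: bit 4 of each index selects the register, the low 4 bits select the byte), lut32 can be built from two pshufb's on SSSE3, while lut16 is a single pshufb; on Neon a single vtbl2/tbl with the 32-byte table does the same job:

#include <tmmintrin.h> // SSSE3

static inline __m128i lut16(__m128i a, __m128i idx)   // idx in 0..15
{
    return _mm_shuffle_epi8(a, idx);
}

static inline __m128i lut32(__m128i a, __m128i b, __m128i idx) // idx in 0..31
{
    const __m128i inB   = _mm_cmpgt_epi8(idx, _mm_set1_epi8(15));      // 0xFF where idx >= 16
    const __m128i fromA = _mm_shuffle_epi8(a, _mm_or_si128(idx, inB)); // lanes with idx >= 16 are zeroed
    const __m128i fromB = _mm_shuffle_epi8(b, _mm_sub_epi8(idx, _mm_set1_epi8(16))); // lanes with idx < 16 are zeroed
    return _mm_or_si128(fromA, fromB);
}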

I would also encourage using some elementary filtering for downsampling, such as Gaussian binomial kernels (1 1, 1 2 1, 1 3 3 1, 1 4 6 4 1, ...) along with hierarchical downsampling, in addition to (at least) bilinear interpolation. It's of course possible that the application will tolerate aliasing artifacts -- the cost AFAIK is often not that large, especially given that otherwise the algorithm will be memory bound.
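
To illustrate how cheap such a prefilter can be, here is a sketch (not part of the kernels above) of a 1 2 1 binomial smoothing of 16 bytes done with two rounding averages (SSE2):

#include <emmintrin.h> // SSE2

// (a + 2*b + c) / 4 with rounding bias, where a, b, c hold the left, centre
// and right neighbours of 16 consecutive bytes of one channel.
static inline __m128i Binomial121(__m128i a, __m128i b, __m128i c)
{
    return _mm_avg_epu8(_mm_avg_epu8(a, c), b);
}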

CodePudding user response:

It’s possible to use SIMD, and I’m pretty sure it will help; unfortunately, it’s relatively hard. Below is a simplified example which only supports image enlargement, not shrinking.

Still, I hope it might be useful as a starting point.

Both MSVC and GCC compile the hot loop in the LineResize::apply method into 11 instructions. I think 11 instructions per 16 output bytes should be faster than your version.

#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <vector>
#include <array>
#include <assert.h>
#include <stdio.h>

// Implements nearest neighbor resize method for RGB24 or BGR24 bitmaps
class LineResize
{
    // Each mask produces up to 16 output bytes.
    // For enlargement exactly 16, for shrinking up to 16, possibly even 0.
    std::vector<__m128i> masks;

    // Length is the same as masks.
    // For enlargement, the values contain source pointer offsets in bytes.
    // For shrinking, the values contain destination pointer offsets in bytes.
    std::vector<uint8_t> offsets;

    // True if this class will enlarge images, false if it will shrink the width of the images.
    bool enlargement;

    void resizeFields( size_t vectors )
    {
        masks.resize( vectors, _mm_set1_epi32( -1 ) );
        offsets.resize( vectors, 0 );
    }

public:

    // Compile the shuffle table. The arguments are line widths in pixels.
    LineResize( size_t source, size_t dest );

    // Apply the algorithm to a single line of the image.
    void apply( uint8_t* rdi, const uint8_t* rsi ) const;
};

LineResize::LineResize( size_t source, size_t dest )
{
    const size_t sourceBytes = source * 3;
    const size_t destBytes = dest * 3;
    assert( sourceBytes >= 16 );
    assert( destBytes >= 16 );

    // Possible to do much faster without any integer divides.
    // Optimizing this sample for simplicity.
    if( sourceBytes < destBytes )
    {
        // Enlarging the image, each SIMD vector consumes <16 input bytes, produces exactly 16 output bytes
        enlargement = true;
        resizeFields( ( destBytes + 15 ) / 16 );

        int8_t* pMasks = (int8_t*)masks.data();
        uint8_t* const pOffsets = offsets.data();

        int sourceOffset = 0;
        const size_t countVectors = masks.size();
        for( size_t i = 0; i < countVectors; i++ )
        {
            const int destSlice = (int)i * 16;
            std::array<int, 16> lanes;
            int lane;
            for( lane = 0; lane < 16; lane   )
            {
                const int destByte = destSlice + lane;  // output byte index
                const int destPixel = destByte / 3; // output pixel index
                const int channel = destByte % 3;   // output byte within pixel
                const int sourcePixel = destPixel * (int)source / (int)dest; // input pixel
                const int sourceByte = sourcePixel * 3 + channel;   // input byte

                if( destByte < (int)destBytes )
                    lanes[ lane ] = sourceByte;
                else
                {
                    // Destination offset out of range, i.e. the last SIMD vector
                    break;
                }
            }

            // Produce the offset
            if( i == 0 )
                assert( lanes[ 0 ] == 0 );
            else
            {
                const int off = lanes[ 0 ] - sourceOffset;
                assert( off >= 0 && off <= 16 );
                pOffsets[ i - 1 ] = (uint8_t)off;
                sourceOffset = lanes[ 0 ];
            }

            // Produce the masks
            for( int j = 0; j < lane; j++ )
                pMasks[ j ] = (int8_t)( lanes[ j ] - sourceOffset );
            // The masks are initialized with _mm_set1_epi32( -1 ) = all bits set,
            // no need to handle remainder for the last vector.
            pMasks += 16;
        }
    }
    else
    {
        // Shrinking the image, each SIMD vector consumes 16 input bytes, produces <16 output bytes
        enlargement = false;
        resizeFields( ( sourceBytes + 15 ) / 16 );

        // Not implemented, but the same idea works fine for this too.
        // The only difference, instead of using offsets bytes for source offsets, use it for destination offsets.
        assert( false );
    }
}

void LineResize::apply( uint8_t * rdi, const uint8_t * rsi ) const
{
    const __m128i* pm = masks.data();
    const __m128i* const pmEnd = pm + masks.size();
    const uint8_t* po = offsets.data();
    __m128i mask, source;

    if( enlargement )
    {
        // One iteration of the loop produces 16 output bytes
        // In MSVC results in 11 instructions for 16 output bytes.
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += *po;
            po++;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += 16;
        }
    }
    else
    {
        // One iteration of the loop consumes 16 input bytes
        while( pm < pmEnd )
        {
            mask = _mm_load_si128( pm );
            pm++;

            source = _mm_loadu_si128( ( const __m128i * )( rsi ) );
            rsi += 16;

            _mm_storeu_si128( ( __m128i * )rdi, _mm_shuffle_epi8( source, mask ) );
            rdi += *po;
            po++;
        }
    }
}

// Utility method to print RGB pixel values from the vector
static void printPixels( const std::vector<uint8_t>&vec )
{
    assert( !vec.empty() );
    assert( 0 == ( vec.size() % 3 ) );

    const uint8_t* rsi = vec.data();
    const uint8_t* const rsiEnd = rsi + vec.size();
    while( rsi < rsiEnd )
    {
        const uint32_t r = rsi[ 0 ];
        const uint32_t g = rsi[ 1 ];
        const uint32_t b = rsi[ 2 ];
        rsi += 3;
        const uint32_t res = ( r << 16 ) | ( g << 8 ) | b;
        printf( "%06X ", res );
    }
    printf( "\n" );
}

// A trivial test to resize 24 pixels -> 32 pixels
int main()
{
    constexpr int sourceLength = 24;
    constexpr int destLength = 32;

    // Initialize sample input with 24 RGB pixels
    std::vector<uint8_t> input( sourceLength * 3 );
    for( size_t i = 0; i < input.size(); i++ )
        input[ i ] = (uint8_t)i;

    printf( "Input: " );
    printPixels( input );

    // The special handling of the last pixels of the last line is missing from this example.
    static_assert( 0 == destLength % 16 );
    LineResize resizer( sourceLength, destLength );

    std::vector<uint8_t> result( destLength * 3 );
    resizer.apply( result.data(), input.data() );

    printf( "Output: " );
    printPixels( result );
    return 0;
}

The code ignores alignment issues. For production, you’d want another method for the last line of the image which doesn’t run all the way to the end, and instead handles the last few pixels with scalar code.

The code has more memory references in the hot loop. However, the two vectors in that class are not too long: for 4K images they take about 12 KB (roughly 720 16-byte masks plus 720 offset bytes for a 3840-pixel row), so they should fit in L1D cache and stay there.

If you have AVX2, it will probably improve things further. For enlarging images, use _mm256_inserti128_si256: the vinserti128 instruction can load 16 bytes from memory into the high half of the vector. Similarly, for shrinking images, use _mm256_extracti128_si256: that instruction has an option to use a memory destination.
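
For example, the enlargement loop could look roughly like this with AVX2. This is only a sketch: the names pm32, pm32End and po, and the layout of the offsets table (two bytes per iteration: the second load's offset, then the advance to the next iteration) are assumptions, not part of the class above.

while( pm32 < pm32End )
{
    // Low half from the current position; the high half is loaded via vinserti128,
    // which can take its 16 bytes directly from memory.
    __m256i v = _mm256_castsi128_si256( _mm_loadu_si128( ( const __m128i* )rsi ) );
    v = _mm256_inserti128_si256( v, _mm_loadu_si128( ( const __m128i* )( rsi + po[ 0 ] ) ), 1 );

    // vpshufb on ymm shuffles each 128-bit lane independently, so each 16-byte
    // mask lane is built relative to its own 16-byte load, exactly as before.
    const __m256i mask = _mm256_load_si256( pm32 );
    _mm256_storeu_si256( ( __m256i* )rdi, _mm256_shuffle_epi8( v, mask ) );

    rdi += 32;
    rsi += po[ 1 ];
    pm32++;
    po += 2;
}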

CodePudding user response:

I think that using _mm256_i32gather_epi32 (AVX2) can give some performance gain for resizing in the case of 32-bit pixels:

inline void Gather32bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i val = _mm256_i32gather_epi32((int*)src, _idx, 1);
    _mm256_storeu_si256((__m256i*)dst, val);
}

template<> void Resize<4>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices (idxX holds byte offsets).
    int dstW8 = dstW & ~(8 - 1);//dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 4;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*4)
            Gather32bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 4)
            CopyPixel<4>(srcY + idxX[dx], dst + offset);
        dst += dstW * 4;
    }
}

P.S. After some modification this method can also be applied to RGB24:

// Compacts each 128-bit lane: four 4-byte BGRX pixels -> 12 BGR bytes (upper 4 bytes zeroed).
const __m256i K8_SHUFFLE = _mm256_setr_epi8(
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1,
    0x0, 0x1, 0x2, 0x4, 0x5, 0x6, 0x8, 0x9, 0xA, 0xC, 0xD, 0xE, -1, -1, -1, -1);
// Packs the two 12-byte groups together; only the low 24 bytes of the result are meaningful.
const __m256i K32_PERMUTE = _mm256_setr_epi32(0x0, 0x1, 0x2, 0x4, 0x5, 0x6, -1, -1);


inline void Gather24bit(const uint8_t * src, const int* idx, uint8_t* dst)
{
    __m256i _idx = _mm256_loadu_si256((__m256i*)idx);
    __m256i bgrx = _mm256_i32gather_epi32((int*)src, _idx, 1);// gather 8 pixels as 4-byte loads
    __m256i bgr = _mm256_permutevar8x32_epi32(
        _mm256_shuffle_epi8(bgrx, K8_SHUFFLE), K32_PERMUTE);
    // Note: the 32-byte store writes 8 extra (zero) bytes beyond the 24 valid ones.
    _mm256_storeu_si256((__m256i*)dst, bgr);
}

template<> void Resize<3>(const uint8_t* src, int srcW, int srcH, 
    uint8_t* dst, int dstW, int dstH)
{
    int idxY[dstH], idxX[dstW];//pre-calculated indices (idxX holds byte offsets).
    int dstW8 = dstW & ~(8 - 1);//dstW rounded down to a multiple of 8
    for (int dy = 0; dy < dstH; dy++)
    {
        const uint8_t * srcY = src + idxY[dy] * srcW * 3;
        int dx = 0, offset = 0;
        for (; dx < dstW8; dx += 8, offset += 8*3)
            Gather24bit(srcY, idxX + dx, dst + offset);
        for (; dx < dstW; dx++, offset += 3)
            CopyPixel<3>(srcY + idxX[dx], dst + offset);
        dst += dstW * 3;
    }
}

Note that if srcW < dstW then method of @Aki-Suihkonen is faster.
