How to populate a pixel buffer much faster?

As part of a hobby project, I'm working on a 2D game engine that will draw each pixel every frame, using a color from a palette. I am looking for a way to do that while maintaining a reasonable frame rate (60fps being the minimum). Without any game logic in place, I am updating the values of my pixels with some value from the palette.

I'm currently taking the mod of an index, to (hopefully) prevent the compiler from doing some loop-optimisation it could do with a fixed value. Below is my (very naive?) implementation of updating the bytes in the pixel array.

On an iPhone 12 Pro, each run of updating all pixel values takes on average 43 ms, while on a simulator running on an M1 Mac, it takes 15 ms. Both are unacceptable, as that would leave no time for any additional game logic (which would take many more operations than taking the mod of an Int).

I was planning to look into Metal and set up a surface, but clearly the bottleneck here is the CPU, so if I can optimize this code, I could go for a higher-level framework. Any suggestions on a performant way to write this many bytes much, much faster (parallelisation is not an option)?

Instruments shows that most of the time is being spent in Swift's IndexingIterator.next() function. Maybe there is a way to reduce the time spent there, as there is quite a substantial subtree inside it.

struct BGRA
{
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}

let BGRAPallet =
[
    BGRA(blue: 124, green: 124, red: 124, alpha: 0xff),
    BGRA(blue: 252, green: 0, red: 0, alpha: 0xff),
// ... 62 more values in my code, omitting here for brevity
]

private func test()
{
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start = Date.now
    for _ in 0 ..< runCount
    {
        for index in 0 ..< pixelBufferPtr.count
        {
            pixelBufferPtr[index] = BGRAPallet[index % BGRAPallet.count]
        }
    }
    let elapsed = Date.now.timeIntervalSince(start)
    print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}

CodePudding user response：

First of all, I don't believe you're testing an optimized build for two reasons:

You say “This was measured with optimization set to Fastest [-O3].” But the Swift compiler doesn't recognize -O3 as a command-line flag. The C/C++ compiler recognizes that flag. For Swift the flags are -Onone, -O, -Osize, and -Ounchecked.
I ran your code on my M1 Max MacBook Pro in Debug configuration and it reported 15ms. Then I ran it in Release configuration and it reported 0ms. I had to increase the screen size to 2560x2400 (100x the pixels) to get it to report a time of 3ms.
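
One way to double-check at run time which kind of build is being measured is the sketch below. It relies on the documented behavior that assert does not evaluate its condition in -O builds; the function name isOptimizedBuild is made up for this illustration.

```swift
/// Hypothetical helper: reports whether assertions were compiled out,
/// which is the case for -O (and -Ounchecked) builds.
func isOptimizedBuild() -> Bool {
    var assertionRan = false
    // The condition passed to assert is only evaluated in -Onone builds,
    // so the flag stays false when the optimizer is on.
    assert({ () -> Bool in assertionRan = true; return true }())
    return !assertionRan
}

print("Optimized build: \(isOptimizedBuild())")
```

Printing this next to the timing result makes it harder to mistake a Debug measurement for a Release one.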

Now, looking at your code, here are some things that stand out:

You're picking a color using BGRAPalette[index % BGRAPalette.count]. Since your palette size is 64, you can say BGRAPalette[index & 0b0011_1111] for the same result. I expected Swift to optimize that for me, but apparently it didn't, because making that change reduced the reported time to 2ms.
Indexing into BGRAPalette incurs a bounds check. You can avoid the bounds check by grabbing an UnsafeBufferPointer for the palette. Adding this optimization reduced the reported time to 1ms.
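
The mask only matches the modulo when the palette size is a power of two and the index is non-negative. A small sketch that checks this for every pixel index of a 256x240 screen (paletteCount, mask and paletteIndex are names picked for this example):

```swift
// Assumed palette size; the mask trick needs a power of two.
let paletteCount = 64
precondition(paletteCount > 0 && (paletteCount & (paletteCount - 1)) == 0,
             "the mask trick needs a power-of-two palette size")

// For 64 entries the mask is 63, i.e. 0b0011_1111.
let mask = paletteCount - 1

/// Maps a non-negative pixel index to a palette slot without a division.
func paletteIndex(_ pixelIndex: Int) -> Int {
    pixelIndex & mask
}

// Compare against the modulo for every pixel of a 256x240 screen.
var allEqual = true
for i in 0 ..< 256 * 240 where i % paletteCount != paletteIndex(i) {
    allEqual = false
}
print("mask matches modulo: \(allEqual)")
```

If the palette size ever stops being a power of two, the precondition fails instead of silently picking wrong colors.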

Here's my version:

public struct BGRA {
    let blue: UInt8
    let green: UInt8
    let red: UInt8
    let alpha: UInt8
}

func rng() -> UInt8 { UInt8.random(in: .min ... .max) }
let BGRAPalette = (0 ..< 64).map { _ in
    BGRA(blue: rng(), green: rng(), red: rng(), alpha: rng())
}

public func test() {
    let screenWidth: Int = 2560
    let screenHeight: Int = 2400
    let pixelCount = screenWidth * screenHeight
    let pixelBuffer = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: pixelCount)
    let runCount = 1000
    let start = SuspendingClock().now

    BGRAPalette.withUnsafeBufferPointer { paletteBuffer in
        for _ in 0 ..< runCount
        {
            for index in 0 ..< pixelCount
            {
                pixelBuffer[index] = paletteBuffer[index & 0b0011_1111]
            }
        }
    }

    let end = SuspendingClock().now
    let elapsed = end - start
    let msElapsed = elapsed.components.seconds * 1000 + elapsed.components.attoseconds / 1_000_000_000_000_000
    print("Average time per run: \(msElapsed / Int64(runCount)) ms")
    // return pixelBuffer
}

@main
struct MyMain {
    static func main() {
        test()
    }
}

In addition to the two optimizations I described, I removed the dependency on Foundation (so I could paste the code into the compiler explorer) and corrected the spelling of 'palette'.

But realistically, even this probably isn't a particularly good test of your fill rate. You didn't say what kind of game you want to write, but given your screen size of 256x240, it's likely to use a tile-based map and sprites. If so, you shouldn't copy a pixel at a time. You can write a blitter that copies blocks of pixels at a time, using CPU instructions that operate on more than 32 bits at a time. ARM64 has 128-bit (16-byte) registers.
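
As a rough sketch of what such a blitter could look like, the code below copies a tile into the screen buffer one row at a time with copyMemory, leaving it to the standard library's memory copy to pick wide loads and stores. The Pixel type, the blit function, and the 8x8 tile are assumptions made for this example, and there is no clipping or transparency handling.

```swift
// Hypothetical 4-byte pixel with the same field order as BGRA above.
struct Pixel: Equatable { var blue, green, red, alpha: UInt8 }

/// Copies a tileWidth x tileHeight block into `screen` at (x, y), one row per copy.
/// Assumes the tile lies fully inside the screen; nothing is clipped.
func blit(tile: UnsafeBufferPointer<Pixel>, tileWidth: Int, tileHeight: Int,
          to screen: UnsafeMutableBufferPointer<Pixel>, screenWidth: Int,
          x: Int, y: Int) {
    guard let src = tile.baseAddress, let dst = screen.baseAddress else { return }
    let rowBytes = tileWidth * MemoryLayout<Pixel>.stride
    for row in 0 ..< tileHeight {
        let srcRow = UnsafeRawPointer(src + row * tileWidth)
        let dstRow = UnsafeMutableRawPointer(dst + (y + row) * screenWidth + x)
        // One bulk copy per row instead of one store per pixel.
        dstRow.copyMemory(from: srcRow, byteCount: rowBytes)
    }
}

// Usage: draw one solid 8x8 tile at (16, 32) on a 256x240 screen.
let screenWidth = 256
let screenHeight = 240
let background = Pixel(blue: 0, green: 0, red: 0, alpha: 0xff)
let tileColor = Pixel(blue: 252, green: 0, red: 0, alpha: 0xff)
let screen = UnsafeMutableBufferPointer<Pixel>.allocate(capacity: screenWidth * screenHeight)
screen.initialize(repeating: background)
let tile = [Pixel](repeating: tileColor, count: 8 * 8)
tile.withUnsafeBufferPointer { tileBuffer in
    blit(tile: tileBuffer, tileWidth: 8, tileHeight: 8,
         to: screen, screenWidth: screenWidth, x: 16, y: 32)
}
```

A real engine would keep the screen buffer alive for its whole lifetime and add clipping for tiles that cross the screen edge.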

But even more realistically, you should learn to use the GPU for your blitting. Not only is it faster for this sort of thing, it's probably more power-efficient too. Even though you're lighting up more of the chip, you're lighting it up for shorter intervals.

CodePudding user response：

Okay, so I gave this a shot. You can probably move to using single instruction, multiple data (SIMD) operations. This would probably speed up the whole process by 10 to 20 percent or so.

Here is the code link (the same code is pasted below for convenience): https://codecatch.net/post/4b9683bf-8e35-4bf5-a1a9-801ab2e73805

I made two versions just in case your system's architecture doesn't support simd_uint4. Let me know if this is what you were looking for.

import Foundation // for Date
import simd

private func test() {
    let screenWidth: Int = 256
    let screenHeight: Int = 240
    let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
    let runCount = 1000
    let start = Date.now
    for _ in 0 ..< runCount {
        var index = 0
        var palletIndex = 0
        let palletCount = BGRAPallet.count
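        while index < pixelBufferPtr.count {
            let bgra = BGRAPallet[palletIndex]
            // Pack one pixel into 32 bits (blue in the lowest byte, which matches
            // BGRA's memory layout on little-endian CPUs) and repeat it in all four lanes.
            let packed = UInt32(bgra.blue) | UInt32(bgra.green) << 8
                | UInt32(bgra.red) << 16 | UInt32(bgra.alpha) << 24
            let bgraVector = simd_uint4(repeating: packed)
            let maxCount = min(pixelBufferPtr.count - index, 4)
            let pixelBuffer = pixelBufferPtr.baseAddress! + index
            if maxCount == 4 {
                // One 16-byte store writes four pixels of the same color.
                UnsafeMutableRawPointer(pixelBuffer).storeBytes(of: bgraVector, as: simd_uint4.self)
            } else {
                // Tail: fewer than four pixels left, so write them one at a time.
                for offset in 0 ..< maxCount { pixelBuffer[offset] = bgra }
            }
            palletIndex += 1
            if palletIndex == palletCount {
                palletIndex = 0
            }
            index += maxCount
        }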
    }
    let elapsed = Date.now.timeIntervalSince(start)
    print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}