As part of a hobby project, I'm working on a 2D game engine that will draw each pixel every frame, using a color from a palette. I am looking for a way to do that while maintaining a reasonable frame rate (60fps being the minimum). Without any game-logic in place, I am updating the values of my pixels with some value form the palette.
I'm currently taking the mod of an index, to (hopefully) prevent the compiler from doing some loop-optimisation it could do with a fixed value. Below is my (very naive?) implementation of updating the bytes in the pixel array.
On an iPhone 12 Pro, each run of updating all pixel values takes on average 43 ms, while on a simulator running on an M1 mac, it takes 15 ms. Both unacceptable, as that would leave not for any additional game logic (which would be much more operations than taking the mod of an Int).
I was planning to look into Metal and set up a surface, but clearly the bottleneck here is the CPU, so if I can optimize this code, I could go for a higher-level framework. Any suggestions on a performant way to write this many bytes much, much faster (parallelisation is not an option)?
Instruments shows that most of the time is being spent in Swifts IndexingIterator.next()
function. Maybe there is way to reduce the time spent there, there is quite a substantial subtree inside it.
struct BGRA
{
let blue: UInt8
let green: UInt8
let red: UInt8
let alpha: UInt8
}
let BGRAPallet =
[
BGRA(blue: 124, green: 124, red: 124, alpha: 0xff),
BGRA(blue: 252, green: 0, red: 0, alpha: 0xff),
// ... 62 more values in my code, omitting here for brevity
]
private func test()
{
let screenWidth: Int = 256
let screenHeight: Int = 240
let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
let runCount = 1000
let start = Date.now
for _ in 0 ..< runCount
{
for index in 0 ..< pixelBufferPtr.count
{
pixelBufferPtr[index] = BGRAPallet[index % BGRAPallet.count]
}
}
let elapsed = Date.now.timeIntervalSince(start)
print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}
CodePudding user response:
First of all, I don't believe you're testing an optimized build for two reasons:
You say “This was measured with optimization set to
Fastest [-O3]
.” But the Swift compiler doesn't recognize-O3
as a command-line flag. The C/C compiler recognizes that flag. For Swift the flags are-Onone
,-O
,-Osize
, and-Ounchecked
.I ran your code on my M1 Max MacBook Pro in Debug configuration and it reported 15ms. Then I ran it in Release configuration and it reported 0ms. I had to increase the screen size to 2560x2400 (100x the pixels) to get it to report a time of 3ms.
Now, looking at your code, here are some things that stand out:
You're picking a color using
BGRAPalette[index % BGRAPalette.count]
. Since your palette size is 64, you can sayBGRAPalette[index & 0b0011_1111]
for the same result. I expected Swift to optimize that for me, but apparently it didn't, because making that change reduced the reported time to 2ms.Indexing into
BGRAPalette
incurs a bounds check. You can avoid the bounds check by grabbing anUnsafeBufferPointer
for the palette. Adding this optimization reduced the reported time to 1ms.
Here's my version:
public struct BGRA {
let blue: UInt8
let green: UInt8
let red: UInt8
let alpha: UInt8
}
func rng() -> UInt8 { UInt8.random(in: .min ... .max) }
let BGRAPalette = (0 ..< 64).map { _ in
BGRA(blue: rng(), green: rng(), red: rng(), alpha: rng())
}
public func test() {
let screenWidth: Int = 2560
let screenHeight: Int = 2400
let pixelCount = screenWidth * screenHeight
let pixelBuffer = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: pixelCount)
let runCount = 1000
let start = SuspendingClock().now
BGRAPalette.withUnsafeBufferPointer { paletteBuffer in
for _ in 0 ..< runCount
{
for index in 0 ..< pixelCount
{
pixelBuffer[index] = paletteBuffer[index & 0b0011_1111]
}
}
}
let end = SuspendingClock().now
let elapsed = end - start
let msElapsed = elapsed.components.seconds * 1000 elapsed.components.attoseconds / 1_000_000_000_000_000
print("Average time per run: \(msElapsed / Int64(runCount)) ms")
// return pixelBuffer
}
@main
struct MyMain {
static func main() {
test()
}
}
In addition to the two optimizations I described, I removed the dependency on Foundation (so I could paste the code into the compiler explorer) and corrected the spelling of ‘palette‘.
But realistically, even this isn't probably isn't particularly good test of your fill rate. You didn't say what kind of game you want to write, but given your screen size of 256x240, it's likely to use a tile-based map and sprites. If so, you shouldn't copy a pixel at a time. You can write a blitter that copies blocks of pixels at a time, using CPU instructions that operate on more than 32 bits at a time. ARM64 has 128-bit (16-byte) registers.
But even more realistically, you should use learn to use the GPU for your blitting. Not only is it faster for this sort of thing, it's probably more power-efficient too. Even though you're lighting up more of the chip, you're lighting it up for shorter intervals.
CodePudding user response:
Okay so i gave this a shot. You can probably move to using single input (or instruction), multiple output. This would probably speed up the whole process by 10 - 20 percent or so.
Here is the code link (same code will be pasted below for brevity): https://codecatch.net/post/4b9683bf-8e35-4bf5-a1a9-801ab2e73805
I made two versions just in case your systems architecture doesn't support simd_uint4. Let me know if this is what you were looking for.
import simd
private func test() {
let screenWidth: Int = 256
let screenHeight: Int = 240
let pixelBufferPtr = UnsafeMutableBufferPointer<BGRA>.allocate(capacity: screenWidth * screenHeight)
let runCount = 1000
let start = Date.now
for _ in 0 ..< runCount {
var index = 0
var palletIndex = 0
let palletCount = BGRAPallet.count
while index < pixelBufferPtr.count {
let bgra = BGRAPallet[palletIndex]
let bgraVector = simd_uint4(bgra.blue, bgra.green, bgra.red, bgra.alpha)
let maxCount = min(pixelBufferPtr.count - index, 4)
let pixelBuffer = pixelBufferPtr.baseAddress! index
pixelBuffer.storeBytes(of: bgraVector, as: simd_uint4.self)
palletIndex = 1
if palletIndex == palletCount {
palletIndex = 0
}
index = maxCount
}
}
let elapsed = Date.now.timeIntervalSince(start)
print("Average time per run: \((Int(elapsed) * 1000) / runCount) ms")
}