What are the fastest possible iteration techniques in C# for the following scenario?
Since I'm working on a small archetype-based ECS in C#, I want to make use of cache-efficient iteration for maximum performance. What could I do to make the iteration faster and get the maximum number of cache hits?
var chunks = archetype.Chunks; // Property that returns a Chunk[] array
for (var chunkIndex = 0; chunkIndex < archetype.Size; chunkIndex++) {
    ref var chunk = ref chunks[chunkIndex];
    var transforms = chunk.GetArray<Transform>(); // Returns a Transform[] array
    var rotations = chunk.GetArray<Rotation>();   // Returns a Rotation[] array

    for (var index = 0; index < chunk.Capacity; index++) {
        ref var transform = ref transforms[index];
        ref var rotation = ref rotations[index];

        transform.x++;
        rotation.w++;
    }
}
Details...
public struct Transform { float x; float y; }
public struct Rotation { float x; float y; float z; float w; }

T[] (chunk).GetArray<T>() {
    return fittingTightlyPackedManagedArrayForT as T[]; // Pseudocode
}

int (chunk).Capacity { get; set; } // Just a property of how big each array is in the chunk, all arrays having the same size
I already tested an unsafe variant to reduce the bounds checks; however, this increased the cache misses according to my benchmark and was only slightly faster (not noticeably, not even for large entity counts).
What else could I do to increase the iteration speed? Glad for any feedback, techniques and tricks! :)
CodePudding user response:
A plain loop over an array or list is as fast as you can do iteration in C#, at least unless you have some special knowledge not available to the compiler. The compiler should recognize that you are looping over an array and skip the bounds check. And a linear iteration should allow the CPU to prefetch data before it is actually needed.
In your example I would not be certain the compiler can remove the bounds checks, since the loop condition is not checked against the array length. So I would at least try changing it to two separate loops, each over its own array, instead.
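A sketch of what that change could look like. `Transform` and `Rotation` are the structs from the question (with their fields made public so the loop can touch them); splitting the work into one loop per array, each bounded by that array's own `Length`, is the pattern the JIT recognizes for bounds-check elimination:

```csharp
using System;

public struct Transform { public float x, y; }
public struct Rotation { public float x, y, z, w; }

public static class BoundsCheckDemo
{
    // Looping each array against its own Length lets the JIT prove the
    // index is always in range, so the per-element bounds check is elided.
    public static void Update(Transform[] transforms, Rotation[] rotations)
    {
        for (var i = 0; i < transforms.Length; i++)
        {
            ref var transform = ref transforms[i];
            transform.x++;
        }

        for (var i = 0; i < rotations.Length; i++)
        {
            ref var rotation = ref rotations[i];
            rotation.w++;
        }
    }
}
```

Splitting the loop also happens to touch each array sequentially rather than interleaving two memory streams, which can be friendlier to the prefetcher.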
I'm not sure why the unsafe version had a lower cache hit rate; the cache is controlled by the CPU, not the compiler, and I would expect an unsafe version to produce very similar code to the safe one, at least with regard to memory access.
In some special cases it might be useful to manually unroll loops, but the compiler should be able to do this automatically, and this question suggests it is of little use. Compiler optimizations can be fickle, though: the compiler might not always apply the optimizations you expect, and which ones it applies can differ between versions, depending on how long the code runs, whether you use profile-guided optimization, etc.
To get any real gains I would look at SIMD techniques; if you can process larger chunks of data at once you might see very significant gains. But the gains depend in large part on how the data is stored and accessed.
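A minimal sketch of what that could look like with `System.Numerics.Vector<float>` (the JIT maps it to SSE/AVX where available). It assumes the values to process already sit in a flat `float[]`; the `IncrementAll` name and the +1 operation are just illustrative stand-ins for whatever your system does per component:

```csharp
using System;
using System.Numerics;

public static class SimdDemo
{
    // Adds 1 to every element in SIMD blocks of Vector<float>.Count lanes
    // (4, 8 or 16 floats per step depending on the hardware), then handles
    // the remaining tail elements with a scalar loop.
    public static void IncrementAll(float[] values)
    {
        var width = Vector<float>.Count;
        var one = new Vector<float>(1f);

        var i = 0;
        for (; i <= values.Length - width; i += width)
        {
            var block = new Vector<float>(values, i); // load one SIMD block
            (block + one).CopyTo(values, i);          // store it back
        }

        for (; i < values.Length; i++) // scalar remainder
            values[i] += 1f;
    }
}
```

Note this only pays off when the data for one field is contiguous, which is exactly what the SoA layout discussed next provides.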
In some cases there can be major gains from using a structure-of-arrays (SoA) approach rather than the more common array-of-structures (AoS). In your example, if all the x and w values were stored in separate arrays you could just process the entire array in 128/256/512-bit SIMD blocks, and that would be fairly difficult to beat. This also has great cache efficiency, since you are not loading any unnecessary bytes. But using the SoA approach might have performance implications for other parts of the code.