Particle Pool Performance Optimization In C++


I have a particle pool which stores particle pointers.

It holds a large number of particles.

Every particle has an IsActive variable.

Partitioning the vector every frame according to each particle's active state causes high CPU usage.

So instead, would it be a good idea to partition the vector every few seconds on a worker thread?

CodePudding user response:

Since the order of the active particles doesn't matter, and they don't become active or inactive very often, you can use a trick to keep the active particles at the front:

Suppose that nActive is the number of active particles.

When an inactive particle becomes active, do std::swap(vector[nParticle], vector[nActive]); nActive++; i.e. make it the last active particle - and whatever inactive particle was already in that slot goes to where the particle used to be, and we don't care, because we don't care about the order.

When an active particle becomes inactive, do nActive--; std::swap(vector[nParticle], vector[nActive]); i.e. make it the first inactive particle - and whatever active particle was already in that slot goes to where the particle used to be, and we don't care, because we don't care about the order.

You can remember this trick by noting that it's easy to make a particle active or inactive - if we don't care which one it is - because we just add or subtract 1 from nActive. But then we don't know which particle we just changed. So we swap that particle with the one we meant to change, and we don't care where that particle goes, as long as it stays inactive or stays active (doesn't change).

Like this:

+-----------+-----------+-----------+-----------+
| Particle0 | Particle1 | Particle2 | Particle3 |
| active    | active    | inactive  | inactive  |
+-----------+-----------+-----------+-----------+
                        ^
                        |
                    nActive==2

make particle 3 active: swap particles 3 and 2, then increment nActive:

+-----------+-----------+-----------+-----------+
| Particle0 | Particle1 | Particle3 | Particle2 |
| active    | active    | active    | inactive  |
+-----------+-----------+-----------+-----------+
                                    ^
                                    |
                                nActive==3

make particle 0 inactive: decrement nActive, then swap particle 0 with the particle in slot 2 (particle 3):

+-----------+-----------+-----------+-----------+
| Particle3 | Particle1 | Particle0 | Particle2 |
| active    | active    | inactive  | inactive  |
+-----------+-----------+-----------+-----------+
                        ^
                        |
                    nActive==2

The particle order is all jumbled up now, but you don't care about that, right?
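
Here is a minimal sketch of that swap trick, assuming the pool stores particles by value in a single std::vector; the Particle and ParticlePool names and the isActive flag are only illustrative:

#include <cstddef>
#include <utility>
#include <vector>

struct Particle
{
    bool isActive = false;
    // position, velocity, etc.
};

struct ParticlePool
{
    std::vector<Particle> particles;
    size_t nActive = 0; // particles[0..nActive) are the active ones

    // Activate the particle currently sitting in slot `slot`.
    void activate(size_t slot)
    {
        if (slot < nActive) return;             // already in the active range
        std::swap(particles[slot], particles[nActive]);
        particles[nActive].isActive = true;     // it is now the last active particle
        ++nActive;
    }

    // Deactivate the particle currently sitting in slot `slot`.
    void deactivate(size_t slot)
    {
        if (slot >= nActive) return;            // already in the inactive range
        --nActive;
        std::swap(particles[slot], particles[nActive]);
        particles[nActive].isActive = false;    // it is now the first inactive particle
    }
};

Note that the swaps move particles between slots, so any code that holds an index or pointer into the vector has to account for that.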

CodePudding user response:

While user253751's answer is absolutely correct for optimizing your partitioning scheme, it sounds like you're going to have a larger issue relating to the number of particles and cache coherency.

I'm guessing that the data structure being used is something along the lines of

struct Particle
{
    vec2 position;
    vec2 velocity;
    float scale;
    float scaleVelocity;
    float lifeRemaining;
    vec4 color;
    // Other attributes
};

std::vector<Particle> particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles[i].position += particles[i].velocity;
        particles[i].scale += particles[i].scaleVelocity;
        particles[i].lifeRemaining -= delta;
    }
}

This results in each particle's fields being packed one after another (an array-of-structs layout). In a multithreaded environment this can perform worse, because cores updating different components of neighbouring particles may have to synchronize over the same cache line (assuming you split the work per component rather than into contiguous chunks); either way, a layout you can vectorize will still be better.

Position      Velocity      Scale     Scale Velocity    Life Remaining    Color
position_0    velocity_0    scale_0   scaleVelocity_0   lifeRemaining_0   color_0
position_1    velocity_1    scale_1   scaleVelocity_1   lifeRemaining_1   color_1
position_2    velocity_2    scale_2   scaleVelocity_2   lifeRemaining_2   color_2
position_3    velocity_3    scale_3   scaleVelocity_3   lifeRemaining_3   color_3

This has poor cache locality for per-component updates and isn't conducive to vectorization.

In this case you can likely improve performance substantially by switching to a data-oriented, structure-of-arrays layout similar to

struct ParticleContainer
{
    std::vector<vec2> position;
    std::vector<vec2> velocity;
    std::vector<float> scale;
    std::vector<float> scaleVelocity;
    std::vector<float> lifeRemaining;
    std::vector<vec4> color;

    // All vectors are kept the same length, so any of them gives the particle count.
    size_t size() const { return position.size(); }
};

ParticleContainer particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.position[i] += particles.velocity[i];
    }

    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.scale[i] += particles.scaleVelocity[i];
    }

    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.lifeRemaining[i] -= delta;
    }
}

This results in identical components being packed together, allowing you to operate on several of them at a time:

Particle 0         Particle 1         Particle 2         Particle 3
position_0         position_1         position_2         position_3
velocity_0         velocity_1         velocity_2         velocity_3
scale_0            scale_1            scale_2            scale_3
scaleVelocity_0    scaleVelocity_1    scaleVelocity_2    scaleVelocity_3
lifeRemaining_0    lifeRemaining_1    lifeRemaining_2    lifeRemaining_3
color_0            color_1            color_2            color_3

This memory layout is much more conducive to vectorization and can drastically improve performance thanks to better cache locality. It also gives you the option to manually vectorize your code with something similar to

#include <immintrin.h>

void updateParticles(const float delta)
{
    // A vec2 of floats requires 64 bits, allowing us to pack 2 into a 128 bit vector.
    for(size_t i = 0; i + 2 <= particles.size(); i += 2)
    {
        const __m128 positionVector = _mm_loadu_ps(reinterpret_cast<const float*>(&particles.position[i]));
        const __m128 velocityVector = _mm_loadu_ps(reinterpret_cast<const float*>(&particles.velocity[i]));
        const __m128 newPosition = _mm_add_ps(positionVector, velocityVector);
        _mm_storeu_ps(reinterpret_cast<float*>(&particles.position[i]), newPosition);
    }

    // A float requires 32 bits, allowing us to pack 4 into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 scaleVector = _mm_loadu_ps(&particles.scale[i]);
        const __m128 scaleVelocityVector = _mm_loadu_ps(&particles.scaleVelocity[i]);
        const __m128 newScale = _mm_add_ps(scaleVector, scaleVelocityVector);
        _mm_storeu_ps(&particles.scale[i], newScale);
    }

    const __m128 deltaVector = _mm_set_ps1(delta);

    // A float requires 32 bits, allowing us to pack 4 into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 lifeRemainingVector = _mm_loadu_ps(&particles.lifeRemaining[i]);
        const __m128 newLifeRemaining = _mm_sub_ps(lifeRemainingVector, deltaVector);
        _mm_storeu_ps(&particles.lifeRemaining[i], newLifeRemaining);
    }
}

This code still requires some sanity checks, such as handling the leftover elements when an array's size isn't a multiple of the vector width. Additionally, if you can ensure that the memory starts on a 16 byte boundary, you can use the aligned load and store intrinsics. You can also use the 256 bit AVX and AVX2 registers if those are available to you.
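
For example, one way to handle an array whose size isn't a multiple of the vector width is to finish off with a plain scalar loop; a sketch for the lifeRemaining update, reusing the container from above with a hypothetical updateLifeRemaining helper:

#include <immintrin.h>

void updateLifeRemaining(ParticleContainer& particles, const float delta)
{
    const size_t count = particles.lifeRemaining.size();
    const __m128 deltaVector = _mm_set_ps1(delta);

    // Vectorized body: 4 floats per iteration.
    size_t i = 0;
    for(; i + 4 <= count; i += 4)
    {
        const __m128 life = _mm_loadu_ps(&particles.lifeRemaining[i]);
        _mm_storeu_ps(&particles.lifeRemaining[i], _mm_sub_ps(life, deltaVector));
    }

    // Scalar tail: the 0-3 leftover elements.
    for(; i < count; ++i)
    {
        particles.lifeRemaining[i] -= delta;
    }
}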

Memory access is the slowest part of a modern computer, and this layout ensures that as much data as possible is operated on at a time. It also makes it easier for the CPU to prefetch upcoming cache lines. You can now also use this in a multicore environment with substantially better performance and without threads clobbering each other's cache lines; that can be done with multiple threads on the CPU, or offloaded to a GPGPU using something like CUDA, OpenCL, or compute shaders.
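
As a rough sketch of the multicore case, you can hand each thread its own contiguous chunk of the component arrays, so threads mostly write to different cache lines. The updatePositionsParallel helper is just a placeholder, and vec2's operator+= is assumed as in the scalar version above:

#include <cstddef>
#include <thread>
#include <vector>

void updatePositionsParallel(ParticleContainer& particles, const unsigned threadCount)
{
    const size_t count = particles.position.size();
    std::vector<std::thread> workers;

    for(unsigned t = 0; t < threadCount; ++t)
    {
        // Each thread gets its own contiguous [begin, end) range of particles.
        const size_t begin = count * t / threadCount;
        const size_t end = count * (t + 1) / threadCount;

        workers.emplace_back([&particles, begin, end]()
        {
            for(size_t i = begin; i < end; ++i)
            {
                particles.position[i] += particles.velocity[i];
            }
        });
    }

    for(std::thread& worker : workers)
    {
        worker.join();
    }
}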
