I have a particle pool which stores particle pointers.
It contains a large number of particles.
Every particle has an IsActive variable.
Partitioning the vector every frame according to the active state causes high CPU usage.
So, instead, would it be a good idea to partition the vector every few seconds on a worker thread?
CodePudding user response:
Since the order of the active particles doesn't matter, and particles don't become active or inactive very often, you can use a trick to keep the active particles at the front:
Suppose that nActive is the number of active particles.
When an inactive particle becomes active, do std::swap(vector[nParticle], vector[nActive]); nActive++;
i.e. make it the last active particle. Whatever inactive particle was already in that slot goes to where this particle used to be, and we don't care, because we don't care about the order.
When an active particle becomes inactive, do nActive--; std::swap(vector[nParticle], vector[nActive]);
i.e. make it the first inactive particle. Whatever active particle was already in that slot goes to where this particle used to be, and we don't care, because we don't care about the order.
You can remember this trick by noting that it's easy to make some particle active or inactive - if we don't care which one - because we just add or subtract 1 from nActive. But then we don't know which particle we just changed, so we swap that particle with the one we meant to change, and we don't care where the swapped particle goes, as long as it stays active or stays inactive (doesn't change state).
Like this:
 ----------- ----------- ----------- -----------
| Particle0 | Particle1 | Particle2 | Particle3 |
|  active   |  active   | inactive  | inactive  |
 ----------- ----------- ----------- -----------
                         ^
                         |
                     nActive==2
make particle 3 active: swap particle 3 and 2, then increment nActive:
 ----------- ----------- ----------- -----------
| Particle0 | Particle1 | Particle3 | Particle2 |
|  active   |  active   |  active   | inactive  |
 ----------- ----------- ----------- -----------
                                     ^
                                     |
                                 nActive==3
make particle 0 inactive: decrement nActive, then swap particle 3 (in slot 2) and 0.
 ----------- ----------- ----------- -----------
| Particle3 | Particle1 | Particle0 | Particle2 |
|  active   |  active   | inactive  | inactive  |
 ----------- ----------- ----------- -----------
                         ^
                         |
                     nActive==2
The particle order is all jumbled up now, but you don't care about that, right?
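As a minimal sketch of this bookkeeping (the pool type, field names, and helper functions below are hypothetical, not taken from your code), it could look like this:
#include <cstddef>
#include <utility>
#include <vector>

struct Particle
{
    bool isActive = false;
    // position, velocity, etc.
};

// Hypothetical pool that keeps all active particles packed at the front.
struct ParticlePool
{
    std::vector<Particle> particles;
    std::size_t nActive = 0; // particles[0..nActive) are active

    void activate(std::size_t index)   // index of a currently inactive particle
    {
        particles[index].isActive = true;
        std::swap(particles[index], particles[nActive]); // it becomes the last active particle
        ++nActive;
    }

    void deactivate(std::size_t index) // index of a currently active particle
    {
        particles[index].isActive = false;
        --nActive;
        std::swap(particles[index], particles[nActive]); // it becomes the first inactive particle
    }
};
The same idea works if the pool stores particle pointers instead of values; then each swap is just a pointer swap and is even cheaper.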
CodePudding user response:
While user253751's answer is absolutely correct for optimizing your partitioning scheme, it sounds like you will have a larger issue related to the number of particles and cache coherency.
I'm guessing that the data structure being used is something along the lines of
struct Particle
{
    vec2 position;
    vec2 velocity;
    float scale;
    float scaleVelocity;
    float lifeRemaining;
    vec4 color;
    // Other attributes
};

std::vector<Particle> particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles[i].position += particles[i].velocity;
        particles[i].scale += particles[i].scaleVelocity;
        particles[i].lifeRemaining -= delta;
    }
}
This results in each particle's fields being packed one after another (an array-of-structures layout). In a multithreaded environment this can hurt performance, because cores may have to synchronize over the same cache line (assuming each thread processes a separate component instead of a contiguous chunk of particles); either way, a layout that can be vectorized will still be better.
Position | Velocity | Scale | Scale Velocity | Life Remaining | Color |
---|---|---|---|---|---|
position_0 | velocity_0 | scale_0 | scaleVelocity_0 | lifeRemaining_0 | color_0 |
position_1 | velocity_1 | scale_1 | scaleVelocity_1 | lifeRemaining_1 | color_1 |
position_2 | velocity_2 | scale_2 | scaleVelocity_2 | lifeRemaining_2 | color_2 |
position_3 | velocity_3 | scale_3 | scaleVelocity_3 | lifeRemaining_3 | color_3 |
This has poor cache coherency and isn't conducive to vectorization.
In this case you can likely substantially improve performance by switching to a Data Oriented Programming approach similar to
struct ParticleContainer
{
    std::vector<vec2> position;
    std::vector<vec2> velocity;
    std::vector<float> scale;
    std::vector<float> scaleVelocity;
    std::vector<float> lifeRemaining;
    std::vector<vec4> color;

    size_t size() const { return position.size(); }
};

ParticleContainer particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.position[i] += particles.velocity[i];
    }
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.scale[i] += particles.scaleVelocity[i];
    }
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.lifeRemaining[i] -= delta;
    }
}
This results in identical components being packed together, allowing you to operate on several of them at a time.
Particle 0 | Particle 1 | Particle 2 | Particle 3 |
---|---|---|---|
position_0 | position_1 | position_2 | position_3 |
velocity_0 | velocity_1 | velocity_2 | velocity_3 |
scale_0 | scale_1 | scale_2 | scale_3 |
scaleVelocity_0 | scaleVelocity_1 | scaleVelocity_2 | scaleVelocity_3 |
lifeRemaining_0 | lifeRemaining_1 | lifeRemaining_2 | lifeRemaining_3 |
color_0 | color_1 | color_2 | color_3 |
This memory layout is much more conducive to vectorization and can drastically improve performance due to cache coherency. It also gives you the option to manually vectorize your code, similar to
#include <immintrin.h> // SSE intrinsics

void updateParticles(const float delta)
{
    // A vec2 of floats is 64 bits, allowing us to pack 2 of them into a 128 bit vector.
    for(size_t i = 0; i + 2 <= particles.size(); i += 2)
    {
        const __m128 positionVector = _mm_loadu_ps(&particles.position[i].x);
        const __m128 velocityVector = _mm_loadu_ps(&particles.velocity[i].x);
        const __m128 newPosition = _mm_add_ps(positionVector, velocityVector);
        _mm_storeu_ps(&particles.position[i].x, newPosition);
    }
    // A float is 32 bits, allowing us to pack 4 of them into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 scaleVector = _mm_loadu_ps(&particles.scale[i]);
        const __m128 scaleVelocityVector = _mm_loadu_ps(&particles.scaleVelocity[i]);
        const __m128 newScale = _mm_add_ps(scaleVector, scaleVelocityVector);
        _mm_storeu_ps(&particles.scale[i], newScale);
    }
    const __m128 deltaVector = _mm_set_ps1(delta);
    // A float is 32 bits, allowing us to pack 4 of them into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 lifeRemainingVector = _mm_loadu_ps(&particles.lifeRemaining[i]);
        const __m128 newLifeRemaining = _mm_sub_ps(lifeRemainingVector, deltaVector);
        _mm_storeu_ps(&particles.lifeRemaining[i], newLifeRemaining);
    }
}
This code obviously requires some sanity checks, such as ensuring that each array's size is a multiple of the vector width or handling the leftover elements separately (as sketched below). Additionally, if you can ensure the memory starts on a 16-byte boundary, you can use the aligned load and store functions. You can also use the 256-bit AVX and AVX2 registers if those are available to you.
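As an illustration only (the function name here is made up, not part of the code above), one way to handle an array whose size is not a multiple of the vector width is to finish with a scalar tail loop:
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: subtract delta from every element, 4 floats per SSE iteration,
// then fall back to a scalar loop for the 0-3 leftover elements.
void updateLifeRemaining(std::vector<float>& lifeRemaining, const float delta)
{
    const __m128 deltaVector = _mm_set_ps1(delta);
    std::size_t i = 0;
    for(; i + 4 <= lifeRemaining.size(); i += 4)
    {
        const __m128 v = _mm_loadu_ps(&lifeRemaining[i]);
        _mm_storeu_ps(&lifeRemaining[i], _mm_sub_ps(v, deltaVector));
    }
    for(; i < lifeRemaining.size(); ++i) // scalar tail
    {
        lifeRemaining[i] -= delta;
    }
}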
Memory accesses are the slowest part of modern computers, and doing this ensures that as much data as possible is operated on at a time. It also makes it easier for the CPU to eagerly load upcoming cache lines. You can now use this layout in a multicore environment with substantially better performance and without clobbering cache line usage; the work can either be split across multiple threads on a CPU or easily performed on a GPGPU using something like CUDA, OpenCL, or compute shaders.
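For example, splitting one of the component arrays into contiguous chunks, one per thread, keeps each thread writing to its own region of memory. This is a rough sketch with invented names, not code from the answer above:
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: each thread updates a contiguous, disjoint chunk, so writes
// from different threads only ever meet on a shared cache line at chunk boundaries.
// Assumes threadCount > 0.
void updateLifeRemainingParallel(std::vector<float>& lifeRemaining, const float delta,
                                 const std::size_t threadCount)
{
    const std::size_t chunk = (lifeRemaining.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;

    for(std::size_t t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, lifeRemaining.size());
        if(begin >= end)
            break;

        workers.emplace_back([&lifeRemaining, delta, begin, end]
        {
            for(std::size_t i = begin; i < end; ++i)
                lifeRemaining[i] -= delta;
        });
    }

    for(std::thread& worker : workers)
        worker.join();
}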