I have a particle pool which stores particle pointers.
It contains a large number of particles.
Every particle has an IsActive variable.
Partitioning the vector every frame according to the active state causes high CPU usage.
So, instead, would it be a good idea to partition the vector every few seconds on a worker thread?
CodePudding user response:
Since the order of the active particles doesn't matter, and particles don't become active or inactive very often, you can use a trick to keep the active particles at the front:
Suppose that nActive is the number of active particles.
When an inactive particle becomes active, do std::swap(vector[nParticle], vector[nActive]); nActive++;
i.e. make it the last active particle. Whatever inactive particle was already in that slot goes to where this particle used to be, and we don't care, because we don't care about the order.
When an active particle becomes inactive, do nActive--; std::swap(vector[nParticle], vector[nActive]);
i.e. make it the first inactive particle. Whatever active particle was already in that slot goes to where this particle used to be, and we don't care, because we don't care about the order.
You can remember this trick by noting that it's easy to make some particle active or inactive - if we don't care which one - because we just add or subtract 1 from nActive. But then we don't know which particle we just changed, so we swap that particle with the one we meant to change, and we don't care where the swapped particle goes, as long as it stays active or stays inactive (doesn't change state).
Like this:
 ----------- ----------- ----------- -----------
| Particle0 | Particle1 | Particle2 | Particle3 |
|  active   |  active   | inactive  | inactive  |
 ----------- ----------- ----------- -----------
                         ^
                         |
                     nActive==2
make particle 3 active: swap particle 3 and 2, then increment nActive:
 ----------- ----------- ----------- -----------
| Particle0 | Particle1 | Particle3 | Particle2 |
|  active   |  active   |  active   | inactive  |
 ----------- ----------- ----------- -----------
                                     ^
                                     |
                                 nActive==3
make particle 0 inactive: decrement nActive, then swap particle 3 (in slot 2) and 0.
 ----------- ----------- ----------- -----------
| Particle3 | Particle1 | Particle0 | Particle2 |
|  active   |  active   | inactive  | inactive  |
 ----------- ----------- ----------- -----------
                         ^
                         |
                     nActive==2
The particle order is all jumbled up now, but you don't care about that, right?
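As a minimal sketch of this bookkeeping (the pool type, field names, and helper functions below are hypothetical, not taken from your code), it could look like this:
#include <cstddef>
#include <utility>
#include <vector>

struct Particle
{
    bool isActive = false;
    // position, velocity, etc.
};

// Hypothetical pool that keeps all active particles packed at the front.
struct ParticlePool
{
    std::vector<Particle> particles;
    std::size_t nActive = 0; // particles[0..nActive) are active

    void activate(std::size_t index)   // index of a currently inactive particle
    {
        particles[index].isActive = true;
        std::swap(particles[index], particles[nActive]); // it becomes the last active particle
        ++nActive;
    }

    void deactivate(std::size_t index) // index of a currently active particle
    {
        particles[index].isActive = false;
        --nActive;
        std::swap(particles[index], particles[nActive]); // it becomes the first inactive particle
    }
};
The same idea works if the pool stores particle pointers instead of values; then each swap is just a pointer swap and is even cheaper.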
CodePudding user response:
While user253751's answer is absolutely correct for optimizing your partitioning scheme, it sounds like you will have a larger issue related to the number of particles and cache coherency.
I'm guessing that the data structure being used is something along the lines of
struct Particle
{
    vec2 position;
    vec2 velocity;
    float scale;
    float scaleVelocity;
    float lifeRemaining;
    vec4 color;
    // Other attributes
};

std::vector<Particle> particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles[i].position += particles[i].velocity;
        particles[i].scale += particles[i].scaleVelocity;
        particles[i].lifeRemaining -= delta;
    }
}
This results in each particle's fields being packed one after another (an array-of-structures layout). In a multithreaded environment this can hurt performance, because cores may have to synchronize over the same cache line (assuming each thread processes a separate component instead of a contiguous chunk of particles); either way, a layout that can be vectorized will still be better.
Position | Velocity | Scale | Scale Velocity | Life Remaining | Color |
---|---|---|---|---|---|
position_0 | velocity_0 | scale_0 | scaleVelocity_0 | lifeRemaining_0 | color_0 |
position_1 | velocity_1 | scale_1 | scaleVelocity_1 | lifeRemaining_1 | color_1 |
position_2 | velocity_2 | scale_2 | scaleVelocity_2 | lifeRemaining_2 | color_2 |
position_3 | velocity_3 | scale_3 | scaleVelocity_3 | lifeRemaining_3 | color_3 |
This has poor cache coherency and isn't conducive to vectorization.
In this case you can likely substantially improve performance by switching to a Data Oriented Programming approach similar to
struct ParticleContainer
{
    std::vector<vec2> position;
    std::vector<vec2> velocity;
    std::vector<float> scale;
    std::vector<float> scaleVelocity;
    std::vector<float> lifeRemaining;
    std::vector<vec4> color;

    size_t size() const { return position.size(); }
};

ParticleContainer particles;

void updateParticles(const float delta)
{
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.position[i] += particles.velocity[i];
    }
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.scale[i] += particles.scaleVelocity[i];
    }
    for(size_t i = 0; i < particles.size(); ++i)
    {
        particles.lifeRemaining[i] -= delta;
    }
}
This results in identical components being packed together, allowing you to operate on several of them at a time.
Particle 0 | Particle 1 | Particle 2 | Particle 3 |
---|---|---|---|
position_0 | position_1 | position_2 | position_3 |
velocity_0 | velocity_1 | velocity_2 | velocity_3 |
scale_0 | scale_1 | scale_2 | scale_3 |
scaleVelocity_0 | scaleVelocity_1 | scaleVelocity_2 | scaleVelocity_3 |
lifeRemaining_0 | lifeRemaining_1 | lifeRemaining_2 | lifeRemaining_3 |
color_0 | color_1 | color_2 | color_3 |
This memory layout is much more conducive to vectorization and can drastically improve performance due to cache coherency. It also gives you the option to manually vectorize your code, similar to
#include <immintrin.h> // SSE intrinsics

void updateParticles(const float delta)
{
    // A vec2 of floats is 64 bits, allowing us to pack 2 of them into a 128 bit vector.
    for(size_t i = 0; i + 2 <= particles.size(); i += 2)
    {
        const __m128 positionVector = _mm_loadu_ps(&particles.position[i].x);
        const __m128 velocityVector = _mm_loadu_ps(&particles.velocity[i].x);
        const __m128 newPosition = _mm_add_ps(positionVector, velocityVector);
        _mm_storeu_ps(&particles.position[i].x, newPosition);
    }
    // A float is 32 bits, allowing us to pack 4 of them into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 scaleVector = _mm_loadu_ps(&particles.scale[i]);
        const __m128 scaleVelocityVector = _mm_loadu_ps(&particles.scaleVelocity[i]);
        const __m128 newScale = _mm_add_ps(scaleVector, scaleVelocityVector);
        _mm_storeu_ps(&particles.scale[i], newScale);
    }
    const __m128 deltaVector = _mm_set_ps1(delta);
    // A float is 32 bits, allowing us to pack 4 of them into a 128 bit vector.
    for(size_t i = 0; i + 4 <= particles.size(); i += 4)
    {
        const __m128 lifeRemainingVector = _mm_loadu_ps(&particles.lifeRemaining[i]);
        const __m128 newLifeRemaining = _mm_sub_ps(lifeRemainingVector, deltaVector);
        _mm_storeu_ps(&particles.lifeRemaining[i], newLifeRemaining);
    }
}
This code obviously requires some sanity checks, such as ensuring that each array's size is a multiple of the vector width or handling the leftover elements separately (as sketched below). Additionally, if you can ensure the memory starts on a 16-byte boundary, you can use the aligned load and store functions. You can also use the 256-bit AVX and AVX2 registers if those are available to you.
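As an illustration only (the function name here is made up, not part of the code above), one way to handle an array whose size is not a multiple of the vector width is to finish with a scalar tail loop:
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: subtract delta from every element, 4 floats per SSE iteration,
// then fall back to a scalar loop for the 0-3 leftover elements.
void updateLifeRemaining(std::vector<float>& lifeRemaining, const float delta)
{
    const __m128 deltaVector = _mm_set_ps1(delta);
    std::size_t i = 0;
    for(; i + 4 <= lifeRemaining.size(); i += 4)
    {
        const __m128 v = _mm_loadu_ps(&lifeRemaining[i]);
        _mm_storeu_ps(&lifeRemaining[i], _mm_sub_ps(v, deltaVector));
    }
    for(; i < lifeRemaining.size(); ++i) // scalar tail
    {
        lifeRemaining[i] -= delta;
    }
}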
Memory accesses are the slowest part of modern computers, and doing this ensures that as much data as possible is operated on at a time. It also makes it easier for the CPU to eagerly load upcoming cache lines. You can now use this layout in a multicore environment with substantially better performance and without clobbering cache line usage; the work can either be split across multiple threads on a CPU or easily performed on a GPGPU using something like CUDA, OpenCL, or compute shaders.
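For example, splitting one of the component arrays into contiguous chunks, one per thread, keeps each thread writing to its own region of memory. This is a rough sketch with invented names, not code from the answer above:
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: each thread updates a contiguous, disjoint chunk, so writes
// from different threads only ever meet on a shared cache line at chunk boundaries.
// Assumes threadCount > 0.
void updateLifeRemainingParallel(std::vector<float>& lifeRemaining, const float delta,
                                 const std::size_t threadCount)
{
    const std::size_t chunk = (lifeRemaining.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;

    for(std::size_t t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, lifeRemaining.size());
        if(begin >= end)
            break;

        workers.emplace_back([&lifeRemaining, delta, begin, end]
        {
            for(std::size_t i = begin; i < end; ++i)
                lifeRemaining[i] -= delta;
        });
    }

    for(std::thread& worker : workers)
        worker.join();
}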