#include <cstddef>
#include <vector>
#include <omp.h>

using std::vector;

void reserve1(vector<vector<int>>& vec, size_t siz) {
    for (auto& v : vec) {
        v.reserve(siz);
    }
}

void reserve2(vector<vector<int>>& vec, size_t siz) {
    #pragma omp parallel default(shared) num_threads(vec.size())
    {
        // Each thread reserves its own sub-vector.
        vec[omp_get_thread_num()].reserve(siz);
    }
}

void doSomething(vector<int>& v) {
    // Do something write (and read) heavy on v.
}

int main() {
    // omp_get_num_threads() returns 1 outside a parallel region;
    // omp_get_max_threads() gives one sub-vector per available thread.
    vector<vector<int>> data(omp_get_max_threads());
    size_t theSize = 1000000;
    reserve1(data, theSize); // OR reserve2(data, theSize); <===================================
    // Do some stuff here (not parallel)
    #pragma omp parallel default(shared) num_threads(data.size())
    {
        doSomething(data[omp_get_thread_num()]);
    }
    return 0;
}
In the code above, is it better to use reserve2 rather than reserve1? Here are some pros and cons I could think of (I am not sure if they are correct):
Pros:
1- Allocation is done in parallel (and with an appropriate version of malloc it might be faster)
2- Each thread is working with data allocated by itself, which would give faster access (I have heard this from a colleague, but have no idea if that is correct).
Con:
1- Creating a parallel region has an overhead that might affect the performance.
Here are my questions and I would appreciate your help:
a) Are these pros and cons correct or incorrect?
b) Are there other benefits/disadvantages in using parallel vs serial allocation here?
c) Do "number of threads" or "theSize" affect the pros/cons?
Edit:
Some clarification:
1- data always has size equal to the given number of threads.
2- the sizes of the data[i] vectors are not known in advance, but there is an estimate (the upper bound is very large, almost nthreads*estimate, and the lower bound is 0).
3- doSomething is the method resizing data[i] (using e.g. push_back). It runs in parallel, and the elements it collects come from code I cannot change much.
4- In reality, data is a vector<vector<someObject>>, where someObject is a small struct with multiple fields.
5- I can use something other than a vector, but I should be able to add elements one by one, and I do not know the exact final size, as mentioned before.
6- The code in main() is going to run thousands of times.
The main question for me is this: if thread i is working (reading/writing) on data[i] and data[i] only, would it affect the performance whether data[i] is allocated by thread i or by a different thread (e.g. the main thread in a non-parallel region)?
CodePudding user response:
Allocation is done in parallel (and with an appropriate version of malloc it might be faster)
Each thread is working with data allocated by itself, which would give faster access (I have heard this from a colleague, but have no idea if that is correct).
Yes, but most allocators do not scale well in parallel. The less scalable ones use a centralized allocation policy protected by a giant lock, which results in slower execution due to the overhead of the lock (and an effect called cache-line bouncing). The best allocators try to perform thread-local allocation when possible, but they often need to update some atomic counters, and atomic operations do not scale well. Jemalloc, for example, is known to scale quite well; it uses an arena-based allocation policy that benefits from thread-local allocations. The (relatively recent update of the) default glibc allocator should also benefit from thread-local allocations. AFAIK, the Windows allocator does not scale well.
Creating a parallel region has an overhead that might affect the performance.
Yes. In fact, malloc is generally pretty fast, since it generally makes only a few system calls (see sbrk or mmap) to get blocks of memory that are then split up and tracked. Allocations on mainstream x86-64 processors take no more than a few dozen nanoseconds. The overhead of initializing the OpenMP fork-join section and distributing the work is generally about a few microseconds (this is very dependent on the runtime and the scheduling policy, though). As a result, you need to perform a lot of allocations, with a scalable allocator, for the parallelization to be useful. In the end, the benefit of doing that should be negligible compared to the computation time (and it can cause more harm than good). If your computation is malloc-bound, then this is really bad and you should redesign the code so as to reduce allocations in the first place.
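To make these orders of magnitude concrete, here is a rough timing sketch (not a rigorous benchmark; the 4 KiB size and iteration count are arbitrary choices) comparing the fork-join overhead of an empty parallel region to the cost of a small malloc/free pair:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    const int iters = 10000;
    volatile char sink = 0; // keeps the allocations from being optimized away

    // Cost of an empty fork-join region: measures only the OpenMP overhead.
    double t0 = omp_get_wtime();
    for (int i = 0; i < iters; ++i) {
        #pragma omp parallel
        { }
    }
    double t1 = omp_get_wtime();

    // Cost of a plain allocation/deallocation pair. Note that very large
    // allocations served directly by mmap would be noticeably slower.
    double t2 = omp_get_wtime();
    for (int i = 0; i < iters; ++i) {
        char* p = static_cast<char*>(std::malloc(4096));
        p[0] = static_cast<char>(i); // touch the block
        sink = p[0];
        std::free(p);
    }
    double t3 = omp_get_wtime();

    std::printf("empty parallel region: %.0f ns\n", (t1 - t0) / iters * 1e9);
    std::printf("malloc + free (4 KiB): %.0f ns\n", (t3 - t2) / iters * 1e9);
    return 0;
}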
a) Are these pros and cons correct or incorrect?
Overall, yes.
Are there other benefits/disadvantages in using parallel vs serial allocation here?
One important point to consider is false sharing: if two variables stored in the same cache line are modified by two different threads, the program can be slowed down by a cache-line bouncing effect due to cache coherence. As a result, a vector<vector<int>> can be an issue if the properties of each sub-vector (like its size) are modified in parallel by multiple threads, since these attributes may be stored on the same cache line as those of a few other sub-vectors. Thread-local allocations can help to reduce this effect if the allocator permits it. This is also true for the integers in the vectors, but the effect should be negligible assuming the vectors are big enough.
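To see why the sub-vector headers can end up on the same cache line, here is a minimal sketch (assuming 64-byte cache lines and a typical 24-byte vector header, as on x86-64 with GCC/Clang) that prints which cache line each header of the outer vector lands on:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::vector<int>> data(4);
    // The headers (pointer triples) of the sub-vectors are stored
    // contiguously inside the outer vector's buffer, so two or three
    // of them typically share one 64-byte cache line.
    std::printf("sizeof(std::vector<int>) = %zu bytes\n", sizeof(std::vector<int>));
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&data[i]);
        std::printf("data[%zu] header at %p, cache line %zu\n",
                    i, static_cast<void*>(&data[i]),
                    static_cast<std::size_t>(addr / 64));
    }
    return 0;
}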
Do "number of threads" or "theSize" affect the pros/cons?
Yes. The overhead of the OpenMP section will be bigger with more threads, and the allocator will scale less well with more threads. As a result, the benefit of using this strategy on many-core architectures is small.
Actually, vector<vector<int>> is known to be a pretty inefficient data structure due to the number of allocations performed, and also due to the memory fragmentation/diffusion caused by the allocations of the sub-vectors. If you want to deal with matrices, then please do not do that and use a flattened 2D array. If you deal with jagged arrays of fixed size, then please consider using a flattened array that is the concatenation of all the jagged arrays, with additional arrays storing the start/end of each slice, as sketched below. This is far more cache-friendly since the arrays are stored contiguously in memory, and it also drastically reduces the number of allocations. This structure is not a good fit, though, if the jagged arrays are (quite frequently) dynamically resized.
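A minimal sketch of that flattened layout (names are hypothetical; it assumes the per-row sizes are known before building, which is exactly the caveat about frequent resizing):

#include <cstddef>
#include <vector>

// All rows concatenated into one buffer, plus an offsets array:
// row i occupies values[offsets[i] .. offsets[i+1]).
struct FlatJagged {
    std::vector<int>         values;
    std::vector<std::size_t> offsets;

    std::size_t rows() const { return offsets.size() - 1; }
    int*        row_begin(std::size_t i) { return values.data() + offsets[i]; }
    int*        row_end(std::size_t i)   { return values.data() + offsets[i + 1]; }
};

// Build from known per-row sizes: a single allocation holds all the data.
FlatJagged make_flat(const std::vector<std::size_t>& row_sizes) {
    FlatJagged fj;
    fj.offsets.resize(row_sizes.size() + 1, 0);
    for (std::size_t i = 0; i < row_sizes.size(); ++i)
        fj.offsets[i + 1] = fj.offsets[i] + row_sizes[i];
    fj.values.resize(fj.offsets.back());
    return fj;
}

int main() {
    FlatJagged fj = make_flat({3, 0, 5}); // three rows of sizes 3, 0 and 5
    fj.row_begin(0)[0] = 42;              // write into row 0
    return 0;
}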
Update in response to the edit:
The main question for me is this: if thread i is working (reading/writing) on data[i] and data[i] only, would it affect the performance whether data[i] is allocated by thread i or by a different thread (e.g. the main thread in a non-parallel region)?
Yes, the ith thread will certainly impact the performance of the others (typically the (i-1)th and/or the (i+1)th). The main issue is the frequent change of the vector length, which should cause false sharing. This depends on the size of a vector on the target platform, but it is usually 24 bytes (at least for GCC, Clang and ICC), so a few of them can be stored on the same cache line (64 bytes on x86-64 platforms). In addition, doing many push_back calls puts the allocator under heavy pressure (how much depends on the allocator, see the above explanation). A few large-vector resizes are generally enough to saturate the memory bandwidth (though this depends on the actual platform, the vector size, and the proportion of resizes compared to computation). In practice, one has to benchmark this to be sure.
You can avoid false sharing with some padding (see the sketch below). I also advise you to use a better-suited data structure that reduces the memory bandwidth usage, like a std::deque (or even a simple vector of raw buffers). But this depends on the actual operation computed. Sometimes it is faster to compute a problem in multiple passes (e.g. one pass to compute the size of the items and one to actually perform the operation). But again, this depends on the actual problem solved.
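A minimal sketch of the padding idea, assuming 64-byte cache lines (C++17 is needed for std::vector to honor the over-alignment when allocating; std::hardware_destructive_interference_size could replace the hard-coded 64 where available):

#include <vector>

// Wrapping each per-thread vector in a cache-line-aligned struct makes
// sizeof(PaddedVec) a multiple of 64, so neighbouring headers can no
// longer share a cache line.
struct alignas(64) PaddedVec {
    std::vector<int> v;
};

static_assert(sizeof(PaddedVec) % 64 == 0, "padded to whole cache lines");

int main() {
    std::vector<PaddedVec> data(8); // one entry per thread
    // Resizing data[i].v now only touches data[i]'s own cache line(s).
    data[0].v.push_back(42);
    return 0;
}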
CodePudding user response:
First of all, you should make your data structures reflect your application. In most applications, it is more natural to talk about one long array than to split it up into separate arrays. So: is your application best formulated as vector<vector<int>>, or are you using this only for perceived efficiency? If the latter: don't.
Regarding your pro: yes, it is good if every thread always works on the same segment of data. Cache, NUMA, page considerations and all that. But most of the time you can force that by using schedule(static) for your for loops (which is the default anyway) and setting OMP_PROC_BIND=true in the environment, so that threads stay tied to a fixed core.
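For instance, a minimal sketch of that setup (run with OMP_PROC_BIND=true set in the environment so threads stay pinned):

#include <vector>
#include <omp.h>

int main() {
    std::vector<int> data(1 << 20);

    // schedule(static) splits the iteration space into fixed contiguous
    // chunks, so each thread touches the same slice of data on every
    // identically shaped loop: good for cache and NUMA locality.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        data[i] = static_cast<int>(i);
    }
    return 0;
}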
Regarding your con: the overhead of a parallel region is almost nothing. Only if you create thousands of regions with very little work inside them do you have to worry about this.