For a certain coding application I need to copy a vector of big objects, and I want to make the copy more efficient. The old code is below, along with my attempt to use OpenMP to speed it up.
std::vector<Object> Objects, NewObjects;
Objects.reserve(30);
NewObjects.reserve(30);
// old code
Objects = NewObjects;
// new code
omp_set_num_threads(30);
#pragma omp parallel
{
Objects[omp_get_thread_num()] = NewObjects[omp_get_thread_num()];
}
Would this give the same result? Or are there issues because multiple threads access the vector `Objects`? I thought it might work since no two threads access the same index/object.
CodePudding user response:
omp_set_num_threads(30)
does not guarantee that you obtain 30 threads; you may get fewer, and then your code will not work properly. You have to use a loop and parallelize it with OpenMP (note also that Objects must be resized to NewObjects.size() beforehand: reserve only allocates capacity, it does not create elements):
#pragma omp parallel for
for (size_t i = 0; i < NewObjects.size(); ++i)
{
Objects[i] = NewObjects[i];
}
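As an aside, you can check inside the parallel region how many threads you were actually granted; a minimal sketch:
#include <omp.h>
#include <cstdio>

int main()
{
omp_set_num_threads(30); // a request, not a guarantee
#pragma omp parallel
{
#pragma omp single
printf("threads granted: %d\n", omp_get_num_threads());
}
}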
Note that it may not be faster than the serial version, because parallel execution has significant overheads.
If you use a C++17 compiler, the best idea is to use std::copy
with a parallel execution policy (this requires #include <algorithm> and #include <execution>):
std::copy(std::execution::par, NewObjects.begin(), NewObjects.end(), Objects.begin());
CodePudding user response:
I created a benchmark to see how fast my test machine copies objects:
#include <benchmark/benchmark.h>
#include <omp.h>
#include <vector>
constexpr int operator "" _MB(unsigned long long v) { return v * 1024 * 1024; }
class CopyableBigObject
{
public:
CopyableBigObject(const size_t s) : vec(s) {}
CopyableBigObject(const CopyableBigObject& other) = default;
CopyableBigObject(CopyableBigObject&& other) = delete;
~CopyableBigObject() = default;
CopyableBigObject& operator =(const CopyableBigObject&) = default;
CopyableBigObject& operator =(CopyableBigObject&&) = delete;
char& operator [](const int index) { return vec[index]; }
size_t size() const { return vec.size(); }
private:
std::vector<char> vec;
};
// Force some work on the objects so they are not optimized away
int calculated_value(std::vector<CopyableBigObject>& vec)
{
int sum = 0;
for (size_t x = 0; x < vec.size(); ++x)
{
for (size_t index = 0; index < vec[x].size(); index += 100)
{
sum += vec[x][index];
}
}
}
return sum;
}
static void BM_copy_big_objects(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
for (auto _ : state)
{
std::vector<CopyableBigObject> src(number_of_objects, CopyableBigObject(data_size)); // parentheses: braces would select the initializer_list constructor and create only two elements
std::vector<CopyableBigObject> dest;
state.counters["src"] = calculated_value(src);
dest = src;
state.counters["dest"] = calculated_value(dest);
}
}
static void BM_copy_big_objects_in_parallel(benchmark::State& state)
{
const size_t number_of_objects = state.range(0);
const size_t data_size = state.range(1);
const int number_of_threads = state.range(2);
for (auto _ : state)
{
std::vector<CopyableBigObject> src(number_of_objects, CopyableBigObject(data_size));
std::vector<CopyableBigObject> dest(number_of_objects, CopyableBigObject(0));
state.counters["src"] = calculated_value(src);
#pragma omp parallel num_threads(number_of_threads)
{
if (omp_get_thread_num() == 0)
{
state.counters["number_of_threads"] = omp_get_num_threads();
}
#pragma omp for
for (int x = 0; x < static_cast<int>(src.size()); ++x)
{
dest[x] = src[x];
}
}
state.counters["dest"] = calculated_value(dest);
}
}
BENCHMARK(BM_copy_big_objects)
->Unit(benchmark::kMillisecond)
->Args({ 30, 16_MB })
->Args({ 1000, 1_MB })
->Args({ 100, 8_MB });
BENCHMARK(BM_copy_big_objects_in_parallel)
->Unit(benchmark::kMillisecond)
->Args({ 100, 1_MB, 1 })
->Args({ 100, 8_MB, 1 })
->Args({ 800, 1_MB, 1 })
->Args({ 100, 8_MB, 2 })
->Args({ 100, 8_MB, 4 })
->Args({ 100, 8_MB, 8 });
BENCHMARK_MAIN();
These are results I got on my test machine, an old Xeon workstation:
Run on (4 X 2394 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 4096 KiB (x4)
L3 Unified 16384 KiB (x1)
Load Average: 0.25, 0.14, 0.10
--------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
BM_copy_big_objects/30/16777216 30.9 ms 30.5 ms 24 dest=0 src=0
BM_copy_big_objects/1000/1048576 0.352 ms 0.349 ms 1987 dest=0 src=0
BM_copy_big_objects/100/8388608 4.62 ms 4.57 ms 155 dest=0 src=0
BM_copy_big_objects_in_parallel/100/1048576/1 0.359 ms 0.355 ms 2028 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/1 4.67 ms 4.61 ms 151 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/800/1048576/1 0.357 ms 0.353 ms 1983 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/2 5.29 ms 5.23 ms 132 dest=0 number_of_threads=2 src=0
BM_copy_big_objects_in_parallel/100/8388608/4 5.32 ms 5.25 ms 133 dest=0 number_of_threads=4 src=0
BM_copy_big_objects_in_parallel/100/8388608/8 5.57 ms 3.98 ms 175 dest=0 number_of_threads=8 src=0
As I expected, parallelizing copying does not improve performance. However, copying large objects is slower than I expected.
Given that you stated you use C++14, there are a number of things you can try which could improve performance:
- Move the objects using the move-constructor / move-assignment combination, or hold them through unique_ptr, instead of copying (see the first sketch after this list).
- Defer copying member variables until you really need them by using copy-on-write (a sketch follows the list).
  - This makes copying cheap until you have to update a big object.
  - If a large proportion of your objects are never updated after they have been copied, you should get a performance boost.
- Make sure your class definitions use the most compact representation (a small illustration follows the list). I have seen classes be different sizes in release and debug builds because the compiler was padding the release build but not the debug build.
- Possibly rewrite your code so that copying is avoided altogether.
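For the first point, a minimal sketch, assuming Object is your class and the source vector may be left empty afterwards:
#include <utility>
#include <vector>

std::vector<Object> Objects, NewObjects;
// ... fill NewObjects ...
Objects = std::move(NewObjects); // O(1): steals the internal buffer instead of copying every element
// NewObjects is now in a valid but unspecified (typically empty) state
This only helps if you no longer need the source; if both vectors must stay usable afterwards, moving is not an option.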
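For copy-on-write, a bare-bones wrapper could look like the sketch below (CowPtr is a hypothetical name, and this version is not thread-safe):
#include <memory>

template <typename T>
class CowPtr
{
public:
explicit CowPtr(T value) : data_(std::make_shared<T>(std::move(value))) {}
const T& read() const { return *data_; } // cheap: all copies share the payload
T& write() // deep-copies only if the payload is currently shared
{
if (data_.use_count() > 1)
data_ = std::make_shared<T>(*data_);
return *data_;
}
private:
std::shared_ptr<T> data_;
};
Copying a CowPtr costs one shared_ptr copy; the expensive deep copy happens only on the first write() to a shared object.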
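And for the representation point, member order affects padding; a small illustration (exact sizes are implementation-defined, typical x86-64 values shown):
#include <iostream>

struct Padded { char a; double b; char c; }; // typically 24 bytes: padding after a and after c
struct Packed { double b; char a; char c; }; // typically 16 bytes: the chars share one padded tail

int main()
{
std::cout << sizeof(Padded) << ' ' << sizeof(Packed) << '\n';
}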
Without knowing the specific details of your objects it is not possible to give a definitive answer, but the points and sketches above should lead you to a full solution.