OpenMP: copying a vector using 'multithreading'


For a certain coding application I need to copy a vector of big objects, and I want to make this as efficient as possible. The old code is below, together with an attempt to use OpenMP to speed it up.

std::vector<Object> Objects, NewObjects;
Objects.reserve(30);
NewObjects.reserve(30);
// old code
Objects = NewObjects;

// new code
omp_set_num_threads(30);
#pragma omp parallel
{
    Objects[omp_get_thread_num()] = NewObjects[omp_get_thread_num()];
}

Would this give the same result? Or are there issues because I access the vector 'Objects'? I thought it might work since I don't access the same index/object from different threads.

CodePudding user response:

omp_set_num_threads(30) does not guarantee that you obtain 30 threads; you may get fewer, and then your code will not work properly, because only the elements whose index matches an existing thread number would be copied. You have to use a loop and parallelize it with OpenMP:

#pragma omp parallel for
for(size_t i=0;i<NewObjects.size();++i)
{
    Objects[i] = NewObjects[i];
}

Note that it may not be faster than the serial version, because parallel execution has significant overheads.
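
For completeness, a minimal self-contained version of that approach might look like the following (the Object definition here is a made-up stand-in for a big object; note that the destination is constructed with real elements, since reserve alone does not create any):

#include <cstddef>
#include <vector>

// Hypothetical stand-in for a "big" object (~1 MiB of payload).
struct Object
{
    std::vector<char> data = std::vector<char>(1u << 20);
};

int main()
{
    std::vector<Object> NewObjects(30);              // source: 30 real elements
    std::vector<Object> Objects(NewObjects.size());  // destination needs elements, not just capacity

    // Compile with OpenMP enabled (e.g. -fopenmp). OpenMP 3.0+ accepts
    // unsigned loop counters such as size_t.
#pragma omp parallel for
    for (std::size_t i = 0; i < NewObjects.size(); ++i)
    {
        Objects[i] = NewObjects[i];  // each iteration writes a distinct element: no data race
    }
}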

If you use a C++17 compiler, the best option is to use std::copy with a parallel execution policy:

std::copy(std::execution::par, NewObjects.begin(), NewObjects.end(), Objects.begin());
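
A complete compilable sketch of that approach could look like this (the names are illustrative; with GCC or Clang and libstdc++, the parallel algorithms additionally need TBB at link time, e.g. -ltbb):

#include <algorithm>
#include <execution>
#include <vector>

// Hypothetical stand-in for a "big" object.
struct Object
{
    std::vector<char> data = std::vector<char>(1u << 20);
};

int main()
{
    std::vector<Object> NewObjects(30);
    std::vector<Object> Objects(NewObjects.size());  // destination must already hold elements

    // The standard library decides how (and whether) to parallelize the copy.
    std::copy(std::execution::par,
              NewObjects.begin(), NewObjects.end(),
              Objects.begin());
}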

CodePudding user response:

I created a benchmark to see how fast my test machine copies objects:

#include <benchmark/benchmark.h>
#include <omp.h>
#include <vector>

constexpr int operator "" _MB(unsigned long long v) { return v * 1024 * 1024; }

class CopyableBigObject
{
public:
    CopyableBigObject(const size_t s) : vec(s) {}
    CopyableBigObject(const CopyableBigObject& other) = default;
    CopyableBigObject(CopyableBigObject&& other) = delete;
    ~CopyableBigObject() = default;

    CopyableBigObject& operator =(const CopyableBigObject&) = default;
    CopyableBigObject& operator =(CopyableBigObject&&) = delete;

    char& operator [](const int index) { return vec[index]; }
    size_t size() const { return vec.size(); }

private:
    std::vector<char> vec;
};

// Force some work on the objects so they are not optimized away
int calculated_value(std::vector<CopyableBigObject>& vec)
{
    int sum = 0;

    for (int x = 0; x < vec.size(); ++x)
    {
        for (int index = 0; index < vec[x].size(); index += 100)
        {
            sum += vec[x][index];
        }
    }

    return sum;
}

static void BM_copy_big_objects(benchmark::State& state)
{
    const size_t number_of_objects = state.range(0);
    const size_t data_size = state.range(1);

    for (auto _ : state)
    {
        std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
        std::vector<CopyableBigObject> dest;

        state.counters["src"] = calculated_value(src);
        dest = src;
        state.counters["dest"] = calculated_value(dest);
    }
}

static void BM_copy_big_objects_in_parallel(benchmark::State& state)
{
    const size_t number_of_objects = state.range(0);
    const size_t data_size = state.range(1);
    const int number_of_threads = state.range(2);

    for (auto _ : state)
    {
        std::vector<CopyableBigObject> src{ number_of_objects, CopyableBigObject(data_size) };
        std::vector<CopyableBigObject> dest{ number_of_objects, CopyableBigObject(0) };

        state.counters["src"] = calculated_value(src);

#pragma omp parallel num_threads(number_of_threads)
        {
            if (omp_get_thread_num() == 0)
            {
                state.counters["number_of_threads"] = omp_get_num_threads();
            }

#pragma omp for
            for (int x = 0; x < src.size(); ++x)
            {
                dest[x] = src[x];
            }
        }

        state.counters["dest"] = calculated_value(dest);
    }
}

BENCHMARK(BM_copy_big_objects)
    ->Unit(benchmark::kMillisecond)
    ->Args({   30, 16_MB })
    ->Args({ 1000,  1_MB })
    ->Args({  100,  8_MB });

BENCHMARK(BM_copy_big_objects_in_parallel)
    ->Unit(benchmark::kMillisecond)
    ->Args({ 100, 1_MB, 1 })
    ->Args({ 100, 8_MB, 1 })
    ->Args({ 800, 1_MB, 1 })
    ->Args({ 100, 8_MB, 2 })
    ->Args({ 100, 8_MB, 4 })
    ->Args({ 100, 8_MB, 8 });

BENCHMARK_MAIN();

These are the results I got on my test machine, an old Xeon workstation:

Run on (4 X 2394 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 4096 KiB (x4)
  L3 Unified 16384 KiB (x1)
Load Average: 0.25, 0.14, 0.10
--------------------------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------
BM_copy_big_objects/30/16777216                     30.9 ms         30.5 ms           24 dest=0 src=0
BM_copy_big_objects/1000/1048576                   0.352 ms        0.349 ms         1987 dest=0 src=0
BM_copy_big_objects/100/8388608                     4.62 ms         4.57 ms          155 dest=0 src=0
BM_copy_big_objects_in_parallel/100/1048576/1      0.359 ms        0.355 ms         2028 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/1       4.67 ms         4.61 ms          151 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/800/1048576/1      0.357 ms        0.353 ms         1983 dest=0 number_of_threads=1 src=0
BM_copy_big_objects_in_parallel/100/8388608/2       5.29 ms         5.23 ms          132 dest=0 number_of_threads=2 src=0
BM_copy_big_objects_in_parallel/100/8388608/4       5.32 ms         5.25 ms          133 dest=0 number_of_threads=4 src=0
BM_copy_big_objects_in_parallel/100/8388608/8       5.57 ms         3.98 ms          175 dest=0 number_of_threads=8 src=0

As I expected, parallelizing copying does not improve performance. However, copying large objects is slower than I expected.

Given you stated that you use C++14, there are a number of things you can try that could improve performance:

  1. Move the objects with the move-constructor / move-assignment pair, or hold them through unique_ptr, instead of copying (see the sketch after this list).
  2. Defer copying member variables until you really need to by using copy-on-write (also sketched below).
    1. This makes copying cheap until you have to update a big object.
    2. If a large proportion of your objects are never updated after being copied, you should get a performance boost.
  3. Make sure your class definitions use the most compact representation. I have seen classes end up with different sizes in release and debug builds because the compiler used padding in the release build but not the debug build.
  4. If possible, rewrite the code so copying is avoided altogether.
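
To illustrate points 1 and 2, here are two rough sketches; the class and member names are made up for the example, and the copy-on-write version is deliberately simplified.

Point 1, transferring ownership instead of copying (using the variable names from the question):

// NewObjects is left in a valid but unspecified (typically empty) state,
// so only do this if the source is no longer needed. std::move is in <utility>.
Objects = std::move(NewObjects);

Point 2, a simplified copy-on-write wrapper. Copying a CowObject only copies a shared_ptr; the underlying buffer is cloned lazily, on the first write:

#include <cstddef>
#include <memory>
#include <vector>

class CowObject
{
public:
    explicit CowObject(std::size_t n)
        : data_(std::make_shared<std::vector<char>>(n)) {}

    // Reading never copies the underlying buffer.
    char read(std::size_t i) const { return (*data_)[i]; }

    // Writing clones the buffer first if it is still shared with other copies.
    // Note: this use_count() check is not safe if several threads copy and
    // write the same object concurrently.
    void write(std::size_t i, char value)
    {
        if (data_.use_count() > 1)
        {
            data_ = std::make_shared<std::vector<char>>(*data_);  // deep copy on first write
        }
        (*data_)[i] = value;
    }

private:
    std::shared_ptr<std::vector<char>> data_;
};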

Without knowing the specific details of your objects, it is not possible to give a definitive answer, but these points should help you towards a full solution.
