Why doesn't my OpenMP program scale with number of threads?


I wrote a program that calculates the sum of an array of 1M numbers, where every element equals 1, using OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:

#include <iostream>
#include <omp.h>
#define SIZE 1000000
#define N_THREADS 4
using namespace std;



int main() {

    int* arr = new int[SIZE];
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(N_THREADS);
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < SIZE; i++) {
            sum += arr[i];
        }

    }

    double t2 = omp_get_wtime();

    cout << "n_threads " << n_threads << endl;
    cout << "time " << (t2 - t1)*1000 << endl;
    cout << sum << endl;

    
}

The run time (in milliseconds) for different values of N_THREADS is as follows:

n_threads 1
time 3.6718

n_threads 2
time 2.5308

n_threads 3
time 3.4383

n_threads 4
time 3.7427

n_threads 5
time 2.4621

I used schedule(static, 16) so that each thread works on chunks of 16 iterations, to avoid the false sharing problem. I thought the performance issue was related to false sharing, but now I think it isn't. What could the problem be?
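For context, the false-sharing pattern that schedule(static, 16) is meant to avoid would look roughly like the sketch below, where per-thread partial sums sit next to each other on the same cache line (the function and array names are illustrative only, not part of the program above):

#include <omp.h>

// Illustrative only: adjacent partial[t] slots share a cache line, so every
// thread's update invalidates the other threads' cached copies of that line.
long long sum_with_false_sharing(const int* arr, int size, int n_threads) {
    long long partial[64] = {0};               // one tightly packed slot per thread
    #pragma omp parallel num_threads(n_threads)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < size; i++)
            partial[t] += arr[i];              // neighbouring slots ping-pong between cores
    }
    long long total = 0;
    for (int t = 0; t < n_threads; t++)
        total += partial[t];
    return total;
}

The reduction(+:sum) clause in the actual code already sidesteps this, since each thread accumulates into its own private copy of sum.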

CodePudding user response:

Your code is memory-bound, not compute-bound. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), so it is not expected to scale well with the number of threads.
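As a quick sanity check, you can convert the measured time into an effective bandwidth. A minimal sketch using the single-thread numbers from the update below (assuming one 4-byte write in the fill loop and one 4-byte read in the sum loop per element):

#include <cstdio>

int main() {
    // Rough numbers from the updated benchmark: 10^9 int elements,
    // ~650 ms with a single thread; 8 bytes of memory traffic per element.
    const double elements = 1e9;
    const double seconds  = 0.65;
    double bytes = elements * 2 * sizeof(int);
    std::printf("effective bandwidth: %.1f GB/s\n", bytes / seconds / 1e9);  // ~12 GB/s
    return 0;
}

A figure in that range is already close to what a single memory channel can sustain, which is why adding threads helps only a little.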

UPDATE: I ran this code with a 1000x bigger SIZE (i.e. #define SIZE 1000000000), compiled with g++ -fopenmp -O3 -mavx2.

Here are the results; it still scales badly with the number of threads:

n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000

n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000

n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000

n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000

n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000

CodePudding user response:

Five threads contending for the same accumulator in the reduction, or a chunk size of only 16, must be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.
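A coarser split per thread would be, for example, the default static schedule, which gives each thread one big contiguous block instead of interleaved 16-iteration chunks. A minimal sketch of the same two loops (the function name fill_and_sum is just for illustration):

#include <omp.h>

// Same loops as in the question, but with the default static schedule:
// each thread gets one contiguous block of about size / n_threads iterations.
long long fill_and_sum(int* arr, int size) {
    long long sum = 0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < size; i++)
            arr[i] = 1;

        #pragma omp for schedule(static) reduction(+:sum)
        for (int i = 0; i < size; i++)
            sum += arr[i];
    }
    return sum;
}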

Maybe more importantly, you should repeat the benchmark multiple times programmatically, both to get an average and to warm up the CPU caches and let the cores ramp up to higher frequencies, so the measurement is more reliable.
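A minimal sketch of such a repeated measurement, assuming a fill_and_sum helper like the one sketched above (the warm-up and repeat counts are arbitrary choices):

#include <cstdio>
#include <omp.h>

long long fill_and_sum(int* arr, int size);    // e.g. the parallel fill + sum sketched above

// Run a few untimed warm-up passes first, then report the average of the timed runs.
void benchmark(int* arr, int size) {
    const int warmup = 3, repeats = 10;        // arbitrary counts, for illustration only
    double total_ms = 0.0;
    for (int r = 0; r < warmup + repeats; r++) {
        double t1 = omp_get_wtime();
        volatile long long sum = fill_and_sum(arr, size);  // volatile: keep the call from being optimized out
        (void)sum;
        double t2 = omp_get_wtime();
        if (r >= warmup)                       // discard warm-up runs (cold caches, low clock frequencies)
            total_ms += (t2 - t1) * 1000.0;
    }
    std::printf("avg time %.3f ms over %d runs\n", total_ms / repeats, repeats);
}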

The benchmark results suggest about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements in roughly 4 seconds looks like locking contention or a benchmark that hasn't warmed up. Normally even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with -O3 optimization?
