OpenMP parallelized C code exhibits higher execution times on more threads?


I'm trying to learn parallelization of C code using OpenMP, working from the following example. But for some reason, when I increase the number of threads, the code runs slower. I'm compiling it with the -fopenmp flag. It would be nice if I could get your expert opinion.

#include <omp.h>
#include <iostream>
static long num_steps =100000000; 
#define NUM_THREADS 4

double  step;
int main(){
    int i,nthreads;
    double pi, sum[NUM_THREADS]; // should be shared : hence promoted scalar sum into an array
   
    step  = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    //if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
                                 // ie the environment can give you a different number of threads
                                 // than  requested

    for(i=id, sum[id] = 0.0; i<num_steps; i=i+nthrds){
        x = (i+0.5)*step;
        sum[id] += 4.0/(1.0+x*x);
    }
    }

    double t2 = omp_get_wtime();

    std::cout << "Time : " ;

    double ms_double = t2 - t1;

    std::cout << ms_double << "ms\n";

    for(i=0, pi=0.0; i < nthreads; i++){
        pi += sum[i]*step;
    }
}    

CodePudding user response:

Minor complaints aside, your big problem is the loop update i = i + nthrds. With that stride, each cache line of the sum array is written by all 4 of your threads. This is called false sharing, and it is really bad for performance: you want each cache line to live exclusively in one core. (Btw, use the OMP_NUM_THREADS environment variable to set the number of threads; do not hardcode it.)
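
To illustrate one common fix (a minimal sketch, not taken from the question's code): give each thread a private accumulator and write to shared data only once per thread, so the hot loop never writes a shared cache line.

#include <omp.h>
#include <iostream>

static long num_steps = 100000000;

int main() {
    const double step = 1.0 / (double)num_steps;
    double pi = 0.0;

    // Run with e.g. OMP_NUM_THREADS=4 instead of hardcoding a count.
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();

        // Thread-private accumulator: the hot loop writes no shared
        // cache line, so there is nothing to falsely share.
        double partial = 0.0;
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            partial += 4.0 / (1.0 + x * x);
        }

        // One synchronized update per thread instead of one write
        // to the shared array per iteration.
        #pragma omp atomic
        pi += partial * step;
    }

    std::cout << "pi = " << pi << "\n";
}

The interleaved reads of i are harmless; it was the per-iteration writes to sum[id] that made the cache line ping-pong between cores.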

CodePudding user response:

The main advantage of OpenMP is that you do not have to do the reduction manually. You just have to add an extra line to the serial code. So your code should look something like this (which is free of false sharing):

double sum=0;
#pragma omp parallel for reduction(+:sum)
    for(unsigned long i=0; i<num_steps; ++i){
        const double x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
double pi = sum*step;

Note that your code had an uninitialized variable (pi), and it did not properly handle the case where you get fewer threads than requested.
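
As a sketch of that second point (a hypothetical snippet, not part of the answer's code): query the team size inside the parallel region and loop over that value afterwards, since omp_set_num_threads() is only a request.

#include <omp.h>
#include <cstdio>

int main() {
    int nthreads = 0;

    omp_set_num_threads(4);   // this is a request, not a guarantee
    #pragma omp parallel
    {
        // One thread records how many threads the team actually got;
        // later loops must run over this value, never over the
        // requested count.
        #pragma omp single
        nthreads = omp_get_num_threads();
    }

    std::printf("requested 4, got %d threads\n", nthreads);
}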

CodePudding user response:

What @Victor Ejkhout called "minor complaints" might not be so minor. It is only normal that using a new API (OpenMP) for the first time can be confusing, and more often than not that is reflected in the coding style of the application code as well. But especially in such cases, special attention should be paid to readability.

The code below is a "prettied-up" version of your attempt. Next to the OpenMP-parallel integration it also contains a single-threaded and a multi-threaded (std::thread) version, so you can compare them to each other.

#include <omp.h>
#include <iostream>
#include <thread>

constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???

// the function we want to integrate
double f(double x) {
  return 4.0 / (1.0 + x * x);
}

// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
  double sum = 0.0;
  for (double x = left; x < right; x += step) {
    sum += f(x);
  }
  return sum;
}

double integrate_single_threaded(double left, double right, double step) {
  return sum_interval(left, right, step) / (right - left);
}

double integrate_multi_threaded(double left, double right, double step) {
  double sums[MAX_PARALLEL_THREADS];
  std::thread threads[MAX_PARALLEL_THREADS];
  for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
    threads[i] = std::thread( [&sums,left,right,step,i] () {
      double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
      double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
      sums[i] = sum_interval(ileft,iright,step);
    });
  }
  double total_sum = 0.0;
  for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
    threads[i].join();
    total_sum += sums[i];
  }
  return total_sum / (right - left);
}

double integrate_parallel(double left, double right, double step) {
  double sums[MAX_PARALLEL_THREADS];
  int thread_count = 0;

  omp_set_num_threads(MAX_PARALLEL_THREADS);
  #pragma omp parallel
  {
    thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
    int interval_index = omp_get_thread_num();
    double ileft = left + (right - left) / thread_count * interval_index;
    double iright = left + (right - left) / thread_count * (interval_index + 1);
    sums[interval_index] = sum_interval(ileft,iright,step);
  }
  
  double total_sum = 0.0;
  for (int i = 0; i < thread_count; i++) {
    total_sum += sums[i];
  }
  return total_sum / (right - left);
}

int main (int argc, const char* argv[]) {
  double left = -1.0;
  double right = 1.0;
  double step = 1.0E-9;

  // run single threaded calculation
  std::cout << "single" << std::endl;
  double tstart = omp_get_wtime();
  double i_single = integrate_single_threaded(left, right, step);
  double tend = omp_get_wtime();
  double st_time = tend - tstart;

  // run multi threaded calculation
  std::cout << "multi" << std::endl;
  tstart = omp_get_wtime();
  double i_multi = integrate_multi_threaded(left, right, step);
  tend = omp_get_wtime();
  double mt_time = tend - tstart;

  // run omp calculation
  std::cout << "omp" << std::endl;
  tstart = omp_get_wtime();
  double i_omp = integrate_parallel(left, right, step);
  tend = omp_get_wtime();
  double omp_time = tend - tstart;
  
  std::cout
    << "i_single: " << i_single
    << " st_time: " << st_time << std::endl
    << "i_multi: " << i_multi
    << " mt_time: " << mt_time << std::endl
    << "i_omp: " << i_omp
    << " omp_time: " << omp_time << std::endl;
  
  return 0;
}

When I compile this on my Debian with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:

single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593

So, at least my conclusion is that it is not worth the effort to learn OpenMP, given that the (more generally usable) std::thread version looks just as nice and performs at least equally well.

I do not really trust the computed integral result in either case, though. But I did not focus on that. All three versions produce the same value, and that is the important part.
