openMp basic and different outputs-CodePudding

I just started learning openMP and and, as most newbies to parallel programming, I was writing a simple script to parallelize solving an integral. The following is the code that I wrote:

        num_chunks = 1000;
        int threads_assigned = 0;
        int num_threads = 10;// omp_get_num_threads();
        sum_arr = (double *) malloc(sizeof(double) * num_threads);
        memset(sum_arr, 0, num_threads);
        int sums_each_thread = num_chunks / num_threads;
        printf("num of threads = %d\n", num_threads);
        sum = 0.0;
        omp_set_num_threads(num_threads);
        chunk_size = 1.0/(double) num_chunks;
        double start_time = omp_get_wtime();
        #pragma omp parallel
        {
                int idx = omp_get_thread_num();
                if (idx == 0) threads_assigned = omp_get_num_threads();
                int sums_count;
                int loop_size = idx * sums_each_thread;
                for (sums_count = loop_size; sums_count < loop_size   sums_each_thread; sums_count  ) {
                        printf("starting thread no %d\n", idx);
                        // double x = (idx * chunk_size)   (chunk_size / 2.0); simplified version in the next line
                        double x = chunk_size * (sums_count   0.5);
                        double y = 4.0 / (1   x*x);
                        sum_arr[idx - 1]  = (y * chunk_size);
                        sum  = y;
                }
                printf("ending thread no %d\n", idx);
        }
        double end_time = omp_get_wtime();
        printf("sum = %f, in time %f(s)\n", sum * chunk_size, end_time - start_time);
        int new_idx = 0;
        sum = 0.0;
        for(new_idx = 0; new_idx < num_threads; new_idx  ) {
                sum  = sum_arr[new_idx];
        }
        printf("arr sum = %f, with total assigned threads = %d\n", sum, threads_assigned);

I calculated the area by:

1 ) computing the sum of each area of rectangle.

2 ) computing the sum of all the heights (scalar sum) and then multiple by the chunk_size.

Mathematically both of the approaches yields same result but in this case of the above code, approach num 2 always shows the correct result and the approach num 1 fails.

Can anyone explain the cause?

CodePudding user response：

There are two errors in your code:

One is accessing invalid entry in arr_sum where element -1 is accessed by the master thread. Replace

sum_arr[idx - 1]  = (y * chunk_size);

with

sum_arr[idx]  = (y * chunk_size);

Another problem is modification of sum without synchronization. Fetch previous sum and storing the updated value could be interleaved with similar operation by other thread. This will corrupt the result. This effect should be observable after removing printf("starting thread no %d\n", idx); from the inner loop. To fix it just use atomic updates.

#pragma omp atomic
sum  = y;

Now sums computed in both ways are fine.

However, the program does scale poorly. One reason is "false sharing" on arr_sum when threads access the same line of cache causing costly synchronization at CPU level.

Another problem is using atomic instruction in critical sum = y which suffers from the very same problems.

Probably what you need is using reduction feature of OpenMP:

    #pragma omp parallel for reduction( :sum)
    for (int i = 0; i < num_chunks;   i) {
        double x = chunk_size * (i   0.5);
        double y = 4.0 / (1   x*x);
        sum  = y;
    }

Now it scales perfectly and is many (~100) times faster than the corrected loop from OP's question. It's so fast that num_chunks > 10e7 is required to make the loop a significant part of execution time of the program.