Array Padding does not mitigate false sharing? C, OpenMP


#include <stdio.h>
#include <omp.h>
static long num_steps = 100000000; double step;
#define PAD 8
#define NUM_THREADS 6
int main(){
int i, nthreads; double pi=0, sum[NUM_THREADS][PAD]={0};
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);

//Starting Timer
double time_start = omp_get_wtime();

#pragma omp parallel
{
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if(id==0) nthreads = nthrds;
    for(i=id; i<num_steps; i=i+nthrds){
        x = (i+0.5)*step;
        sum[id][0] += 4.0/(1.0+x*x);
    }
}
for(i=0; i<nthreads; i++) pi += sum[i][0]*step;

//Ending Timer
double time_end = omp_get_wtime();

double timepass = time_end-time_start;

//New Run, how many threads
printf("Integration Program runs with %d threads\n", nthreads);

//Print Result of Integral
printf("Integration Result: %lf\n", pi);

//Print How much Time has passed
printf("%lf Time passed for Integration...\n", timepass);

//Print Effective Time
printf("Effective Total Time: %lf\n\n", timepass*nthreads);
}

This snippet of code is taken from an OpenMP tutorial by Tim Mattson. The code integrates the function 4.0/(1+x*x) and holds each thread's partial result in a 2D array named sum, padded so that each thread's row occupies its own cache line. I use a Linux machine and have checked that I have the standard 64-byte cache lines on L1, L2, and L3. I compiled with gcc and no optimizations, and was expecting the runtime to decrease as the thread count increases. This is what I got for the runtimes:

1 threads: 0.356362

2 threads: 0.541903

3 threads: 0.416097

4 threads: 0.346139

5 threads: 0.286879

6 threads: 0.315139

It seems that false sharing still occurs even with the padding, and I am confused as to why. I have increased the padding to larger sizes and the scalability is similarly poor. The only thing that seems to fix the poor scalability is turning on compiler optimizations; even -O1 makes the code scale well. I am not sure why that is the case.
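For reference, here is a minimal sketch of the same integration written with a private accumulator and an OpenMP reduction instead of the shared sum array (this is not the tutorial's padded version, just an alternative I am assuming for comparison); since the hot loop never writes to shared memory, false sharing cannot occur there in the first place:

#include <stdio.h>
static long num_steps = 100000000;

int main(void){
    double step = 1.0/(double) num_steps;
    double pi = 0.0;

    /* each thread accumulates into its own private copy of pi;
       OpenMP combines the partial sums once at the end, so the
       hot loop never touches a shared array */
    #pragma omp parallel for reduction(+:pi)
    for(long i = 0; i < num_steps; i++){
        double x = (i + 0.5) * step;
        pi += 4.0/(1.0 + x*x);
    }
    pi *= step;

    printf("Integration Result: %lf\n", pi);
    return 0;
}

In principle this should scale even at -O0, since each partial sum lives in a private variable rather than in the shared array.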

CodePudding user response:

I wonder if the story about false sharing needs to be revisited. I've adapted the code to

#ifndef PAD
#define PAD 8
#endif

#ifndef NTHREADS
#define NTHREADS 6
#endif

int main(){
  int i, nthreads; double pi=0, sum[NTHREADS][PAD]={0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NTHREADS);

also:

  printf("Integration Program runs with %d threads, padding=%d\n", nthreads,PAD);

so that I can run a quick shell loop:

for p in 1 2 3 4 5 6 7 8 ; do
    ## pi.c is a placeholder for the source file name
    gcc -fopenmp -O0 -DPAD=$p -DNTHREADS=56 pi.c -o pi && ./pi
done

and this is what I get:

Integration Program runs with 56 threads, padding=1
Integration Result: 3.141593
0.006488 Time passed for Integration...
Effective Total Time: 0.363319

Integration Program runs with 56 threads, padding=2
Integration Result: 3.141593
0.006484 Time passed for Integration...
Effective Total Time: 0.363106

Integration Program runs with 56 threads, padding=3
Integration Result: 3.141593
0.006213 Time passed for Integration...
Effective Total Time: 0.347925

Integration Program runs with 56 threads, padding=4
Integration Result: 3.141593
0.006125 Time passed for Integration...
Effective Total Time: 0.342999

Integration Program runs with 56 threads, padding=5
Integration Result: 3.141593
0.006641 Time passed for Integration...
Effective Total Time: 0.371904

Integration Program runs with 56 threads, padding=6
Integration Result: 3.141593
0.006988 Time passed for Integration...
Effective Total Time: 0.391317

Integration Program runs with 56 threads, padding=7
Integration Result: 3.141593
0.006617 Time passed for Integration...
Effective Total Time: 0.370556

Integration Program runs with 56 threads, padding=8
Integration Result: 3.141593
0.006138 Time passed for Integration...
Effective Total Time: 0.343719

In other words: with modern processors false sharing is no longer a problem. The processor keeps a separate accumulator on each core and does not write to the falsely shared locations until it's absolutely necessary.
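If you do want to make the per-thread isolation explicit rather than rely on the compiler or hardware, one possible sketch (assuming 64-byte cache lines; the struct name and layout are my own, not from the tutorial) is to give each thread a cache-line-aligned slot:

#include <stdio.h>
#include <stdalign.h>
#include <omp.h>

#ifndef NTHREADS
#define NTHREADS 6
#endif

/* one slot per thread; alignas(64) assumes 64-byte cache lines and
   guarantees that two slots never share a line */
typedef struct { alignas(64) double val; } padded_double;

static long num_steps = 100000000;

int main(void){
    double step = 1.0/(double) num_steps;
    static padded_double sum[NTHREADS];   /* static storage: zero-initialized */
    omp_set_num_threads(NTHREADS);

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for(long i = id; i < num_steps; i += nthrds){
            double x = (i + 0.5) * step;
            sum[id].val += 4.0/(1.0 + x*x);
        }
    }

    double pi = 0.0;
    for(int i = 0; i < NTHREADS; i++) pi += sum[i].val * step;
    printf("Integration Result: %lf\n", pi);
    return 0;
}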

CodePudding user response:

@Victor Eijkhout I think what you are doing is an interesting idea. I gave it a try: I kept everything the same, except for this line:

//New Run, how many threads
printf("Integration Program runs with %d threads and %d PADS\n", nthreads, PAD);

After that, I varied PAD for a fixed number of threads. These are the results I got:

Integration Program runs with 1 threads and 1 PADS
Integration Result: 3.141593
0.353521 Time passed for Integration...
Effective Total Time: 0.353521

Integration Program runs with 3 threads and 1 PADS
Integration Result: 3.141593
0.628157 Time passed for Integration...
Effective Total Time: 1.884472

Integration Program runs with 3 threads and 2 PADS
Integration Result: 3.141593
0.773415 Time passed for Integration...
Effective Total Time: 2.320245

Integration Program runs with 3 threads and 3 PADS
Integration Result: 3.141593
0.701447 Time passed for Integration...
Effective Total Time: 2.104341

Integration Program runs with 3 threads and 4 PADS
Integration Result: 3.141593
0.315380 Time passed for Integration...
Effective Total Time: 0.946139

Integration Program runs with 3 threads and 5 PADS
Integration Result: 3.141593
0.351690 Time passed for Integration...
Effective Total Time: 1.055071

Integration Program runs with 3 threads and 6 PADS
Integration Result: 3.141593
0.374674 Time passed for Integration...
Effective Total Time: 1.124023

Integration Program runs with 3 threads and 7 PADS
Integration Result: 3.141593
0.295932 Time passed for Integration...
Effective Total Time: 0.887795

Integration Program runs with 3 threads and 8 PADS
Integration Result: 3.141593
0.126768 Time passed for Integration...
Effective Total Time: 0.380304

Integration Program runs with 3 threads and 9 PADS
Integration Result: 3.141593
0.127145 Time passed for Integration...
Effective Total Time: 0.381434

Integration Program runs with 3 threads and 9 PADS
Integration Result: 3.141593
0.407976 Time passed for Integration...
Effective Total Time: 1.223927

Integration Program runs with 3 threads and 9 PADS
Integration Result: 3.141593
0.125064 Time passed for Integration...
Effective Total Time: 0.375193

Integration Program runs with 3 threads and 9 PADS
Integration Result: 3.141593
0.410047 Time passed for Integration...
Effective Total Time: 1.230141

Please take a look at the very first result: 1 thread, 1 pad runs at approximately 0.35 s. The last 4 entries confuse me extremely. How can the performance sometimes be very good and sometimes be very bad? It makes no sense.
