Why does the time increase when using more threads in OpenMP


What I am trying to do here is to understand OpenMP, so I wrote a simple program that compares the run times of a parallelized matrix-vector multiplication. It runs with different matrix sizes (1024, 2048, 8192), with different numbers of threads (1, 2, 4, 8) and with different scheduling strategies (static, dynamic, guided). I ran the program on a machine with two cores and four hardware threads.

The times are:
Time for 1 threads with 1024 entries and scheduling 0: 26720 ticks
Time for 1 threads with 8192 entries and scheduling 0: 1486755 ticks
Time for 2 threads with 1024 entries and scheduling 0: 159161 ticks
Time for 2 threads with 8192 entries and scheduling 0: 22254787 ticks

But that does not make sense: the number of CPU ticks increases by a factor of about 5 to 15 when going from one thread to two. The times get a little better again for 4 and 8 threads.

The code is

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif

void matrix(unsigned int n)
{
  // The big arrays we want on the heap
  float *matrix = (float *)malloc(sizeof(float) * n * n);
  float *vector = (float *)malloc(sizeof(float) * n);
  float *result = (float *)malloc(sizeof(float) * n);

  // initialize the matrix
#pragma omp parallel for
  for (int row = 0; row < n; row++)
  {
    for (int column = 0; column < n; column++)
    {
      *(matrix + (row * n) + column) = rand();
    }
  }

  // initialize the vectors
#pragma omp parallel for
  for (int row = 0; row < n; row++)
  {
    *(vector + row) = rand();
    *(result + row) = 0;
  }

  // multiply
#pragma omp parallel for
  for (int row = 0; row < n; row++)
  {
    for (int column = 0; column < n; column++)
    {
      float resultat = *(matrix + (row * n) + column) * *(vector + column);
      *(result + row) += resultat;
    }
  }

  // release the heap buffers again
  free(matrix);
  free(vector);
  free(result);
}

int main()
{
  time_t t_t;

  // initialize the random number generator
  srand((unsigned)time(&t_t));

  unsigned int threads[] = {1, 2, 4, 8};
  unsigned int amounts[] = {1024, 2048, 8192};
  omp_sched_t schedules[] = {omp_sched_static,
                             omp_sched_dynamic,
                             omp_sched_guided};

  size_t size_threads = sizeof(threads) / sizeof(threads[0]);
  size_t size_amounts = sizeof(amounts) / sizeof(amounts[0]);
  size_t size_schedules = sizeof(schedules) / sizeof(schedules[0]);

  // vary the number of threads
  for (int t = 0; t < size_threads; t++)
  {
    omp_set_num_threads(threads[t]);
    for (int a = 0; a < size_amounts; a++)
    {
      for (int s = 0; s < size_schedules; s++)
      {
        omp_set_schedule(schedules[s], 0);
        clock_t start_t = clock();
        matrix(amounts[a]);
        clock_t end_t = clock();
        printf("Time for %d threads with %d entries and scheduling %d: %ld ticks\n\a", threads[t], amounts[a], s, (end_t - start_t));
      }
    }
  }

  return 0;
}

Is there a mistake in my code or an other explanation for this behavior?

Edit: I also tried the gettimeofday() function (from <sys/time.h>), like this:

  struct timeval start_time;
  struct timeval end_time;

  ...
  gettimeofday(&start_time, NULL);
  matrix(amounts[a]);
  gettimeofday(&end_time, NULL);
  ...

  printf("Time for %d threads with %d entries and scheduling %d: %f s\n\a", threads[t], amounts[a], s, (double)(end_time.tv_sec - start_time.tv_sec)   (double)(end_time.tv_usec - start_time.tv_usec)/1000000);

with basically the same results:

Time for 1 threads with 1024 entries and scheduling 0: 0.024589 s
Time for 1 threads with 8192 entries and scheduling 0: 1.393275 s
Time for 2 threads with 1024 entries and scheduling 0: 0.117452 s
Time for 2 threads with 8192 entries and scheduling 0: 25.067069 s

CodePudding user response:

I think this is related to the following question: OpenMP time and clock() give two different results

The thing is, clock() probably returns the total time spent on the CPU, summed over all threads, and that number grows as more threads are active. The real (wall-clock) time spent is not closely related to it. I suggest using the gettimeofday() function to measure real times and comparing the result with clock().
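
For illustration, here is a minimal, self-contained sketch (not the asker's program; the workload in work() is made up) that times the same parallel loop with both clock() and omp_get_wtime(). clock() adds up the CPU time of all threads, while omp_get_wtime() reports the elapsed wall time:

#include <stdio.h>
#include <time.h>
#include <omp.h>

/* Sketch: compare CPU time (clock) with wall time (omp_get_wtime)
   for the same parallel workload. */
static void work(void)
{
  double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < 100000000; i++)
    sum += i * 0.5;
  printf("checksum: %f\n", sum); /* keeps the loop from being optimized away */
}

int main(void)
{
  clock_t c0 = clock();
  double w0 = omp_get_wtime();

  work();

  double w1 = omp_get_wtime();
  clock_t c1 = clock();

  /* clock() sums CPU time over all threads; omp_get_wtime() is elapsed wall time */
  printf("clock():         %f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
  printf("omp_get_wtime(): %f s\n", w1 - w0);
  return 0;
}

Compiled with something like gcc -fopenmp, on a two-core machine the clock() figure should come out roughly twice the omp_get_wtime() figure once both cores are busy, even though the wall time goes down.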

CodePudding user response:

Another problem is that OpenMP has overhead for setting up the threads and distributing the work among them. You need a reasonable amount of work per parallel region, otherwise the overhead is bigger than the gain from parallelization.
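
A rough way to see this (a sketch assuming an OpenMP build, not part of the program above): time the same trivially cheap parallel loop for very small and very large n. For the small sizes the fixed cost of starting and synchronizing the thread team dominates, so parallelization cannot pay off:

#include <stdio.h>
#include <omp.h>

#define N_MAX (1 << 24)

/* Sketch: measure how long one parallel loop takes for a given n.
   The per-iteration work is deliberately tiny, so for small n the
   time is almost entirely parallel-region overhead. */
static double run(int n, float *v)
{
  double t0 = omp_get_wtime();
#pragma omp parallel for
  for (int i = 0; i < n; i++)
    v[i] = v[i] * 2.0f + 1.0f;
  return omp_get_wtime() - t0;
}

int main(void)
{
  static float v[N_MAX];
  run(N_MAX, v); /* warm-up: create the thread team once */
  for (int shift = 8; shift <= 24; shift += 8)
  {
    int n = 1 << shift;
    printf("n = %8d: %f s\n", n, run(n, v));
  }
  return 0;
}

The warm-up call matters because the very first parallel region typically also pays for creating the threads, which would otherwise be charged to the first measurement.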

CodePudding user response:

In addition to what was said, memory contention probably also plays a role, influenced by the size of the L1 and L2 caches.
