Radix-sort parallel algorithm C program openmp-CodePudding

I'm trying to parallelize the following Radix Sort algorithm C code using OpenMP but I have some doubts about using the OpenMP clauses. In particular, there are some loops where I doubt that they can be parallelized at all.

Here is the code I'm working on:

unsigned getMax(size_t n, unsigned arr[n]) {
    unsigned mx = arr[0];
    unsigned i;
    #pragma omp parallel for reduction(max:mx) private(i)
    for (i = 1; i < n; i  )
        if (arr[i] > mx)
            mx = arr[i];
    return mx;
}
 
void countSort(size_t n, unsigned arr[n], unsigned exp) {
    unsigned output[n]; // output array
    int i, count[10] = { 0 };
    // Store count of occurrences in count[]
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i  ) {
        #pragma omp atomic
        count[(arr[i] / exp) % 10]  ; }
 
    for (i = 1; i < 10; i  )
        count[i]  = count[i - 1];
 
    // Build the output array
    #pragma omp parallel for private(i)
    for (i = (int) n - 1; i >= 0; i--) {
        #pragma omp atomic write
        output[count[(arr[i] / exp) % 10] - 1] = arr[i];
        count[(arr[i] / exp) % 10]--;
    }
    
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i  )
        arr[i] = output[i];
}
 
// The main function to that sorts arr[] of size n using Radix Sort
void radixsort(size_t n, unsigned arr[n], int threads) {
    omp_set_num_threads(threads);
    unsigned m = getMax(n, arr);
    unsigned exp;
    for (exp = 1; m / exp > 0; exp *= 10)
        countSort(n, arr, exp);
}

In particular, I'm not sure if for loops like the following can be parallelized or not:

    for (i = 1; i < 10; i  )
            count[i]  = count[i - 1];

    #pragma omp parallel for private(i)
    for (i = (int) n - 1; i >= 0; i--) {
        #pragma omp atomic write
        output[count[(arr[i] / exp) % 10] - 1] = arr[i];
        count[(arr[i] / exp) % 10]--;
    }

I'm asking for help on the specific OMP clauses I should use; other comments on the code shown are also welcome.

CodePudding user response：

First of all to parallelize a code a reasonable amount of work is needed, otherwise the parallel overheads are bigger than the gain by parallelization. This is definitely the case in your example, since you create the output array on stack (so it cannot be big enough). Comments on your code:

Both loops you mention in your question depend on the order of execution, so they cannot be parallelized easily/efficiently. Note also that there is a race condition when count array is accessed.
If you select a base which is a power of 2 (2^k), you can get rid off expensive integer division and you can use fast bitwise/shift operators instead.
Always define your variables in their minimal required scope. So instead of

unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i  ) ....

the following code is preferred:

#pragma omp parallel for reduction(max:mx)
for (unsigned i = 1; i < n; i  ) ....

To copy your array, memcpy can be used: memcpy(arr,output,n*sizeof(output[0]))
In this loop

#pragma omp parallel for private(i)
    for (i = 0; i < n; i  ) {
        #pragma omp atomic
        count[(arr[i] / exp) % 10]  ; }

you can use reduction instead of atomic operation:

#pragma omp parallel for private(i) reduction( :count[10])
    for (i = 0; i < n; i  ) {
        count[(arr[i] / exp) % 10]  ; }

CodePudding user response：

Radix sort can be parallelized if you split up the data. One way to do this is to use a most significant digit radix sort for the first pass, to create multiple logical bins. For example, if using base 256 (2^8), you end up with 256 bins, which radix sort can then sort in parallel, based on the number of logical cores on your system. With 4 cores, you can sort 4 bins at a time. This relies on having somewhat uniform distribution of the most significant digit, otherwise so that the bins are somewhat equal in size.

Trying to optimize the first pass may not help much, since you'll need atomic read|write for the to update a bin index, and the random access writes to anywhere in the destination array will create cache conflicts.