The number of loop iterations is reduced by a factor of 3, but the execution time is almost unchanged, in C

Time:09-21

In C, I reduced the total number of loop iterations by nearly a factor of 3, but when I measured the execution time I found that this gives almost no improvement. I tested every optimization level (O0, O1, O2 and O3) and the results are basically the same. I suspected a compiler problem, but I want to know what actually causes this behaviour, and what I can do to make the results meet my expectations.

The code is as follows:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define Len 10000000

// Two variables that count the number of loops
int count1 = 0;
int count2 = 0;

int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));
    
    // An array to increase the index,
    // the range of its elements is 1-256
    int rand_arr[128];
    for (int i = 0; i < 128; i++)
        rand_arr[i] = random()%256 + 1;
    
    // A random text, the range of its elements is 0-127
    char *tex = malloc((sizeof *tex) * Len);
    for (int i = 0; i < Len; i++)
        tex[i] = random()%128;
    
    // The first testing
    clock_t start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]])
        count1++;
    printf("No.1: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    // The second testing (x3)
    start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]] + 256)
        count2++;
    printf("No.2: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    printf("count1: %d\n", count1);
    printf("count2: %d\n", count2);
    
    return 0;
}

The printed result (average) is as follows:

No.1: 0.002213 s
No.2: 0.002209 s
count1: 72661
count2: 25417

CodePudding user response:

The problem comes from the processor itself, not the compiler. It is a complex issue related to the behaviour of the CPU caches, the prefetching units and the random access pattern.

Both loops read the tex array at an index i that cannot easily be predicted ahead of time by the processor, because of the random increments stored in rand_arr. Because tex is relatively big, it will likely not be fully stored in the L1 cache (nor in an intermediate L2 cache, if any) but rather in the last-level cache (LLC) or even in RAM. As a result, parts of tex need to be reloaded in each loop from the LLC or from RAM. The latency of the LLC and of RAM is relatively high nowadays. The thing is that the second loop is harder to predict and less cache-friendly than the first, even though its number of iterations is smaller!

On x86 CPUs, caches pack values into 64-byte blocks called cache lines. When a value is fetched from main memory or from another cache (typically due to a cache miss), a full cache line is fetched. Subsequent accesses to the same cache line are faster because the CPU does not need to fetch it again (as long as the cache line has not been invalidated).

In the first loop, the average increment of i is 128 (because the mean of rand_arr is 128). This means that the average stride between two fetched items of tex is 128 bytes; in the worst case, the stride is 256. In the second loop, the average stride is 256+128 = 384 and in the worst case it is 256+256 = 512. When the stride is smaller than 64, there is a high probability that the target cache line has already been fetched; this happens regularly in the first loop but never in the second. Moreover, the prefetching units can prefetch cache lines when several accesses are contiguous or close to each other. This enables the processor to fetch most items of the tex array ahead of time in the first loop. Meanwhile, in the second loop the prefetcher will likely fail to recognize any pattern in the cache-line accesses. The prefetching units will probably not prefetch anything (because it is too expensive to do so), and the result is many cache misses with a high latency that cannot be mitigated, because the accesses are inherently sequential and unpredictable. If the prefetching units decided to prefetch all the cache lines, then the second loop would not be faster than the first (because the two loops are both bound by the memory hierarchy).
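To give a rough feel for this cache-line effect, here is a small stand-alone test, separate from your code. It performs the same number of loads from a large buffer, once with a 16-byte stride and once with a 512-byte stride. The buffer size, the strides and the resulting timings are only illustrative and will vary between machines, but on typical x86 hardware the large-stride pass is clearly slower even though both loops run the same number of iterations:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64u * 1024 * 1024)        // 64 MiB buffer, larger than a typical LLC
#define ACCESSES (8u * 1024 * 1024)  // same number of loads for both strides

// Perform `accesses` loads separated by `stride` bytes, wrapping around
// the buffer; the returned sum keeps the compiler from removing the loop.
static long walk(const unsigned char *buf, size_t stride, size_t accesses)
{
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < accesses; i++) {
        sum += buf[idx];
        idx = (idx + stride) & (N - 1);   // N is a power of two
    }
    return sum;
}

int main(void)
{
    unsigned char *buf = malloc(N);
    for (size_t i = 0; i < N; i++)
        buf[i] = (unsigned char)i;

    clock_t t0 = clock();
    long s1 = walk(buf, 16, ACCESSES);    // several loads hit each 64-byte line
    clock_t t1 = clock();
    long s2 = walk(buf, 512, ACCESSES);   // every load touches a new line
    clock_t t2 = clock();

    printf("stride  16: %f s (sum %ld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("stride 512: %f s (sum %ld)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);

    free(buf);
    return 0;
}

The point is the same as in your program: the cost is dominated by how many distinct cache lines have to be brought in, not by the number of loop iterations.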

Note that random and srandom are not standard C functions. Also, be aware that clock is not always precise on all platforms. In fact, it has a precision of 1 ms on my Windows machine (using GCC and MinGW). Similar behaviour can also be seen on some Linux systems.
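If you need a timer with better resolution than clock, one option on POSIX systems is clock_gettime with CLOCK_MONOTONIC. This is just a sketch; on Windows you would need a different API such as QueryPerformanceCounter:

#define _POSIX_C_SOURCE 199309L   // make clock_gettime visible with -std=c99

#include <stdio.h>
#include <time.h>

// Current monotonic time in seconds; needs -lrt on some older glibc versions.
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

int main(void)
{
    double start = now_sec();
    // ... the code to measure, e.g. the two loops from the question ...
    double elapsed = now_sec() - start;
    printf("elapsed: %f s\n", elapsed);
    return 0;
}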

CodePudding user response:

There are a couple of things that could be causing this:

  1. If you're timing your code, you should have:

    startTime = clock();
    CODE
    endTime = clock();
    

Then print your results and do your analysis afterwards. You do some math on the result and call printf inside the measurement, which is inefficient as far as timing goes. Also, the cast to double is not necessary and could account for part of your measured time, as double math is slower than integer math; stick to int where you can (a sketch after this list shows the timing pattern applied to the loops from the question).

  2. Odd for-loop code - the standard form for a for loop is:

    for(int i = 0;i<length;i++)
          Code
    

You have

  for(int i = 0;i<length;code)
       i++;

Which is odd as far as syntax goes, and may be affecting your timing

  3. clock() - this may be affecting your timing. If clock() returns a double, I would suggest doing it a different way, with a function that returns int or unsigned int, as doubles will hurt your timing as stated above. If you're worried about this, I'd suggest testing it by way of the following:

    startTime = clock();
    for(int i = 0;i<10000;i++)
         dummy = clock();
    endTime = clock();
    totalTime = (endTime - startTime)/10000;
    
  4. for loop - this in itself can be the main source of your base timing (although it's unlikely, especially since it doesn't seem like you're doing any particularly complicated math). You can fix this by using the #pragma unroll compiler directive, which will basically copy and paste all iterations of your for loop into your code, removing its timing effects.
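For reference, here is a minimal sketch of the pattern from point 1 applied to the two loops from the question (it reuses Len, rand_arr, tex, count1 and count2 from the original program): both timestamps are taken immediately around each loop, and all the arithmetic and printf calls happen outside the timed regions.

    clock_t start1, end1, start2, end2;

    start1 = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]])
        count1++;
    end1 = clock();

    start2 = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]] + 256)
        count2++;
    end2 = clock();

    // Conversion and printing happen only after both measurements are done
    printf("No.1: %lf s\n", (double)(end1 - start1) / CLOCKS_PER_SEC);
    printf("No.2: %lf s\n", (double)(end2 - start2) / CLOCKS_PER_SEC);

With this arrangement the printf calls and the double conversion no longer contribute to the measured time, although the loops themselves remain bound by the memory accesses described in the other answer.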
