Allocating matrix performances-CodePudding

I have two scenarios, in both i allocate 78*2 sizeof(int) of memory and initialize it to 0. Are there any differences regards performances?

Scenario A:

int ** v = calloc(2 , sizeof(int*));

    for (i=0; i<2;   i)
    {
        v[i] = calloc(78, sizeof(int));
    }

Scenario B:

int ** v = calloc(78 , sizeof(int*));

    for (i=0; i<78;   i)
    {
        v[i] = calloc(2, sizeof(int));
    }

I supposed that in performance terms, it's better to use a calloc if an initialize array is needed, let me know if I'm wrong

CodePudding user response：

First, discussing optimization abstractly has some difficulties because compilers are becoming increasingly better at optimization. (For some reason, compiler developers will not stop improving them.) We do not always know what machine code given source code will produce, especially when we write source code today and expect it to be used for many years to come. Optimization may consolidate multiple steps into one or may omit unnecessary steps (such as clearing memory with calloc instead of malloc immediately before the memory is completely overwritten in a for loop). There is a growing difference between what source code nominally says (“Do these specific steps in this specific order”) and what it technically says in the language abstraction (“Compute the same results as this source code in some optimized fashion”).

However, we can generally figure that writing source code without unnecessary steps is at least as good as writing source code with unnecessary steps. With that in mind, let’s consider the nominal steps in your scenarios.

In Scenario A, we tell the computer:

Allocate 2 int *, clear them, and put their address in v.
Twice, allocate 78 int, clear them, and put their addresses in the preceding int *.

In Scenario B, we tell the computer:

Allocate 78 int *, clear them, and put their address in v.
78 times, allocate two int, clear them, and put their addresses in the preceding int *.

We can easily see two things:

Both of these scenarios both clear the memory for the int * and immediately fill it with other data. That is wasteful; there is no need to set memory to zero before setting it to something else. Just set it to something else. Use malloc for this, not calloc. malloc takes just one parameter for the size instead of two that are multiplied, so replace calloc(2, sizeof (int *)) with malloc(2 * sizeof (int *)). (Also, to tie the allocation to the pointer being assigned, use int **v = malloc(2 * sizeof *v); instead of repeating the type separately.)
At the step where Scenario B does 78 things, Scenario A does two things, but the code is otherwise very similar, so Scenario A has fewer steps. If both would serve some purpose, then A is likely preferable.

However, both scenarios allude to another issue. Presumably, the so-called array will be used later in the program, likely in a form like v[i][j]. Using this as a value means:

Fetch the pointer v.
Calculate i elements beyond that.
Fetch the pointer at that location.
Calculate j elements beyond that.
Fetch the int at that location.

Let’s consider a different way to define v: int (*v)[78] = malloc(2 * sizeof *v);.

This says:

Allocate space for 2 arrays of 78 int and put their address in v.

Immediately we see that involves fewer steps than Scenario A or Scenario B. But also look at what it does to the steps for using v[i][j] as a value. Because v is a pointer to an array instead of a pointer to a pointer, the computer can calculate where the appropriate element is instead of having to load an address from memory:

Fetch the pointer v.
Calculate i•78 elements beyond that.
Calculate j elements beyond that.
Fetch the int at that location.

So this pointer-to-array version is one step fewer than the pointer-to-pointer version.

Further, the pointer-to-pointer version requires an additional fetch from memory for each use of v[i][j]. Fetches from memory can be expensive relative to in-processor operations like multiplying and adding, so it is a good step to eliminate. Having to fetch a pointer can prevent a processor from predicting where the next load from memory might be based on recent patterns of use. Additionally, the pointer-to-array version puts all the elements of the 2×78 array together in memory, which can benefit the cache performance. Processors are also designed for efficient use of consecutive memory. With the pointer-to-pointer version, the separate allocations typically wind up with at least some separation between the rows and may have a lot of separation, which can break the benefits of consecutive memory use.