I want to create and initialize an array of 1024 elements, and I'm exploring which way is the most efficient in terms of execution time.
I am working with ARM Neon, using arrays of structures like uint16x8x4_t
, which are
of the form
typedef struct uint16x8x4_t
{
uint16x8_t val[4];
} uint16x8x4_t;
and I have three scenarios:
Scenario 1:
I initialize an array of 1024 elements of uint16x8x4_t
like
uint16x4x4_t arrayTest01[1024] = {
{ { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } },
{ { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } },
//... (1020 more times) ...
{ { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } },
{ { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } }
};
In this scenario, I access the elements as arrayTest01[0].val[1][2] = 999
.
Scenario 2:
I create an array of pointers, then allocate memory and finally assign values.
// First: Create array of pointers
uint16x4x4_t* arrayTest02[1024];
// Second: Allocate all the memory (individual allocation)
arrayTest02[0] = malloc(sizeof(uint16x4x4_t));
arrayTest02[1] = malloc(sizeof(uint16x4x4_t));
arrayTest02[2] = malloc(sizeof(uint16x4x4_t));
//... (all indexes until 1022) ...
arrayTest02[1023] = malloc(sizeof(uint16x4x4_t));
// Third: Assign values to each array (using dereference)
(*arrayTest02[0]) = (uint16x4x4_t){ {{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}} };
(*arrayTest02[1]) = (uint16x4x4_t){ {{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}} };
(*arrayTest02[2]) = (uint16x4x4_t){ {{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}} };
//... (all indexes until 1022) ...
(*arrayTest02[1023]) = (uint16x4x4_t){ {{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}} };
In this scenario, I access the elements as (*arrayTest02[0]).val[1][2] = 999
.
Scenario 3:
I create an array of pointers, then create thousands of individual arrays, and I populate the array of pointers with memory addresses.
// First: Create array of pointers
uint16x4x4_t* arrayTest03[1024];
// Second: Create individual arrays with unique names
uint16x4x4_t arrayTest03_01 = { { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } };
uint16x4x4_t arrayTest03_02 = { { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } };
uint16x4x4_t arrayTest03_03 = { { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } };
// ... (all indexes until 1022) ...
uint16x4x4_t arrayTest03_1023 = { { {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} } };
// Third: Reference all of them to the array of pointers
arrayTest03[0] = &arrayTest03_01;
arrayTest03[1] = &arrayTest03_02;
//... (all indexes until 1022) ...
arrayTest03[1023] = &arrayTest03_1023;
In this scenario, I also access the elements as (*arrayTest03[0]).val[1][2] = 999
.
Questions
Now my questions:
- Which scenario would lead to better execution time?
I think in terms of execution time it will be (from lower to higher)
Scenario 1 <= Scenario 3 <= Scenario 2
mostly because of all the malloc
s of Scenario 2 and using unique names for so many arrays in Scenario 3, but I'm not sure. Also, I would have to free
all those allocations, which will impact the total execution time.
- Which scenario requires less memory in the whole program?
I think all of them are pretty similar, but probably
Scenario 1 <= Scenario 2 <= Scenario 3
because Scenario 2 doesn't have unique names as Scenario 3. Again not sure.
- Which scenario is more scalable?
I think in this case Scenario 2
beats the other two, because the variables are saved in the heap (because of malloc), while for the other two scenarios, everything is saved in the stack. If we increase the number from 1024 to a way bigger number (example 999,999), then the stack memory is too small. Is that correct? Or am I missing anything here?
- Any suggestion?
I think Scenario 1
beats the other two scenarios for when initializing an array, but is there anything additional that I should be careful of?
CodePudding user response:
1. Which scenario would lead to better execution time?
Scenario 1 is likely to be most efficient for initialization. If arrayTest01
has static storage duration then the initialization will be essentially free, but even if it has automatic or thread duration, initialization is probably via the equivalent of a single memcpy()
.
Scenario 1 is also likely to be the most efficient for subsequent access. Storage for the array and all its parts will be in one contiguous block (which is cache-friendly), and access will not involve indirection through pointers, other than possibly a single pointer to the overall array itself.
Scenario 2 is likely to be the most expensive for initialization. All those malloc()
calls need to be executed at runtime, and malloc()
is comparatively expensive. Also, each of the allocated objects needs to be initialized independently, via the equivalent of 1024 memcpy()
s instead of (probably) zero or one in Scenario 1.
Scenario 2 is also likely to be the most expensive for subsequent access. The allocated objects are not necessarily contiguous with each other in memory, hence not as cache-friendly as scenario 1, and there are additional pointer loads and indirect accesses relative to scenario 1.
Scenario 3 is likely to be intermediate for initialization. I would expect it to outperform scenario 2 on account of avoiding all the malloc()
s. With a clever optimizer, it is conceivable that initialization might be up to as fast as scenario 1.
Scenario 3 should not be any worse than scenario 2 for subsequent access, because the resulting data structures are the same. However, scenario 3 might benefit from more favorable arrangement of the data in memory, and the optimizer might be able to do a better job on some accesses. But scenario 3 still involves extra pointer loads and indirect memory accesses relative to scenario 1, so scenario 3 is not likely to afford accesses as efficient as scenario 1 affords.
2. Which scenario requires less memory in the whole program?
Scenarios 2 and 3 require storage for a bunch of pointers that Scenario 1 does not require. Scenario 2 also has some amount of additional overhead in the form of metadata for all the dynamic allocations. So 1 < 3 < 2.
3. Which scenario is more scalable?
This depends on more details than you have presented. Inasmuch as your remarks suppose stack allocation for at least the top-level array, however, scenario 2 requires the least space for that array itself, and all the rest of the needed storage is dynamically allocated. Scenario 1 has all the needed memory in one contiguous block, but requires a bit less storage overall. Scenario 3 has a higher overall storage requirement than scenario 1, and none of it is dynamically allocated.
Thus, if your scaling concern is about the amount of automatically allocated storage required then scenario 2 will scale the best, then scenario 1, then scenario 3. Do note that most C implementations for stack-based machines provide means to request larger stack sizes than the default, so this particular consideration is unlikely to be relevant if the needed array size is known in advance. And if the needed array size is not known in advance then the question is moot, because scenario 2 would then be the only viable option among the three presented.
On the other hand, if the concern is performance or overall memory required, then see above -- the relative ordering of the three scenarios on those measures is not scale dependent.
4. Any suggestion?
As a general rule, I would suggest avoiding dynamic allocation where it is not needed, but I can't say whether dynamic allocation is needed in your case.
I don't see any reason to prefer scenario 3 over scenario 1.
Whether that leaves you at scenario 1, scenario 2, or something altogether different depends on your application's specific requirements.
CodePudding user response:
OT: Is there some point in trying to make life difficult with an array of structs?
uint8_t arr[][4][4] = {
{ {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} },
{ {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,42} }, // <== NB '42'
//... (1020 more times) ...
{ {0,1,2,3},{0, },{8,9,10,11},{12,13,14,15} }, // <== NB '0'
{ {0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15} }
};
int main() {
arr[1][3][2] = 41; // ****
for( size_t x = 0; x < sizeof arr/sizeof arr[0]; x ) {
putchar( '[' );
for( size_t y = 0; y < sizeof arr[0]/sizeof arr[0][0]; y ) {
putchar( '[' );
for( size_t z = 0; z < sizeof arr[0][0]/sizeof arr[0][0][0]; z )
printf( "%d,", arr[x][y][z] ); // <== Simple, no?
putchar( ']' );
putchar( ',' );
}
putchar( ']' );
putchar( ',' );
putchar( '\n' );
}
return 0;
}
[[0,1,2,3,],[4,5,6,7,],[8,9,10,11,],[12,13,14,15,],],
[[0,1,2,3,],[4,5,6,7,],[8,9,10,11,],[12,13,41,42,],], <== Note '41 & '42'
[[0,1,2,3,],[0,0,0,0,],[8,9,10,11,],[12,13,14,15,],], <== Note '0's
[[0,1,2,3,],[4,5,6,7,],[8,9,10,11,],[12,13,14,15,],],