My OpenMP implementation performs really badly. When I profile it with VTune, it shows very low CPU usage and I don't know why. Does anyone have an idea?
Hardware:
- NUMA architecture with 28 cores (56 hardware threads)
Implementation:
struct Lineitem {
    int64_t l_quantity;
    int64_t l_extendedprice;
    float l_discount;
    unsigned int l_shipdate;
};
Lineitem* array = (Lineitem*)malloc(sizeof(Lineitem) * array_length);
// array will be filled
#pragma omp parallel for num_threads(48) shared(array, array_length, date1, date2) reduction(+ : sum)
for (unsigned long i = 0; i < array_length; i++)
{
    if (array[i].l_shipdate >= date1 && array[i].l_shipdate < date2 &&
        array[i].l_discount >= 0.08f && array[i].l_discount <= 0.1f &&
        array[i].l_quantity < 24)
    {
        sum += (array[i].l_extendedprice * array[i].l_discount);
    }
}
For additional context, I am building with CMake and Clang.
CodePudding user response:
Modern CPUs only reach high performance when data that has been loaded into cache is reused many times. Since you make a single linear pass over the array, there is no such reuse and you are limited by memory bandwidth. Your cores will indeed be running at a small fraction of their full utilization, because they spend most of their time waiting for data.
Things may be even worse: you have an array of structures from which you use certain fields. If there are other fields that you don't use, every cache line you load from memory also carries bytes you never touch, so the useful bandwidth drops further, in proportion to the unused share of each line. Please amend your question by including the data layout of your structure/class.