I'm recently doing this problem, taken directly and translated from day 1 task 3 of IOI 2010, "Quality of life", and I encountered a weird phenomenon.
I was setting up a 0-1 matrix and using that to calculate a prefix sum matrix in 1 loop:
for (int i = 1; i <= m; i )
{
for (int j = 1; j <= n; j )
{
if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
b[i][j] = b[i-1][j] b[i][j-1] - b[i-1][j-1] lower[i][j];
}
}
and I got TLE (time limit exceeded) on 4 tests (the time limit is 2.0s). While using 2 for loop seperately:
for (int i = 1; i <= m; i )
{
for (int j = 1; j <= n; j )
{
if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
}
}
for (int i = 1; i <= m; i )
{
for (int j = 1; j <= n; j )
{
b[i][j] = b[i-1][j] b[i][j-1] - b[i-1][j-1] lower[i][j];
}
}
got me full AC (accepted).
As we can see from the 4 pictures here:
TLE result, picture 1 : https://i.stack.imgur.com/9o5C2.png
TLE result, picture 2 : https://i.stack.imgur.com/TJwX5.png
AC result, picture 1 : https://i.stack.imgur.com/1fo2H.png
AC result, picture 2 : https://i.stack.imgur.com/CSsZ2.png
the 2 for-loops code generally ran a bit faster (even in accepted test cases), contrasting my logic that the single for-loop should be quicker. Why does this happened?
Full code (AC) : https://pastebin.com/c7at11Ha (Please ignore all the nonsense bit and stuff like using namespace std;
, as this is a competitive programming contest).
- Note : The judge server, lqdoj.edu.vn is built on dmoj.ca, a global competitive programming contest platform.
CodePudding user response:
if you look at assembly you will see the source of the difference
if (a[i][j] < x) {lower[i][j] = 0;} else {lower[i][j] = 1;}
b[i][j] = b[i-1][j] b[i][j-1] - b[i-1][j-1] lower[i][j];
in this case, there is a data dependency. the assignment to b depends on a value from the assignment to lower. so the operations go sequentially in the loop - first assignment to lower, then to b. a compiler cannot optimize this code significantly because of the dependency.
separation of assignments into 2 loops
the assignment to lower is now independent and the compiler can use simd instructions that give a performance boost. the second loop stays more or less similar to the original assembly