Originally this is a problem coming up in mathematica.SE, but since multiple programming languages have involved in the discussion, I think it's better to rephrase it a bit and post it here.
In short, michalkvasnicka found that in the following MATLAB sample
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for c = 1:s
for r = 1:s
H(r,c) = 1/(r c-1);
end
end
toc
%Elapsed time is 1.359625 seconds.... For-loop
tic;
% vectorized version
c = 1:s;
r = c';
HH=1./(r c-1);
toc
%Elapsed time is 0.047916 seconds.... Vectorized
isequal(H,HH)
the vectorized code piece is more than 25 times faster than the pure for-loop code piece. Though I don't have access to MATLAB so cannot test the sample myself, the timing 1.359625
seems to suggest it's tested on an average PC, just as mine.
But I cannot reproduce the timing with other languages like fortran or julia! (We know, both of them are famous for their performance of numeric calculation. Well, I admit I'm by no means an expert of fortran or julia. )
The followings are the samples I used for test. I'm using a laptop with i7-8565U CPU, Win 10.
fortran
fortran code is compiled with gfortran (TDM-GCC-10.3.0-2, with compile option -Ofast).
program tst
use, intrinsic :: iso_fortran_env
implicit none
integer,parameter::s=15000
integer::r,c
real(real64)::hmn(s,s)
do r=1,s
do c=1, s
hmn(r,c)=1._real64/(r c - 1)
end do
end do
print *, hmn(s,s)
end program
compilation timing: 0.2057823
seconds
execution timing: 0.7179657
seconds
julia
Version of julia is 1.6.3.
@time (s=15000; Hmm=[1. /(r c-1) for r=1:s,c=1:s];)
Timing: 0.7945998
seconds
Here comes the question:
Is the timing of MATLAB reliable?
If the answer to 1st question is yes, then how can we reproduce the performance (for 2 GHz CPU, the timing should be around
0.05
seconds) with julia, fortran, or any other programming languages?
CodePudding user response:
tic
/toc
should be fine, but it looks like the timing is being skewed by memory pre-allocation.
I can reproduce similar timings to your MATLAB example, however
On first run (
clear
workspace)- Loop approach takes 2.08 sec
- Vectorised approach takes 1.04 sec
- Vectorisation saves 50% execution time
On second run (workspace not cleared)
- Loop approach takes 2.55 sec
- Vectorised approach takes 0.065 sec
- Vectorisation "saves" 97.5% execution time
My guess would be that since the loop approach explicitly creates a new matrix via zeros
, the memory is reallocated from scratch on every run and you don't see the speed improvement on subsequent runs.
However, when HH
remains in memory and the HH=___
line outputs a matrix of the same size, I suspect MATLAB is doing some clever memory allocation to speed up the operation.
We can prove this theory with the following test:
Test Num | Workspace cleared | s | Loop (sec) | Vectorised (sec)
1 | Yes | 15000 | 2.10 | 1.41
2 | No | 15000 | 2.73 | 0.07
3 | No | 15000 | 2.50 | 0.07
4 | No | 15001 | 2.74 | 1.73
See the variation between tests 2 and 3, this is why timeit
would have been helpful for an average runtime (see footnote). The difference in output sizes between tests 3 and 4 are pretty small, but the execution time returns to a similar magnitude of that in test 1 for the vectorised approach, suggesting that the re-allocation to create HH
costs most of the time.
Footnote: tic
/toc
timings in MATLAB can be improved by using the in-built timeit
function, which essentially takes an average over several runs. One interesting thing to observe from the workings of timeit
though is that it explicitly "warms up" (quoting a comment) the tic
/toc
function by calling it a couple of times. You can see when running tic
/toc
a few times from a clear workspace (with no intermediate code) that the first call takes longer than subsequent calls, as there must be some overhead for getting the timer initialised.
CodePudding user response:
I hope that the following modified benchmark could bring some new light to the problem:
s = 15000;
tic
% for-loop version
H = zeros(s,s);
for i =1:10
for c = 1:s
for r = 1:s
H(r,c) = H(r,c) 1/(r c-1 i);
end
end
end
toc
tic;
% vectorized version
HH = zeros(s,s);
c = 1:s;
r = c';
for i=1:10
HH= HH 1./(r c-1 i);
end
toc
isequal(H,HH)
In this case any kind of "cashing" is avoided by changing of matrix H (HH) at each for-loop (over "i") iteration.
I this case we get:
Elapsed time is 3.737275 seconds. (for-loop)
Elapsed time is 1.143387 seconds. (vectorized)
So, there is still performance improvement (~ 3x) due to the vectorization, which is probably done by implicit multi-threading implementation of vectorized Matlab commands.
Yes, tic/toc vs timeit is not strictly consistent, but the overall timing functionality is very similar.