This is how I solved the following question; I want to be sure my solution is correct.
A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2 Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable?
Solution : I think I'm right in thinking that 2% of the program is run at 2 GFLOPs, and 98% is run at 200 GFLOPs, and that I can average these speeds to find the performance of the multiprocessor in GFLOPs
(2/100)*2 + (98/100)*200 = 196.04 Gflops
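For reference, here is the same arithmetic as a quick Python check (it only reproduces my averaging idea above, it does not prove the model is right):

    # My assumption: average the two rates, weighted by the fractions of the *code*
    seq_fraction = 0.02       # 2% of the code is sequential
    par_fraction = 0.98       # 98% of the code is parallelizable
    seq_rate = 2.0            # Gflops on one processor
    par_rate = 100 * 2.0      # Gflops on all 100 processors

    performance = seq_fraction * seq_rate + par_fraction * par_rate
    print(performance)        # 196.04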
I want to be sure: is my solution correct?
CodePudding user response:
From my understanding, it is 2% of the program's code that is sequential, not 2% of the execution time. This means the sequential code takes a significant portion of the time: since there are a lot of processors, the parallel part is drastically accelerated and finishes quickly, while the sequential part does not.
With your method, a program with 50% sequential code and 1000 processors would run at (50/100)*2 + (50/100)*2000 = 1001 Gflops. This means that all processors are used at ~50% of their maximum capacity on average during the whole execution of the program, which is too good to be possible. Indeed, the parallel part of the program should be so fast that it takes only a tiny fraction of the execution time (<5%), while the sequential part takes almost all of the program's time (>95%). Since the largest part of the execution time runs at 2 Gflops, the processors cannot be used at ~50% of their capacity!
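To make this concrete, here is a minimal Python sketch (assuming only the peak rates above and a fixed amount of work): the weighted average of rates reports ~1001 Gflops, while a time-based accounting of the very same program shows that more than 99% of the wall-clock time is spent in the 2 Gflops sequential part.

    # Counterexample: 50% sequential code, 1000 processors, 2 Gflops each
    p = 0.50                  # parallelizable fraction of the work
    s = 1000                  # number of processors
    single_gflops = 2.0       # rate of one processor

    # Naive weighted average of rates (the method from the question)
    naive = (1 - p) * single_gflops + p * (s * single_gflops)
    print(naive)              # 1001.0 Gflops -- too good to be true

    # Time-based accounting for 1 Gflop of total work
    t_seq = (1 - p) / single_gflops     # seconds spent in the sequential part
    t_par = p / (s * single_gflops)     # seconds spent in the parallel part
    print(1.0 / (t_seq + t_par))        # ~4.0 Gflops actually achieved
    print(t_seq / (t_seq + t_par))      # ~0.999 -- the sequential part dominates the time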
Based on Amdahl's law, you can compute the actual speed-up of this code:
Slat = 1 / ((1 - p) + p/s)
where Slat is the speed-up of the whole program, p is the portion of parallel code (0.98) and s is the number of processors (100). This means Slat ≈ 33.6.
Since one processor runs at 2 Gflops and the program is 33.6 times faster overall using many processors, the overall program runs at 33.6 * 2 = 67.2 Gflops.
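As a minimal Python sketch (simply evaluating the formula above with p = 0.98, s = 100 and a 2 Gflops single-processor rate):

    # Amdahl's law applied to the question's numbers
    p = 0.98               # parallelizable fraction of the code
    s = 100                # number of processors
    single_gflops = 2.0    # rate of one processor

    speedup = 1.0 / ((1.0 - p) + p / s)
    print(speedup)                  # ~33.56x
    print(speedup * single_gflops)  # ~67.1 Gflops for the whole machine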
What Amdahl's law shows is that even a tiny fraction of the execution time being sequential strongly impacts the scalability, and thus the performance, of parallel programs.
CodePudding user response:
Forgive me for starting light & anecdotally,
citing a meme from my beloved math professor,
later we will see why & how well it helps us here
2 + 7 = 15 . . . , particularly so for higher values of 7
If terms were not well defined, or not defined at all, prior to using them, we can all keep speaking, yet we can also be sure that under such conditions no one can understand anyone else's idea, the less any argument laid down to support, or to counter-exemplify, why an original idea was right or why it was wrong.
This said,
and
given civilised societies prefer to take care of shared values (and must therefore, if that is ever to be achieved, have already spent time, care & effort to first share the meaning of words),
let me try a few definitions :
a) GFLOPS
is a unit that measures how many operations in FLO-ating point arithmetic, with no particular kind thereof specified (see remark 1), were performed P-er S-econd ~ FLOPS, here expressed for convenience in multiples of a billion ( G-iga ), i.e. the said GFLOPS
b) processor, multi-processor
is a device ( or some kind of composition of multiple such same devices, expressed as multi-p. ), used to perform some kind of useful work - a processing
This pair of definitions was necessary
to be able to further judge the question we were asked to solve.
The term (a) is a property of (b), irrespective of all other factors, if we assume such a "device" not to be some kind of polymorphic, self-modifying FPGA or an evolutionarily reflective, self-evolving amoeboid, which both processors & multi-processors prefer not to be, at least in our part of the Universe as we know it in 2022-Q2.
Once manufactured, each kind of processor(b) (be it a monolithic or a multi-processor device) has certain, observable, repetitively measurable qualities of processing (doing work).
"A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable"
A multiprocessor . . . (device)
consists . . . has a property of being composed of
100 . . . (quantitative factor) ~ 100
processors, . . . (device)
each . . . declaration of equality
capable . . . having a property of
of a peak . . . peak (not having any higher)
execution . . . execution of work (process/code)
rate . . . being measured in time [1/s]
of 2Gflops . . . (quantitative factor) ~ 2E+9 FLOPS
What is . . . Questioning the
PERFORMANCE . . . (property) a term (not defined yet)
of the SYSTEM . . . (system) a term (not defined yet)
as measured in . . . using some measure to evaluate a property of (system) in
Gflops . . . (units of measure) to express such property in
when . . . (proposition)
2% . . . (quantitative factor) ~ 0.02 fraction of
of the code . . . (subject-being-processed)
is . . . has a property of being
sequential . . . sequential, i.e. steps follow one-after-another
and 98% . . . (quantitative factor) ~ 0.98 fraction of (code) ( the same code )
is . . . has a property of being
parallelizable . . . possible to re-factor into some other form, from a (sequential) original form
( emphasis added )
Fact #1 )
the processor(b) ( a (device) ), from which the introduced multiprocessor ( a macro-(device) ) is internally composed, has a declared (granted) property of not being able to process more FLOPS than the said 2 GFLOPS.
This property does not say how many actual { INTOPS | FLOPS } it will perform at any particular moment in time.
This property does say that any device that was measured and got labeled to indeed have X {M|G|P|E}FLOPS has the very same "glass-ceiling" of not being able to perform a single instruction more per second, even when it is doing nothing at all (chopping NOPs) or even when it is switched off and powered down.
This property is a static supremum, an artificial one (in relation to real-world workloads' instruction mixes), a temperature-dependent constant (and one that often degrades in vivo, not only due to thermal throttling but due to many other reasons in real-world { processor | !processor }-composed SYSTEM ecosystems ).
Fact #2 )
the problem, as visible to us here, has no particular definition of what is or is not a part of the said "SYSTEM". Is it just the (multi)processor? If so, then why introduce a new, not yet defined term, SYSTEM, when it would be a pure identity with the already defined & used term (multi)processor per se? Is it both the (multi)processor and memory or other peripherals? If so, then why do we know literally nothing about such an important neighbourhood (a complement) of the said (multi)processor, without which a SYSTEM would not be The SYSTEM, but a mere part of it, the (multi)processor, which is NOT a SYSTEM without its (such a SYSTEM-defining and completing) neighbourhood?
Fact #3 )
the original Amdahl's Law, often dubbed The Law of Diminishing Returns (of extending the System with more and more resources), speaks about a SYSTEM and its re-organised forms. It compares the same amount and composition of work, as performed in the original SYSTEM (with a pure-[SERIAL] flow of operations, one-step-after-another-after-another), against the same work performed in another, improved SYSTEM' (created by re-organising and extending the original SYSTEM with more resources of some kind, so that the new SYSTEM' operates on more parts of the original work-to-be-done in an improved organisation of work, where more resources can & do perform parts of the work-to-be-done independently of one another ~ in a concurrent, some parts even in a truly parallel, fashion, using all degrees of parallelism the SYSTEM' resources can provide & sustain to serve).
Given that no particular piece of information was present about a SYSTEM, the less about a SYSTEM', we have no right to use The Law of Diminishing Returns to address the problem as defined above. Having no facts does not give us a right to guesstimate, the less to turn to feelings-based evidencing, if we strive to remain serious with ourselves, does it?
Given (a) and (b) above, the only fair claim that indeed holds true is to say :
"From what has been defined so far,
we know that such a multiprocessor
will never work on more than 100 x 2 GFLOP per second of time."
There is zero other knowledge to claim a single bit more (and yet we still have to silently assume that such above-claimed peak FLOPs have no side-effects and remain sustainable for at least one whole second (see remark 2), otherwise even this claim would become skewed).
An extended, stronger version :
"No matter what kind of code is actually being run,
for this, above specified multiprocessor, we cannot say more
than that such a multiprocessor will never work on more than 100 x 2 GFLOPS in any moment of time."
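If a number is wanted at all, the only one the stated facts support is the aggregate peak, a trivial sketch (Python, under the silent assumption that the peak rates are sustainable at all):

    # The only figure the problem statement itself supports:
    processors = 100
    peak_per_processor_gflops = 2.0
    aggregate_peak_gflops = processors * peak_per_processor_gflops
    print(aggregate_peak_gflops)   # 200.0 -- an upper bound, not a prediction of delivered performance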
Remarks :
1 ) see how often this is misused in the promotion of "Exaflops performance" by marketing people, when FMUL f8,f8 is being claimed and "sold" to the public as if it "looked" equal to FMUL f512,f512, which it by far is not, when measured with the same yardstick, is it?
2 ) a similar skewed argument (if not straight misinformation) has been repeated countless times in a (false) claim that the world's "largest" femtosecond LASER was capable of emitting a light pulse carrying more power than XY Suns (a-WOW-moment!), without adding how long it took to pump up the energy for a single such femtosecond-long ( 1 [fs] ~ 1E-15 [s] ) "packet-of-a-few-photons" ... careful readers have already rejected that a-"WOW"-moment artificial stupidity, it not being possible to carry such an astronomic amount of energy, as the energy of XY Suns, on a tiny, energy-poor planet, the less to carry it "over" a wire towards that "superpower" LASER.