I've been trying to parallelize my Haskell code and it has just been getting slower, so i made some sample code to show my problem here is the serial code:
module Main where
import System.Environment
sumRangeSquares :: (Num a, Enum a) => a -> a -> a
sumRangeSquares start end = sum $ map (^2) [start .. end]
main :: IO ()
main = do
[start, end] <- map read <$> getArgs
print $ sumRangeSquares start end
Compiled with stack ghc -- -O2 -rtsopts -eventlog -threaded src/Main.hs
and ran with time ./src/Main 1 10000000
, it completes in about 0.4 seconds
Now the obvious parallel counterpart is:
module Main where
import Control.Parallel.Strategies
import System.Environment
sumRangeSquares :: (Num a, Enum a) => a -> a -> a
sumRangeSquares start end = sum $ parMap rseq (^2) [start .. end]
main :: IO ()
main = do
[start, end] <- map read <$> getArgs
print $ sumRangeSquares start end
Compiled the same way and ran with time ./src/Main 1 10000000 RTS -N4 -lf -s
takes over 6 seconds
Here's the log created by -s
:
2,661,959,552 bytes allocated in the heap
1,891,228,032 bytes copied during GC
468,753,512 bytes maximum residency (12 sample(s))
307,102,616 bytes maximum slop
1226 MiB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1837 colls, 1837 par 10.483s 2.705s 0.0015s 0.0080s
Gen 1 12 colls, 11 par 5.157s 1.391s 0.1159s 0.5573s
Parallel GC work balance: 26.09% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 10000000 (9998153 converted, 1847 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.038s ( 0.038s elapsed)
MUT time 6.995s ( 2.158s elapsed)
GC time 15.639s ( 4.096s elapsed)
EXIT time 0.001s ( 0.005s elapsed)
Total time 22.673s ( 6.297s elapsed)
Alloc rate 380,577,209 bytes per MUT second
Productivity 30.8% of total user, 34.3% of total elapsed
real 0m6.374s
user 0m16.889s
sys 0m5.859s
And here is the event log as seen in threadscope Main.eventlog
.
As shown in the image, there is a lot of idle time and all four HECs run and idle at relatively the same times. Furthermore, there's lots of long idle times, and unbalanced spark pools and spark creations.
CodePudding user response:
The cost of creating a new CPU thread is high and you are requesting to create a new thread for every tiny computation. The product of two integer costs much less then creating a new thread. So your machine is busy creating and killing new threads instead of doing useful work.
When you have a CPU, you have to give it a small amount of expensive jobs to get a performance boost.
This is, maybe awkward, but sufficient example: we leave sumRangeSquare
the same as in sequential variant and split our range into 4 pieces, then run 4 parallel threads with sumRangeSquares
, then sum 4 outputs in final result.
module Main where
import Control.Parallel.Strategies
import System.Environment
sumRangeSquares :: (Integer, Integer) -> Integer
sumRangeSquares (start, end) = sum $ map (^2) [start .. end]
main :: IO ()
main = do
[start, end] <- map (read :: (String -> Integer)) <$> getArgs
let space = [(start (i-1)*(div (end-start) 4), start i*(div (end-start) 4)) | i <- [1..3]]
print $ sum $ parMap rseq sumRangeSquares (space [(snd $ last space, end)])
I used 1 and 30 000 000 as args to get more significant result and I have this for you sequential variant:
time ./app/Main 1 30000000
real 0m1,353s
user 0m1,350s
sys 0m0,004s
This for my parallel, run with one thread:
time ./app/Main 1 30000000 RTS -N1 -lf
real 0m1,334s
user 0m1,311s
sys 0m0,022s
This for my parallel, run with four threads:
time ./app/Main 1 30000000 RTS -N4 -lf
real 0m0,416s
user 0m1,386s
sys 0m0,024s