Consider this trivial algorithm of prime-decomposition of an integer n
: Let d'
be the divisor of n
last found. Initially, set d'=1
. Find the smallest divisor d>d'
of n
, and find the maximal value e
such that de
divides n
. Append de
to the answer and repeat the procedure on n/de
. Finally, stop when n
becomes 1. For simplicity, let's ignore mathematical optimizations, like stop at sqrt n
etc.
I have implemented it in two ways. The first one generates a list of division "attempts", and then groups the successful ones by divisors. For example, for n=20
, we first generate [(2,20),(2,10),(2,5),(3,5),(4,5),(5,5),(5,1)]
, which we then transform to the desired [(2,2),(5,1)]
using group
and other library functions.
The second implementation is an explicit recursion which keeps track of the exponent e
along the way, appends de
to the answer once the maximal e
is reached, proceeds to finding the "next" d
, and so on.
Question 1: Why does the first implementation run way slower than the second, despite the following:
- Both the implementations execute
div
, the core step of the algorithm, roughly the same number of times. - Lazy evaluation (and fusion?) has the effect that the long list illustrated above never has to be materialized in the first place. As you can see in the code below,
divTrials n
, the list I am talking about, is transformed by a chain of higher order functions. In that, I think that the partmap (\xs-> (head xs,length xs)) ... group
should tell the compiler that the list is just intermediate:
{-# OPTIONS_GHC -O2 #-}
module GroupCheck where
import Data.List
import Data.Maybe
implement1 :: Integral t=> t -> [(t,Int)] -- IMPLEMENTATION 1
implement1 = map (\xs-> (head xs,length xs)).factorGroups where
tryDiv (d,n)
| n `mod` d == 0 = (d,n `div` d)
| n == 1 = (1,1) -- hack
| otherwise = (d 1,n)
divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
factorGroups = filter (not.null).map tail.group.map fst.divTrials
implement2 :: Show t => Integral t => t -> [(t,Int)] -- IMPLEMENTATION 2
implement2 num = keep2 $ tail $ go (1,0,1,num) where
range d n = [d 1..n]
nextd d n = fromMaybe n $ find ((0==).(n`mod`)) (range d n)
update (d,e,de,n)
| n `mod` d == 0 = update (d,e 1,de*d,n`div`d)
| otherwise = (d,e,de,n)
go (d,e,de,1) = [(d,e,de,1)]
go (d,e,de,n) = (d,e,de,n) : go (update (nextd d n,0,1,n))
keep2 = map (\(d,e,_,_)->(d,e))
main :: IO ()
main = do
let n = 293872
let ans1 = implement1 n
let ans2 = implement2 n
print ans1
print ans2
Profiling tells us that tryDiv
and divTrials
together eat up >99% of the entire execution time:
> stack ghc -- -main-is GroupCheck.main -prof -fprof-auto -rtsopts GroupCheck
> ./GroupCheck RTS -p >/dev/null && cat GroupCheck.prof
GroupCheck RTS -p -RTS
total time = 18.34 secs (18338 ticks @ 1000 us, 1 processor)
total alloc = 17,561,404,568 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
implement1.divTrials GroupCheck GroupCheck.hs:12:3-69 52.6 69.2
implement1.tryDiv GroupCheck GroupCheck.hs:(8,3)-(11,25) 47.2 30.8
Question 1.5: So.. what's so bad about these functions? Also,
Question 2: In a more general case of having to aggregate contiguous blocks of identical elements from a nondecreasing sequence, should we go the bulky implement2
way if we want speed? (Again, ignoring domain-specific optimizations.)
Or did I totally miss something obvious? Thanks!
CodePudding user response:
Just to establish a baseline, I ran your program on a slightly larger starting number (so that time
didn't print out 0.00s). I chose n = 2938722345623
for no particular reason. Here's the timings before starting to tweak things:
ans1
: indistinguishable from infinity (I finished writing this entire answer and it was still running, about 26 minutes in total)
ans2
: 2.78s
The first thing to try is to tweak this line:
divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
This looks like a pretty natural definition, but it turns out that GHC never memoizes function calls. So if you want to make a list that's defined recursively in terms of itself, you must not make a function call in the recursion. Here's how:
divTrials n = xs where xs = takeWhile (/=(1,1)) $ (2,n): map tryDiv xs
Just that change brings the time down to 7.85s. Still off by a factor of about 3, but much much better.
The less obvious problem lies here:
factorGroups = filter (not.null).map tail.group.map fst.divTrials
Putting the group
so early breaks fusion, causing that intermediate list to actually be materialized. This means allocating and deallocating a lot of cons cells and tuples. Here's an implementation that has the same spirit, but puts more work before the group
:
tryDiv d n
| n `mod` d == 0 = d : tryDiv d (n `div` d)
| n == 1 = []
| otherwise = tryDiv (d 1) n
factorGroups = group . tryDiv 2
With that, we are down to 2.65s -- slightly faster than ans2
, though I only did one test of each so it's pretty likely to just be measurement noise.