Parallel execution of prime finding algorithm slows runtime

Time:02-18

So I implemented the following prime-finding algorithm in Go.

  1. primes = []
  2. Assume all numbers are prime (vacuously true).
  3. check = 2
  4. If check is still assumed prime, append it to primes.
  5. Multiply check by each prime less than or equal to its minimum factor and eliminate the results from the assumed primes.
  6. Increment check by 1 and repeat steps 4 through 6 until check > limit.

Here is my serial implementation:

package main

import(
    "fmt"
    "time"
)

type numWithMinFactor struct {
    number int
    minfactor int
}

func pow(base int, power int) int{
    result := 1
    for i:=0;i<power;i++ {
        result*=base
    }
    return result
}

func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor){
    var n int
    for i:=0;primes[i]<=check.minfactor;i++ {
        n = check.number*primes[i]
        if n>top{
            break;
        }
        minFactors[n] = numWithMinFactor{n,primes[i]}
        if i+1 == len(primes){
            break;
        }
    }
}

func findPrimes(top int) []int{
    primes := []int{}
    minFactors := make([]numWithMinFactor,top+2)
    check := 2
    for power:=1;check <= top;power++ {
        if minFactors[check].number == 0{
            primes = append(primes,check)
            minFactors[check] = numWithMinFactor{check,check}
        }
        process(minFactors[check],primes,top,minFactors)
        check++
    }
    return primes
}

func main(){ 
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed)
}

This runs great, producing all the primes < 1,000,000 in about 63ms (mostly printing) and primes < 10,000,000 in 600ms on my pc. Now I figure none of the numbers check such that 2^n < check <= 2^(n+1) have minimum factors > 2^n, so I can do all the multiplications and eliminations for each check in that range in parallel once I have the primes up to 2^n. My parallel implementation is as follows:

package main

import(
    "fmt"
    "time"
    "sync"
)

type numWithMinFactor struct {
    number int
    minfactor int
}

func pow(base int, power int) int{
    result := 1
    for i:=0;i<power;i++ {
        result*=base
    }
    return result
}

func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor, wg *sync.WaitGroup){
    defer wg.Done()
    var n int
    for i:=0;primes[i]<=check.minfactor;i++ {
        n = check.number*primes[i]
        if n>top{
            break;
        }
        minFactors[n] = numWithMinFactor{n,primes[i]}
        if i+1 == len(primes){
            break;
        }
    }
}

func findPrimes(top int) []int{
    primes := []int{}
    minFactors := make([]numWithMinFactor,top+2)
    check := 2
    var wg sync.WaitGroup
    for power:=1;check <= top;power++ {
        for check <= pow(2,power){
            if minFactors[check].number == 0{
                primes = append(primes,check)
                minFactors[check] = numWithMinFactor{check,check}
            }
            wg.Add(1)
            go process(minFactors[check],primes,top,minFactors,&wg)
            check++
            if check>top{
                break;
            }
        }
        wg.Wait()
    }
    return primes
}

func main(){ 
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed)
}

Unfortunately this implementation is slower, running up to 1,000,000 in 600ms and up to 10 million in 6 seconds. My intuition tells me that there is potential for parallelism to improve performance, but I clearly haven't been able to achieve that. I would greatly appreciate any input on how to improve runtime here, or more specifically any insight as to why the parallel solution is slower.
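Part of what I suspect is going on is goroutine scheduling overhead: each call to process does only a handful of multiplications, so spawning a goroutine per number can cost more than the work itself. A minimal sketch that contrasts one-goroutine-per-job with a plain loop over the same jobs (tinyWork is my own stand-in for one small process-style call, not code from the question):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tinyWork is a stand-in for one tiny job: a handful of
// multiplications, comparable to one process() call.
func tinyWork(n int) int {
	s := 0
	for p := 2; p <= 7; p++ {
		s += n * p
	}
	return s
}

func main() {
	const jobs = 100000
	results := make([]int, jobs)

	// One goroutine per tiny job: scheduling cost dominates the work.
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = tinyWork(i) // each goroutine writes a distinct index
		}(i)
	}
	wg.Wait()
	perGoroutine := time.Since(start)

	// Same work in a single goroutine: no scheduling cost.
	start = time.Now()
	for i := 0; i < jobs; i++ {
		results[i] = tinyWork(i)
	}
	serial := time.Since(start)

	fmt.Printf("one goroutine per job: %v\n", perGoroutine)
	fmt.Printf("single goroutine:      %v\n", serial)
}
```

On a typical machine the per-goroutine version is considerably slower, which is consistent with the slowdown described above.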

Additionally, the parallel solution consumes more memory than the serial solution, but that is to be expected; the serial solution can run up to 1,000,000,000 in about 22 seconds, where the parallel solution runs out of memory on my system (32GB ram) going for the same target. But I'm asking about runtime here, not memory use. I could, for example, rely on the zero-value state of the minFactors array rather than a separate isPrime []bool, but I think it is more readable as is.

I've tried passing a pointer for primes []int, but that didn't seem to make a difference. Using a channel instead of passing the minFactors array to the process function resulted in much greater memory use and roughly 10x slower performance. I've re-written this algorithm a couple of times to see if I could iron anything out, but no luck. Any insights or suggestions would be much appreciated, because I think parallelism should make this faster, not 10x slower!

Per @Volker's suggestion I limited the number of goroutines to something less than my pc's available logical processors with the following revision; however, I am still getting runtimes that are 10x slower than the serial implementation.
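A common Go idiom for capping concurrency, instead of counting goroutines by hand, is a buffered-channel semaphore: acquiring a slot blocks once the channel is full. A small sketch under my own names (sumWithSemaphore is illustrative, not code from the question):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumWithSemaphore adds 0..n-1 using goroutines, but never lets
// more than `limit` of them run at once.
func sumWithSemaphore(n, limit int) int {
	sem := make(chan struct{}, limit) // capacity = max in-flight goroutines
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` goroutines hold slots
		go func(v int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			mu.Lock()
			total += v
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return total
}

func main() {
	// Cap at the logical CPU count, as suggested for the sieve.
	fmt.Println(sumWithSemaphore(100, runtime.NumCPU())) // prints 4950
}
```

The same pattern could gate the calls to process, so the cap is enforced by blocking rather than by an inner counting loop.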

package main

import(
    "fmt"
    "time"
    "sync"
)

type numWithMinFactor struct {
    number int
    minfactor int
}

func pow(base int, power int) int{
    result := 1
    for i:=0;i<power;i++ {
        result*=base
    }
    return result
}

func process(check numWithMinFactor,primes []int,top int,minFactors []numWithMinFactor, wg *sync.WaitGroup){
    defer wg.Done()
    var n int
    for i:=0;primes[i]<=check.minfactor;i++ {
        n = check.number*primes[i]
        if n>top{
            break;
        }
        minFactors[n] = numWithMinFactor{n,primes[i]}
        if i+1 == len(primes){
            break;
        }
    }
}

func findPrimes(top int) []int{
    primes := []int{}
    minFactors := make([]numWithMinFactor,top+2)
    check := 2
    nlogicalProcessors := 20
    var wg sync.WaitGroup
    var twoPow int
    for power:=1;check <= top;power++ {
        twoPow = pow(2,power)
        for check <= twoPow{
            for nLogicalProcessorsInUse := 0 ; nLogicalProcessorsInUse < nlogicalProcessors; nLogicalProcessorsInUse++ {
                if minFactors[check].number == 0{
                    primes = append(primes,check)
                    minFactors[check] = numWithMinFactor{check,check}
                }
                wg.Add(1)
                go process(minFactors[check],primes,top,minFactors,&wg)
                check++
                if check>top{
                    break;
                }
                if check>twoPow{
                    break;
                }
            }           
            wg.Wait()
            if check>top{
                break;
            }
        }
    }
    return primes
}

func main(){ 
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(10000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed)
}

tl;dr: Why is my parallel implementation slower than the serial implementation, and how do I make it faster?

Per @mh-cbon's suggestion I created larger jobs for parallel processing, resulting in the following code.

package main

import(
    "fmt"
    "time"
    "sync"
)

func pow(base int, power int) int{
    result := 1
    for i:=0;i<power;i++ {
        result*=base
    }
    return result
}

func process(check int,primes []int,top int,minFactors []int){
    var n int
    for i:=0;primes[i]<=minFactors[check];i++ {
        n = check*primes[i]
        if n>top{
            break;
        }
        minFactors[n] = primes[i]
        if i+1 == len(primes){
            break;
        }
    }
}

func processRange(start int,end int,primes []int,top int,minFactors []int, wg *sync.WaitGroup){
    defer wg.Done()
    for start <= end{
        process(start,primes,top,minFactors)
        start++
    }
}


func findPrimes(top int) []int{
    primes := []int{}
    minFactors := make([]int,top+2)
    check := 2
    nlogicalProcessors := 10
    var wg sync.WaitGroup
    var twoPow int
    var start int
    var end int
    var stepSize int
    var stepsTaken int
    for power:=1;check <= top;power++ {
        twoPow = pow(2,power)
        stepsTaken = 0
        stepSize = (twoPow/2)/nlogicalProcessors
        for check <= twoPow{                
            start = check
            end = check+stepSize
            if stepSize == 0{
                end = twoPow
            }
            if stepsTaken == nlogicalProcessors-1{
                end = twoPow
            }
            if end>top {
                end = top
            }
            for check<=end {            
                if minFactors[check] == 0{
                    primes = append(primes,check)
                    minFactors[check] = check
                }
                check++
            }
            wg.Add(1)
            go processRange(start,end,primes,top,minFactors,&wg)
            if check>top{
                break;
            }
            if check>twoPow{
                break;
            }
            stepsTaken++
            
        }
        wg.Wait()
        if check>top{
            break;
        }
    }
    return primes
}

func main(){ 
    fmt.Println("Welcome to prime finder!")
    start := time.Now()
    fmt.Println(findPrimes(1000000))
    elapsed := time.Since(start)
    fmt.Printf("Finding primes took %s\n", elapsed)
}

This runs at a speed similar to the serial implementation.

CodePudding user response:

So I did eventually get a parallel version of the code to run slightly faster than the serial version, following suggestions from @mh-cbon (see above). However, this implementation did not yield vast improvements over the serial implementation (50ms to 10 million, compared to 75ms serially). Considering that allocating and writing an []int of length 10,000,000 takes 25ms, I'm not disappointed by these results. As @Volker stated, "such stuff often is not limited by CPU but by memory bandwidth," which I believe is the case here.
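The memory-bandwidth point is easy to check directly: timing the zeroed allocation and then a pass that touches every element gives a floor that no amount of extra CPU parallelism can beat. A small sketch (touchAll is my own helper, not code from the question):

```go
package main

import (
	"fmt"
	"time"
)

// touchAll writes every element exactly once, so the loop is bounded
// by memory bandwidth rather than by arithmetic.
func touchAll(s []int) {
	for i := range s {
		s[i] = i
	}
}

func main() {
	const n = 10000000

	start := time.Now()
	minFactors := make([]int, n) // zeroed allocation
	fmt.Printf("allocate %d ints: %v\n", n, time.Since(start))

	start = time.Now()
	touchAll(minFactors)
	fmt.Printf("write    %d ints: %v\n", n, time.Since(start))
}
```

If the write pass alone takes a sizable fraction of the sieve's total runtime, the sieve is memory-bound and parallel speedup will plateau well below the core count.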

I would still love to see any additional improvements however I am somewhat satisfied with what I've gained here.

  • Serial code running up to 2 billion: 19.4 seconds
  • Parallel code running up to 2 billion: 11.1 seconds
  • Initializing []int{0:2Billion}: 4.5 seconds