Run time is roughly 12 seconds, with and without multiprocessing. Shouldn't multiprocessing be faster?

Time: 03-10

I have this web scraper that scrapes the price of four different metals. The run time, as mentioned, is about 12 seconds with or without multiprocessing. Shouldn't this code run the function four times at roughly the same time and cut about 75% off the run time? My processor has 4 cores and 4 threads, if that matters.

from multiprocessing import Process

import requests
from bs4 import BeautifulSoup

def scraper(url, metal):
    global aluPriser
    global zinkPriser
    global messingPriser
    global kobberPriser
    global tal
    url.status_code
    url.headers
    c = url.content
    soup = BeautifulSoup(c, "html.parser")
    samples = soup.find_all("td",style="text-align:center;white-space:nowrap;border-left:solid black 1px")
    for a in samples:
        for b in a:
            if b.startswith("$"):
                b = b.replace(" ","")
                b = b.replace("$","")
                b = int(b)
                tal.append(b)

I run this code with the following multiprocessing code:

if __name__ == '__main__':
    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/aluminiumpriser?s=0"
    url = requests.get(url)
    whatDate(url)
    p1 = Process(target=scraper(url,"alu"))
    p1.start()


    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/kobber?s=0"
    url = requests.get(url)
    p2 = Process(target=scraper(url,"kobber"))
    p2.start()


    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/metal-priser-mp58?s=0"
    url = requests.get(url)
    p3 = Process(target=scraper(url,"messing"))
    p3.start()


    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/zink?s=0"
    url = requests.get(url)
    p4 = Process(target=scraper(url,"zink"))

    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()

CodePudding user response:

First, `Process(target=scraper(url, "alu"))` doesn't hand the function to the child process at all: it calls `scraper(url, "alu")` immediately in the parent and passes its return value (`None`) as the target. The correct form is `Process(target=scraper, args=(url, "alu"))`, which is part of why the timings are identical with and without multiprocessing.

Second, to get any real benefit from parallelization here, you need to move the `requests.get()` call into the scraper function. Almost all of your time is spent waiting on network requests; parallelizing the CPU-bound parsing doesn't matter when almost no time is spent in it, and as written the four downloads still happen one after another in the parent process.

That said, multiprocessing is also the wrong tool for this particular job: you pay more in process startup and serialization costs than you gain from avoiding GIL contention, and an I/O-bound workload like this releases the GIL while waiting on the network anyway. Use threading instead.
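A minimal sketch of the threaded approach, assuming the question's four Alumeco URLs and its `td` style selector. The parsing is pulled into a small helper, and each worker returns its prices instead of appending to globals; the names `parse_prices`, `fetch_all`, and the `URLS` dict are illustrative, not from the original code.

```python
# Sketch: each worker does its own requests.get(), so the four
# downloads overlap; ThreadPoolExecutor replaces multiprocessing.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

URLS = {
    "alu": "https://www.alumeco.dk/viden-og-teknik/metalpriser/aluminiumpriser?s=0",
    "kobber": "https://www.alumeco.dk/viden-og-teknik/metalpriser/kobber?s=0",
    "messing": "https://www.alumeco.dk/viden-og-teknik/metalpriser/metal-priser-mp58?s=0",
    "zink": "https://www.alumeco.dk/viden-og-teknik/metalpriser/zink?s=0",
}

def parse_prices(html):
    # Same selector and cleanup as the question's scraper().
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all(
        "td",
        style="text-align:center;white-space:nowrap;border-left:solid black 1px",
    )
    prices = []
    for cell in cells:
        for child in cell:
            # NavigableString subclasses str, so this keeps only text nodes.
            if isinstance(child, str) and child.startswith("$"):
                prices.append(int(child.replace(" ", "").replace("$", "")))
    return prices

def scraper(metal, url):
    # The request now happens inside the worker, which is what
    # makes the four downloads run concurrently.
    response = requests.get(url, timeout=30)
    return metal, parse_prices(response.content)

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        # Pass the function and its arguments separately:
        # submit(scraper, metal, url), NOT submit(scraper(metal, url)).
        futures = [pool.submit(scraper, metal, url) for metal, url in urls.items()]
        return dict(f.result() for f in futures)

# Usage: prices = fetch_all(URLS)
```

Returning values from the workers (rather than mutating `tal` and the per-metal globals) also keeps the code correct if you ever switch back to processes, since child processes don't share the parent's memory.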
