I have a web scraper that scrapes the prices of 4 different metals. The run time, as mentioned, is about 12 seconds whether I use multiprocessing or not. Shouldn't this code run the function 4 times at roughly the same time and cut roughly 75% off the run time? My processor has 4 cores and 4 threads, if that has something to do with it.
import requests
from bs4 import BeautifulSoup
from multiprocessing import Process

def scraper(url, metal):
    global aluPriser
    global zinkPriser
    global messingPriser
    global kobberPriser
    global tal
    url.status_code
    url.headers
    c = url.content
    soup = BeautifulSoup(c, "html.parser")
    samples = soup.find_all("td", style="text-align:center;white-space:nowrap;border-left:solid black 1px")
    for a in samples:
        for b in a:
            if b.startswith("$"):
                b = b.replace(" ", "")
                b = b.replace("$", "")
                b = int(b)
                tal.append(b)
I run it with the following multiprocessing code:
if __name__ == '__main__':
    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/aluminiumpriser?s=0"
    url = requests.get(url)
    whatDate(url)
    p1 = Process(target=scraper(url,"alu"))
    p1.start()
    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/kobber?s=0"
    url = requests.get(url)
    p2 = Process(target=scraper(url,"kobber"))
    p2.start()
    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/metal-priser-mp58?s=0"
    url = requests.get(url)
    p3 = Process(target=scraper(url,"messing"))
    p3.start()
    url = "https://www.alumeco.dk/viden-og-teknik/metalpriser/zink?s=0"
    url = requests.get(url)
    p4 = Process(target=scraper(url,"zink"))
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
CodePudding user response:
To get any real benefit from parallelization here, you need to move the requests.get() call into the scraper function. Almost all of the time is spent on network requests, and parallelizing the CPU-bound parts doesn't help when almost no time is spent in them; as written, the four requests are made sequentially in the parent before each process even starts. Note also that Process(target=scraper(url,"alu")) calls scraper immediately in the parent and passes its return value (None) as the target; it would need to be Process(target=scraper, args=(url, "alu")) for the work to happen in a child process at all.
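A minimal sketch of that refactor (assuming the same selector and page structure as above; returning the parsed prices rather than appending to globals is my addition, since separate processes would not share those lists anyway):

import requests
from bs4 import BeautifulSoup

def scraper(url, metal):
    # The slow network call now happens inside the worker, so several
    # workers can have requests in flight at the same time.
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    samples = soup.find_all("td", style="text-align:center;white-space:nowrap;border-left:solid black 1px")
    prices = []
    for cell in samples:
        for value in cell:  # iterate the cell's children, as in the original
            if value.startswith("$"):
                prices.append(int(value.replace(" ", "").replace("$", "")))
    # Return the results instead of mutating globals.
    return metal, prices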
That said, multiprocessing is also the wrong tool for this particular job: since the work is I/O-bound, you pay more in serialization/deserialization costs than you gain by avoiding GIL contention. Use threading instead.
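For example, with a thread pool (a sketch, assuming the refactored scraper above that fetches its own URL and returns its results):

from concurrent.futures import ThreadPoolExecutor

urls = {
    "alu": "https://www.alumeco.dk/viden-og-teknik/metalpriser/aluminiumpriser?s=0",
    "kobber": "https://www.alumeco.dk/viden-og-teknik/metalpriser/kobber?s=0",
    "messing": "https://www.alumeco.dk/viden-og-teknik/metalpriser/metal-priser-mp58?s=0",
    "zink": "https://www.alumeco.dk/viden-og-teknik/metalpriser/zink?s=0",
}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submit all four scrapes; the GIL is released while waiting on I/O,
        # so the requests genuinely overlap.
        futures = [pool.submit(scraper, url, metal) for metal, url in urls.items()]
        results = dict(f.result() for f in futures)
    print(results)

All four requests are then in flight at once, so the total run time approaches that of the slowest single request rather than the sum of all four.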