I am working on a web scraping project willing to take prices from a website using different urls. I have run the following code but it takes so long to print the price number. I am using PyCharm on a MacBook Pro 13'' i5 (2020) 1.4 GHz and 8GB RAM, if this can help.
import ssl
import bs4
from urllib.request import Request, urlopen
import json
#to avoid SSL verification
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#Define the url to monitor
urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']
for i in urls:
#Open the url to monitor using a new user agent to avoid website blocks you
req = Request(
url=i,
headers={'User-Agent': 'Mozilla/5.0'}
)
#Read the HTML code of the url
webpage = urlopen(req, context=ctx).read()
soup = bs4.BeautifulSoup(webpage, "html.parser")
#Define the HTML element we need to screen and find prices (this time using Javascript)
data = json.loads(soup.find_all('script', {'type': 'application/ld json'})[-1].get_text())
price = int(data['offers']['price'])
print(price)
Using only one url, the code works, but adding other urls and a simple for loop, it takes a while. How could I speed up the process? Thanks a lot!
CodePudding user response:
You can speed up the processing using multi-threading or multi-processing. This example will use multiprocessing
module (with Pool of 4 processes) to obtain the prices:
import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
def get_price(url):
soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
data = json.loads(soup.find_all('script', {'type': 'application/ld json'})[-1].get_text())
return url, int(data['offers']['price'])
if __name__ == '__main__':
urls = ['https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/', 'https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/']
with Pool(processes=4) as pool:
for url, price in pool.imap_unordered(get_price, urls):
print(url, price)
Prints (for example, the order could vary):
https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-t-smile-pendant-35189459/ 920
https://www.tiffany.co.uk/jewelry/necklaces-pendants/tiffany-hardwear-graduated-link-necklace-63008966/ 13900