I am not familiar with multithreading or how to apply it here. BeautifulSoup scrapes the data slowly, so can you tell me how to apply multithreading to my code? This is the page link: https://baroul-timis.ro/tabloul-avocatilor/
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
# Collect the profile links from the JSON endpoint.
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "html.parser").a["href"]
    productlink.append(base_url + link)

# Scrape name, phone and e-mail from each profile page.
test = []
for link in productlink:
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        spans = pip.select('span.font-weight-bolder')
        try:
            wev['phone'] = spans[2].text.split('\xa0')
        except IndexError:
            wev['phone'] = None
        try:
            wev['email'] = spans[3].text.split('\xa0')
        except IndexError:
            wev['email'] = None
        test.append(wev)

df = pd.DataFrame(test)
print(df)
CodePudding user response:
Multithreading is ideal for this kind of thing because there will be lots of I/O waits while the URLs are accessed and their data acquired. Here's how you could re-work it:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
test = []

def process(link):
    # Scrape name, phone and e-mail from one profile page.
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        spans = pip.select('span.font-weight-bolder')
        try:
            wev['phone'] = spans[2].text.split('\xa0')
        except IndexError:
            wev['phone'] = None
        try:
            wev['email'] = spans[3].text.split('\xa0')
        except IndexError:
            wev['email'] = None
        test.append(wev)  # list.append is thread-safe in CPython

# Collect the profile links from the JSON endpoint.
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "lxml").a["href"]
    productlink.append(base_url + link)

# Fetch all profile pages concurrently.
with ThreadPoolExecutor() as executor:
    executor.map(process, productlink)

df = pd.DataFrame(test)
print(df)
This generates a 941-row dataframe in under 44 seconds on my system (24 threads), i.e. ~20 URLs/second.
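If you want to tune the level of concurrency, ThreadPoolExecutor takes a max_workers argument (on Python 3.8+ the default is min(32, os.cpu_count() + 4)). A minimal sketch, reusing the process function and productlink list from the code above:

# Cap the pool at 24 worker threads instead of using the default.
with ThreadPoolExecutor(max_workers=24) as executor:
    executor.map(process, productlink)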
Note: if you don't already have lxml installed, you'll need it; it's generally faster than html.parser.
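If you'd rather keep the script runnable on machines without lxml, a small fallback (just a sketch, not part of the answer above) can pick the parser at import time:

from bs4 import BeautifulSoup

# Prefer the faster lxml parser if it is installed, otherwise fall back to html.parser.
try:
    import lxml  # noqa: F401
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

soup = BeautifulSoup("<div><h4>Example</h4></div>", PARSER)
print(soup.h4.text)  # Example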
CodePudding user response:
You can use ThreadPoolExecutor if you want to use threading.
from concurrent.futures import ThreadPoolExecutor

links = [...]  # All your product URLs go here.

def do_work(link):
    ...
    # Write code to process one URL here.

# Run 8 threads at a time.
executor = ThreadPoolExecutor(8)
results = executor.map(do_work, links)
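executor.map returns the workers' return values (lazily, in input order), so if do_work returns a small dict per URL you can gather everything straight into a dataframe. A sketch with hypothetical fields:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def do_work(link):
    # Hypothetical worker: fetch and parse the page, return a small dict of fields.
    return {"url": link, "status": "ok"}

links = ["https://example.com/a", "https://example.com/b"]

with ThreadPoolExecutor(8) as executor:
    rows = list(executor.map(do_work, links))  # map preserves the order of `links`

df = pd.DataFrame(rows)
print(df)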
I recommend using ProcessPoolExecutor. It works not only for I/O-bound but also for CPU-bound tasks, and it will use all of your CPU cores.
from concurrent.futures import ProcessPoolExecutor

links = [...]  # All your product URLs go here.

def do_work(link):
    ...
    # Write code to process one URL here.

# Run 8 processes in parallel.
executor = ProcessPoolExecutor(8)
results = executor.map(do_work, links, chunksize=40)
Here results is an iterator over the return values of the do_work function. It's better not to return huge objects from the workers: serializing that data back to the main process will make things very slow. Instead, save it to a database or a file.
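A minimal sketch of that idea, assuming each worker appends its row to a per-process CSV file (file names and fields are illustrative) and returns only the processed URL:

import csv
import os
from concurrent.futures import ProcessPoolExecutor

def do_work(link):
    # Hypothetical worker: scrape the page, then write the row to a per-process CSV
    # instead of returning a large object that would have to be pickled back.
    row = {"url": link, "title": "..."}
    path = f"scraped_{os.getpid()}.csv"
    first_write = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if first_write:
            writer.writeheader()
        writer.writerow(row)
    return link  # keep the return value small

if __name__ == "__main__":  # required when running ProcessPoolExecutor as a script
    links = ["https://example.com/a", "https://example.com/b"]
    with ProcessPoolExecutor(8) as executor:
        done = list(executor.map(do_work, links, chunksize=40))
    print(f"processed {len(done)} links")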
Read more about concurrent.futures in the Python documentation.