How can I apply multithreading to speed up BeautifulSoup scraping?

Time:06-16

I am not familiar with multithreading. BeautifulSoup scrapes the data slowly, so can someone tell me how to apply multithreading to my code to scrape the data faster? This is the page link: https://baroul-timis.ro/tabloul-avocatilor/

import requests
from bs4 import BeautifulSoup
import pandas as pd



url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"

base_url= 'https://baroul-timis.ro'

headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Build the list of individual profile URLs from the JSON endpoint.
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "html.parser").a["href"]
    productlink.append(base_url + link)

# Scrape every profile page one by one (this is the slow part).
test = []
for link in productlink:
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text

        phone = ''
        try:
            phone = pip.select('span', class_='font-weight-bolder')[2].text
        except:
            pass
        wev['phone'] = phone.split('\xa0')

        email = ''
        try:
            email = pip.select('span', class_='font-weight-bolder')[3].text
        except:
            pass
        wev['email'] = email.split('\xa0')

        test.append(wev)

df = pd.DataFrame(test)
print(df)

CodePudding user response:

Multithreading is ideal for this kind of thing because there will be lots of I/O waits while the URLs are accessed and their data acquired. Here's how you could re-work it:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url= 'https://baroul-timis.ro'

headers ={
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

test = []   # shared list the worker threads append to (list.append is thread-safe in CPython)

def process(link):
    # Fetch and parse one profile page.
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text

        phone = ''
        try:
            phone = pip.select('span', class_='font-weight-bolder')[2].text
        except:
            pass
        wev['phone'] = phone.split('\xa0')

        email = ''
        try:
            email = pip.select('span', class_='font-weight-bolder')[3].text
        except:
            pass
        wev['email'] = email.split('\xa0')

        test.append(wev)

# Build the list of profile URLs from the JSON endpoint.
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "lxml").a["href"]
    productlink.append(base_url + link)

# Fetch and parse all profile pages concurrently.
with ThreadPoolExecutor() as executor:
    executor.map(process, productlink)

df = pd.DataFrame(test)
print(df)

This generates a 941-row dataframe in under 44 seconds on my system (24 threads), i.e. roughly 20 URLs/second.

Note: if you don't already have lxml installed, you'll need it; it's generally faster than html.parser.
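
As a design note, here is a variant of the same approach (just a sketch, not part of the original answer; it assumes the requests/BeautifulSoup/pandas imports, the headers dict and the productlink list built above, and only extracts the title field for brevity). Each worker returns its rows instead of appending to a shared list, so executor.map can collect the results directly:

from concurrent.futures import ThreadPoolExecutor

def scrape_page(link):
    # Return this page's rows instead of appending to a shared list.
    rows = []
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    for pip in soup.find_all('div', class_='user-info text-left mb-50'):
        rows.append({'title': pip.find('h4').text})
    return rows

with ThreadPoolExecutor() as executor:
    results = executor.map(scrape_page, productlink)

# Flatten the per-page row lists into a single dataframe.
df = pd.DataFrame([row for rows in results for row in rows])
print(df)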

CodePudding user response:

You can use ThreadPoolExecutor if you want to use threading.

from concurrent.futures import ThreadPoolExecutor

links = [...]  # All your product URLs go here.
def do_work(link):
    ...
    # Write the code that processes one URL here.
# Run up to 8 threads at a time.
executor = ThreadPoolExecutor(8)
results = executor.map(do_work, links)
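
For instance, here is a minimal, self-contained illustration of that pattern (with a stand-in do_work rather than the real scraping code); note that map yields the results in the same order as the input links:

from concurrent.futures import ThreadPoolExecutor
import time

def do_work(link):
    # Stand-in for fetching and parsing one URL.
    time.sleep(0.1)
    return f"processed {link}"

links = [f"https://example.com/page/{i}" for i in range(20)]

executor = ThreadPoolExecutor(8)
results = executor.map(do_work, links)  # yields results in input order
for r in results:
    print(r)
executor.shutdown()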

I recommend using ProcessPoolExecutor. It works not only for I/O-bound but also for CPU-bound tasks, and it will use all of your CPU cores.

from concurrent.futures import ProcessPoolExecutor

links = [...]  # All your product URLs go here.
def do_work(link):
    ...
    # Write the code that processes one URL here.
# Run 8 processes in parallel.
executor = ProcessPoolExecutor(8)
results = executor.map(do_work, links, chunksize=40)

Here, results is an iterator over the return values of do_work. It's better not to return large objects from the workers: serializing that data between processes will make everything very slow. Instead, save it to a database or a file from inside the worker.
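
A minimal sketch of that advice (hypothetical do_work and output directory, not code from the answer): each worker writes its own small JSON file and returns only the path, so almost nothing has to be pickled back to the parent. Note that ProcessPoolExecutor needs the if __name__ == "__main__": guard when processes are started with spawn (the default on Windows and recent macOS):

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import json

OUT_DIR = Path("scraped")  # hypothetical output directory

def do_work(link):
    record = {"url": link, "title": "..."}      # pretend this is a large scraped record
    slug = link.rstrip("/").rsplit("/", 1)[-1]  # e.g. "17" from .../page/17
    out_file = OUT_DIR / f"{slug}.json"
    out_file.write_text(json.dumps(record))     # save the heavy data inside the worker
    return str(out_file)                        # return only a tiny value

if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    links = [f"https://example.com/page/{i}" for i in range(100)]
    with ProcessPoolExecutor(8) as executor:
        for path in executor.map(do_work, links, chunksize=40):
            print("wrote", path)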

Read more about concurrent.futures in the Python documentation.
