How to parse the response from Grequests faster?


I want to scrape multiple URLs and parse them as quickly as possible, but the for loop is not fast enough for me. Is there a way to speed this up, maybe with asyncio, multiprocessing, or multithreading?

import grequests
from bs4 import BeautifulSoup

links1 = []  # multiple links

while True:
    try:
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)

        # I want to run this for loop as quickly as possible. Is that possible?
        for r in resp:
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception:
        continue  # except clause assumed; the original snippet leaves it off

CodePudding user response:

Here is an example of how to use multiprocessing with requests/BeautifulSoup:

import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fernández_de_Cofiño",
] * 250


def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)


if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)

    print(len(out))
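
By default Pool() starts one worker process per CPU core, and p.imap yields the results one at a time in input order, which is what lets tqdm show live progress as the pages come back.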

With my internet connection/CPU (Ryzen 3700x) I was able to get results from all 1000 links in 30 seconds:

100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000

All my CPU cores were utilized while it ran (checked with htop).
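
Since the downloads are I/O-bound, the multithreading mentioned in the question also works: a thread pool overlaps the network waits, although the BeautifulSoup parsing itself is not parallelized because of the GIL. A minimal sketch reusing the same parse() helper (max_workers=25 just mirrors the size=25 from the question, not a tuned value):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# placeholder list; fill with the real URLs
links1 = ["https://en.wikipedia.org/wiki/COVID-19_pandemic"] * 100


def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)


if __name__ == "__main__":
    # 25 threads download concurrently; each thread parses the page it fetched
    with ThreadPoolExecutor(max_workers=25) as executor:
        out = list(executor.map(parse, links1))
    print(len(out))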

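If you would rather keep the grequests loop exactly as written, another lever (a sketch, not benchmarked) is to hand BeautifulSoup a SoupStrainer so it only builds the div.txt elements instead of the whole document tree:

import grequests
from bs4 import BeautifulSoup, SoupStrainer

links1 = []  # multiple links, as in the question

# build only the <div class="txt"> elements instead of the full tree
only_txt_divs = SoupStrainer('div', class_='txt')

reqs = (grequests.get(link) for link in links1)
for r in grequests.imap(reqs, size=25, stream=False):
    soup = BeautifulSoup(r.text, 'lxml', parse_only=only_txt_divs)
    parse = soup.find('div', class_='txt')

parse_only works with the lxml and html.parser parsers (not html5lib) and mainly saves the time and memory spent turning tags you never look at into tree objects.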
