I want to web scrape multiple URLs and parse them as quickly as possible, but the for loop is not fast enough for me. Is there a way to do this, maybe with asyncio, multiprocessing, or multithreading?
import grequests
from bs4 import BeautifulSoup

links1 = []  # multiple links

while True:
    try:
        reqs = (grequests.get(link) for link in links1)
        resp = grequests.imap(reqs, size=25, stream=False)
        for r in resp:  # I WANT TO RUN THIS FOR LOOP AS QUICKLY AS POSSIBLE - IS THAT POSSIBLE?
            soup = BeautifulSoup(r.text, 'lxml')
            parse = soup.find('div', class_='txt')
    except Exception:
        pass  # ignore request errors and retry
CodePudding user response:
Example of how to use multiprocessing with requests/BeautifulSoup:
import requests
from tqdm import tqdm  # for pretty progress bar
from bs4 import BeautifulSoup
from multiprocessing import Pool

# some 1000 links to analyze
links1 = [
    "https://en.wikipedia.org/wiki/2021_Moroccan_general_election",
    "https://en.wikipedia.org/wiki/Tangerang_prison_fire",
    "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "https://en.wikipedia.org/wiki/Yolanda_Fernández_de_Cofiño",
] * 250

def parse(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

if __name__ == "__main__":
    with Pool() as p:
        out = []
        for r in tqdm(p.imap(parse, links1), total=len(links1)):
            out.append(r)
        print(len(out))
With my internet connection/CPU (Ryzen 3700x) I was able to get results from all 1000 links in 30 seconds:
100%|██████████| 1000/1000 [00:30<00:00, 33.12it/s]
1000
All my CPUs were utilized (htop screenshot not shown here).
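The question also mentions multithreading. Since the work is mostly I/O-bound, threads are a reasonable alternative to processes; below is a minimal sketch (not part of the answer above) that reuses the same parse idea with concurrent.futures.ThreadPoolExecutor. The worker count of 25 and the sample link list are arbitrary choices:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# same style of link list as in the multiprocessing example
links1 = ["https://en.wikipedia.org/wiki/COVID-19_pandemic"] * 1000

def parse(url):
    # fetch the page and return the <h1> text, as in the answer above
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

if __name__ == "__main__":
    # 25 threads is an arbitrary choice; tune it to your connection and the target server
    with ThreadPoolExecutor(max_workers=25) as ex:
        out = list(ex.map(parse, links1))
    print(len(out))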
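For the asynchronous option mentioned in the question, here is a minimal sketch with asyncio and aiohttp (aiohttp is an extra dependency not used in the answer above; fetch_and_parse is a hypothetical helper name, and the limit of 25 concurrent requests is an arbitrary choice):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

links1 = ["https://en.wikipedia.org/wiki/COVID-19_pandemic"] * 1000  # sample links

async def fetch_and_parse(session, url, sem):
    # the semaphore caps the number of requests in flight at once
    async with sem:
        async with session.get(url) as resp:
            html = await resp.text()
    # parsing itself is CPU-bound and still runs synchronously
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one("h1").get_text(strip=True)

async def main():
    sem = asyncio.Semaphore(25)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_and_parse(session, url, sem) for url in links1]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    out = asyncio.run(main())
    print(len(out))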