I am running a Python script that scrapes through a list of links, but the process is too slow. I want to split the work across several processes so the script scrapes multiple links at once.
The list contains almost 5000 links.
Here is the code I want to run in parallel:
# links contains the list of links
import requests

def fun():
    for link in links:
        r = requests.get(link, timeout=5)
        # ... scraping code ...
CodePudding user response:
If you want to make more requests at the same time, you don't want to use requests, but aiohttp instead. That package allows you to make HTTP requests asynchronously.
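A minimal sketch of what that could look like, assuming the links list from the question; the fetch helper, the 5-second timeout, and the 10-connection limit are illustrative choices, not part of the original question:

import asyncio
import aiohttp

links = [ ... ]  # the ~5000 URLs from the question

async def fetch(session, url):
    # One GET request; errors are returned instead of raised so a single
    # bad link does not stop the whole batch.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            return await resp.text()
    except aiohttp.ClientError as exc:
        return exc

async def main():
    # Limit the number of simultaneous connections (10 is an arbitrary choice).
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, url) for url in links))

results = asyncio.run(main())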
CodePudding user response:
I suggest building a multithreaded program to make the requests. concurrent.futures is one of the easiest ways to multithread these kinds of requests, in particular using ThreadPoolExecutor. The documentation even includes a simple multithreaded URL-request example.
Here is sample code using bs4 and concurrent.futures:
import time
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

URLs = [ ... ]  # A long list of URLs.

def parse(url):
    # Fetch one page and return all of its anchor tags.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    return soup.find_all('a')

# Run 10 workers concurrently; adjust this number for your machine and workload.
with ThreadPoolExecutor(max_workers=10) as executor:
    start = time.time()
    futures = [executor.submit(parse, url) for url in URLs]
    results = []
    for future in as_completed(futures):
        results.append(future.result())  # .result() unwraps the completed Future
    end = time.time()

print("Time Taken: {:.6f}s".format(end - start))
Also, you may want to check out the Python Scrapy framework. It scrapes data concurrently and is easy to learn, and it comes with many features such as auto-throttling, rotating proxies and user agents, and easy integration with your database.
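For reference, a minimal Scrapy spider for this kind of job might look like the sketch below; the spider name, the start_urls placeholder, and the extracted field are illustrative assumptions:

import scrapy

class LinkSpider(scrapy.Spider):
    name = "link_spider"   # hypothetical spider name
    start_urls = [ ... ]   # the ~5000 links from the question

    # Scrapy fetches start_urls concurrently and calls parse() for each response.
    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield {'link': href}

Save it as spider.py and run it with: scrapy runspider spider.py -o links.json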