I'm currently using requests-futures for faster web scraping. The problem is that it's still very slow: around one request every other second. Here's how the ThreadPoolExecutor looks:
from concurrent.futures import ThreadPoolExecutor, as_completed
from requests_futures.sessions import FuturesSession
import multiprocessing
import logging
import random
import requests

with FuturesSession(executor=ThreadPoolExecutor(max_workers=8)) as session:
    futures = {
        session.get(url, proxies={
            'http': str(random.choice(proxy_list).replace("https:/", "http:/")),
            'https': str(random.choice(proxy_list).replace("https:/", "http:/")),
        }, headers={
            'User-Agent': str(ua.chrome),
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Content-Type': 'text/plain;charset=UTF-8',
        }): url for url in url_list
    }

    # ---
    for future in as_completed(futures):
        del futures[future]
        try:
            resp = future.result()
        except:
            print("Error getting result from thread. Ignoring")
        try:
            multiprocessing.Process(target=main_func, args=(resp,))
            del resp
            del future
        except requests.exceptions.JSONDecodeError:
            logging.warning(
                "[requests.custom.debug]: requests.exceptions.JSONDecodeError: [Error] print(resp.json())")
I believe it's slow because of the as_completed for loop, since that loop doesn't run concurrently. As for the main_func I pass the response to, that's the function that extracts the information from the site using bs4. If the as_completed loop ran concurrently as well, the whole thing would be faster than it is now. I'd really like the scraper to be faster, and I'd prefer to keep using requests-futures, but if there's something that's a lot faster I'd be happy to switch. So if anyone knows something that's quite a lot faster than requests-futures, please feel free to share it.
Is anyone able to help with this? Thank you
CodePudding user response:
Here's a restructure of the code which should help:
import requests
from concurrent.futures import ProcessPoolExecutor
import random

proxy_list = [
    'http://107.151.182.247:80',
    'http://194.5.193.183:80',
    'http://88.198.50.103:8080',
    'http://88.198.24.108:8080',
    'http://64.44.164.254:80',
    'http://47.74.152.29:8888',
    'http://176.9.75.42:8080']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Content-Type': 'text/plain;charset=UTF-8'}

url_list = ['http://www.google.com', 'http://facebook.com', 'http://twitter.com']


def process(url):
    # pick one proxy at random and build matching http/https entries
    proxy = random.choice(proxy_list)
    https = proxy.replace('http:', 'https:')
    http = proxy.replace('https:', 'http:')
    proxies = {'http': http, 'https': https}
    try:
        (r := requests.get(url, proxies=proxies, headers=headers)).raise_for_status()
        # call main_func here
    except Exception as e:
        return e
    return 'OK'


def main():
    # each URL is fetched (and can be parsed) in its own worker process
    with ProcessPoolExecutor() as executor:
        for result in executor.map(process, url_list):
            print(result)


if __name__ == '__main__':
    main()
The proxy_list may not work for you; use your own proxy list. Likewise, the url_list obviously will not match yours.
The point of this is that each URL is handled in its own process. There really is no need to mix threads and processes in this scenario especially as it adds a degree of synchronicity while you wait for threads to complete before running a sub-process.
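If it helps, here is a minimal sketch of how the bs4 step could be folded into the worker, so the download and the parsing both happen inside the pool process. The body of main_func below is a hypothetical stand-in (the question only says it parses the response with bs4), and the timeout value is an assumption:
from bs4 import BeautifulSoup

def main_func(resp):
    # hypothetical parsing step; substitute your real bs4 logic here
    soup = BeautifulSoup(resp.text, 'html.parser')
    return soup.title.string if soup.title else None

def process(url):
    # variant of process() above that returns parsed data instead of 'OK'
    proxy = random.choice(proxy_list)
    proxies = {'http': proxy.replace('https:', 'http:'),
               'https': proxy.replace('http:', 'https:')}
    try:
        r = requests.get(url, proxies=proxies, headers=headers, timeout=10)
        r.raise_for_status()
        return main_func(r)
    except Exception as e:
        return e
With that layout, executor.map hands back whatever main_func extracts, so the parent process only collects parsed results rather than raw responses.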