I have 60 residential proxies (with username and password), and I want to scrape 10000 webpages. I want to rotate through the IPs so that each thread uses 1 IP and makes 1 request per second: every second, 60 threads run, each scraping 1 page.
But I just can't do it.
The best I was able to do is the program below. It uses 1 IP per thread, but only for 60 pages. I want it to continue until all 10000 pages are scraped.
How can I do that? Would asyncio be a better choice?
import threading
import requests
import time
import lxml.html
import csv
from concurrent.futures import ThreadPoolExecutor

def scrape_page(html, url):
    # SCRAPE STUFF FROM URL
    return LIST

def download(url, proxy):
    try:
        proxy = {"https": proxy, "http": proxy}
        r = requests.get(url, proxies=proxy, stream=True)
        r.raw.decode_content = True
        time.sleep(1)
    except Exception as err:
        print(url, "503")
        return []  # return an empty row instead of touching the undefined response
    return scrape_page(r.text, url)

websites = LIST WITH 10000 SITES
ROTATING_PROXY_LIST = LIST WITH 60 PROXIES

with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, ROTATING_PROXY_LIST):
        data.append(result)

with open("results.csv", "w", newline="\n", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(data)
CodePudding user response:
The problem is that when you write this:

executor.map(download, websites, ROTATING_PROXY_LIST)

you're effectively asking for zip(websites, ROTATING_PROXY_LIST), which will only ever be as long as the shortest iterable.
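A minimal sketch of that truncation, with toy lists standing in for your 10000 pages and 60 proxies:

urls = ["a", "b", "c", "d", "e"]   # stand-in for the 10000 pages
proxies = ["p1", "p2"]             # stand-in for the 60 proxies

# zip (and therefore executor.map) stops as soon as the shorter
# iterable is exhausted, so only the first two urls get scraped
print(list(zip(urls, proxies)))
# [('a', 'p1'), ('b', 'p2')]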
You can solve this by making ROTATING_PROXY_LIST effectively infinite:
import itertools
.
.
.
with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, itertools.cycle(ROTATING_PROXY_LIST)):
        data.append(result)
itertools.cycle will "Return elements from the iterable until it is exhausted. Then repeat the sequence indefinitely."
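To see the pairing executor.map now receives, here's a small sketch reusing the same toy lists (hypothetical names, not your real data):

import itertools

urls = ["a", "b", "c", "d", "e"]   # stand-in for the 10000 pages
proxies = ["p1", "p2"]             # stand-in for the 60 proxies

# cycling the proxies pairs every url with a proxy, round-robin
for url, proxy in zip(urls, itertools.cycle(proxies)):
    print(url, proxy)
# a p1
# b p2
# c p1
# d p2
# e p1

Note that itertools.cycle yields forever; the loop (and executor.map) only terminates because websites is finite, so never call list() on the cycle itself.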