I have 60 residential proxies (with username and password), and I want to scrape 10000 webpages. I want to rotate through the IPs so that each thread uses 1 IP and makes 1 request per second: every second, 60 threads run, each scraping 1 page.
But I just can't do it.
The best I was able to do is the program below. It uses 1 IP per thread, but only for 60 pages. I want it to continue until all 10000 pages are scraped.
How can I do that? Would asyncio be a better choice?
import threading
import requests
import time
import lxml.html
import csv
from concurrent.futures import ThreadPoolExecutor

def scrape_page(html, url):
    # SCRAPE STUFF FROM URL
    return LIST

def download(url, proxy):
    try:
        proxy = {"https": proxy, "http": proxy}
        r = requests.get(url, proxies=proxy, stream=True)
        r.raw.decode_content = True
        time.sleep(1)
    except Exception as err:
        print(url, "503")
        return []  # return an empty row instead of touching the undefined response
    return scrape_page(r.text, url)

websites = LIST WITH 10000 SITES
ROTATING_PROXY_LIST = LIST WITH 60 PROXIES

with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, ROTATING_PROXY_LIST):
        data.append(result)

with open("results.csv", "w", newline="\n", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(data)
CodePudding user response:
The problem is that when you write this:

executor.map(download, websites, ROTATING_PROXY_LIST)

you're effectively asking for zip(websites, ROTATING_PROXY_LIST), which will only ever be as long as the shortest iterable.
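A minimal sketch of that truncation, with toy lists standing in for your 10000 pages and 60 proxies:

urls = ["a", "b", "c", "d", "e"]   # stand-in for the 10000 pages
proxies = ["p1", "p2"]             # stand-in for the 60 proxies

# zip (and therefore executor.map) stops as soon as the shorter
# iterable is exhausted, so only the first two urls get scraped
print(list(zip(urls, proxies)))
# [('a', 'p1'), ('b', 'p2')]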
You can solve this by making ROTATING_PROXY_LIST effectively infinite:
import itertools
.
.
.
with ThreadPoolExecutor(max_workers=60) as executor:
    data = []
    for result in executor.map(download, websites, itertools.cycle(ROTATING_PROXY_LIST)):
        data.append(result)
itertools.cycle will "Return elements from the iterable until it is exhausted. Then repeat the sequence indefinitely."
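To see the pairing executor.map now receives, here's a small sketch reusing the same toy lists (hypothetical names, not your real data):

import itertools

urls = ["a", "b", "c", "d", "e"]   # stand-in for the 10000 pages
proxies = ["p1", "p2"]             # stand-in for the 60 proxies

# cycling the proxies pairs every url with a proxy, round-robin
for url, proxy in zip(urls, itertools.cycle(proxies)):
    print(url, proxy)
# a p1
# b p2
# c p1
# d p2
# e p1

Note that itertools.cycle yields forever; the loop (and executor.map) only terminates because websites is finite, so never call list() on the cycle itself.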