Scraping thousands of URLs


I have a function that scrapes a list of URLs (200k of them), and it takes a lot of time. Is there a way to speed up this process?

import requests
from bs4 import BeautifulSoup

def get_odds(ids):
    headers = {
        "Referer": "https://www.betexplorer.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    }
    s = requests.Session()
    matches = []
    for id in ids:
        url = f'https://www.betexplorer.com{id}'
        response = s.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        season = url.split('/')[5]

        # do stuff...

ids is a list:

['/soccer/england/premier-league/brentford-norwich/vyLXXRcE/',
 ...]

CodePudding user response:

Yes, you can use multiprocessing.

Something like:

from multiprocessing import Pool

if __name__ == "__main__":
    workers = 10  # the number of concurrent requests (Pool spawns processes, not threads)
    with Pool(workers) as p:
        results = p.map(get_odds, ids)  # blocks until every id has been processed

Where ids is the list of ids, and get_odds is the function you supplied, modified to operate on JUST ONE of the ids. Keep in mind that you will be hitting their server with 10 requests at a time, which can lead to temporary IP blocking (you are seen as hostile). You should be mindful of that and adjust the pool size and/or add sleep() logic; a sketch of the sleep idea follows.
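
Purely as an illustration of that sleep() idea (the delay values here are arbitrary assumptions, tune them to what the site tolerates), you could wrap the request in a small helper and call polite_get(s, url, headers) inside get_odds instead of s.get(url, headers=headers):

import time
import random

def polite_get(session, url, headers):
    # Wait a short, slightly randomized interval before each request
    # so 10 workers don't hit the server at full speed.
    time.sleep(0.5 + random.random())
    return session.get(url, headers=headers)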

The get_odds function should look like this:

import requests
from bs4 import BeautifulSoup

def get_odds(id):
    headers = {
        "Referer": "https://www.betexplorer.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    }
    s = requests.Session()
    matches = []
    url = f'https://www.betexplorer.com{id}'
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    season = url.split('/')[5]

    # do stuff...

CodePudding user response:

You could let the requests run in parallel via multithreading, e.g. by creating 10 threads and using the ending of each id (0, 1, 2, 3, ...) to decide which thread should scrape it. This only works well with enough computing power and a stable internet connection.

EDIT: Since ids is a list, check each id's index instead to identify which thread should scrape which website.
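
In practice, the standard library's concurrent.futures can distribute the ids across threads for you, so there is no need to assign work by index manually. A minimal sketch of that approach (the fetch_one worker, the timeout, and max_workers=10 are illustrative assumptions, not from the answer above):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "Referer": "https://www.betexplorer.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
}

def fetch_one(id):
    # Download and parse a single match page.
    url = f'https://www.betexplorer.com{id}'
    response = requests.get(url, headers=HEADERS, timeout=30)
    return BeautifulSoup(response.text, 'html.parser')

if __name__ == "__main__":
    ids = ['/soccer/england/premier-league/brentford-norwich/vyLXXRcE/']  # the list from the question
    with ThreadPoolExecutor(max_workers=10) as executor:
        # map() spreads the ids across the threads and returns
        # the parsed pages in the original order.
        soups = list(executor.map(fetch_one, ids))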
