How do I limit the rate of a scraper?


So I am trying to build a table by scraping hundreds of similar pages at a time and then saving them into the same Excel file, with something like this:

#let urls be a list of hundreds of different URLs
def save_table(urls):
   <define columns and parameters of the dataframe to be saved, df>
   writer = pd.ExcelWriter(<address>, engine = 'xlsxwriter')
   for i in range(0, len(urls)):
       #here, return_html_soup is the function returning the html soup of any individual URL 
       soup = return_html_soup(urls[i])
       temp_table = some_function(soup)
       df = df.append(temp_table, ignore_index = True)

   #I chose to_excel instead of to_csv here because there are certain letters on the 
   #original website that don't show up in a CSV
   df.to_excel(writer, sheet_name = <some name>)
   writer.save()
   writer.close()

I now hit HTTP Error 429: Too Many Requests, and the response has no Retry-After header.

Is there a way for me to get around this? I know this error happens because I've asked for too many pages in too short an interval. Is there a way to limit the rate at which my code opens links?

CodePudding user response:

The official Python documentation is the best place to start: https://docs.python.org/3/library/time.html#time.sleep

Here is an example using a 5-second delay, but you can adjust it to your needs and the restrictions you are under.

import time
import pandas as pd


#let urls be a list of hundreds of different URLs
def save_table(urls):
   <define columns and parameters of the dataframe to be saved, df>
   writer = pd.ExcelWriter(<address>, engine = 'xlsxwriter')
   for i in range(0, len(urls)):
       #here, return_html_soup is the function returning the html soup of any individual URL 
       soup = return_html_soup(urls[i])
       temp_table = some_function(soup)
       df = df.append(temp_table, ignore_index = True)

       #New code: wait for some time between requests
       time.sleep(5)

   #I chose to_excel instead of to_csv here because there are certain letters on the 
   #original website that don't show up in a CSV
   df.to_excel(writer, sheet_name = <some name>)
   writer.save()
   writer.close()
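
If a fixed delay still isn't enough, another common approach is to retry with an increasing delay whenever a 429 comes back. The snippet below is only a sketch of that idea, not the original poster's code: it assumes the pages are fetched with requests and parsed with BeautifulSoup, and the names fetch_soup_with_backoff, max_retries, and base_delay are illustrative.

import time

import requests
from bs4 import BeautifulSoup


def fetch_soup_with_backoff(url, max_retries=5, base_delay=5):
    #Fetch one URL, backing off and retrying whenever the server answers 429
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        #No Retry-After header is sent, so fall back to an exponential delay:
        #5 s, 10 s, 20 s, ...
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f'Still rate-limited after {max_retries} attempts: {url}')

You could then call fetch_soup_with_backoff(urls[i]) in place of return_html_soup(urls[i]) inside the loop and keep the rest of the function unchanged.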

CodePudding user response:

The way I did it: scrape the whole element once and then parse through it by header, tag, or name. I used bs4 with robinstocks for market data; it runs every 10 minutes or so and works fine, specifically the get_element_by_name functionality. Or maybe just use a time delay from the time lib, as sketched below.
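
As a rough sketch of that idea (the robinstocks-based code isn't shown here, so every name below is hypothetical): fetch each page once, pull all the fields you need out of that single soup object, and sleep between pages instead of requesting the same page repeatedly.

import time

import requests
from bs4 import BeautifulSoup


def scrape_pages(urls, delay=10):
    rows = []
    for url in urls:
        #One request per page; everything else is parsed from this soup
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        title = soup.find('h1')
        cells = [td.get_text(strip=True) for td in soup.find_all('td')]
        rows.append({
            'url': url,
            'title': title.get_text(strip=True) if title else None,
            'cells': cells,
        })
        #Spread the requests out so the server is not hit too quickly
        time.sleep(delay)
    return rows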
